HTMLEOF
- name: Publish to ludwig-ai/schema
uses: cpina/github-action-push-to-another-repository@v1.7.2
env:
SSH_DEPLOY_KEY: ${{ secrets.SCHEMA_REPO_DEPLOY_KEY }}
with:
source-directory: schema-out
destination-github-username: ludwig-ai
destination-repository-name: schema
target-branch: main
commit-message: "Update Ludwig JSON schema from ${{ github.sha }}"
================================================
FILE: .github/workflows/test-results.yml
================================================
name: test results
on:
workflow_run:
workflows: ["pytest"]
types:
- completed
jobs:
test-results:
name: Test Results
runs-on: ubuntu-latest
if: github.event.workflow_run.conclusion != 'skipped'
steps:
- name: Download and Extract Artifacts
env:
GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
run: |
mkdir -p artifacts && cd artifacts
artifacts_url=${{ github.event.workflow_run.artifacts_url }}
gh api "$artifacts_url" -q '.artifacts[] | [.name, .archive_download_url] | @tsv' | while read artifact
do
IFS=$'\t' read name url <<< "$artifact"
gh api "$url" > "$name.zip"
unzip -d "$name" "$name.zip"
done
- name: Publish Unit Test Results
uses: EnricoMi/publish-unit-test-result-action@v2
with:
commit: ${{ github.event.workflow_run.head_sha }}
event_file: artifacts/Event File/event.json
event_name: ${{ github.event.workflow_run.event }}
files: "artifacts/**/*.xml"
================================================
FILE: .github/workflows/upload-pypi.yml
================================================
name: Upload to PyPI
on:
# Triggers the workflow when a release or draft of a release is published,
# or a pre-release is changed to a release
release:
types: [released]
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
jobs:
pypi-publish:
name: upload release to PyPI
runs-on: ubuntu-latest
# Specifying a GitHub environment is optional, but strongly encouraged
environment: release
permissions:
# IMPORTANT: this permission is mandatory for trusted publishing
id-token: write
steps:
- name: Checkout
uses: actions/checkout@v4
with:
submodules: "recursive"
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Build source distribution
run: |
pip install setuptools
python setup.py sdist
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
================================================
FILE: .gitignore
================================================
###################
# ludwig specific #
###################
*.lock_preprocessing
results/
ludwig/results/
results_*/
ludwig_arm64/
# ailabs-utils
ailabs_util
docker_assets
# data
mnist_data/
profile_images/
./profile_images/
###########
# General #
###########
# Mac stuff
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
env*
build/
develop-eggs/
dist/
downloads/
./downloads/
./dataset/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Data
*.csv
*.hdf5
*.meta.json
*.parquet
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# dotenv
.env
# virtualenv
.venv
venv*
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
# PyCharm
.idea
# ctags
tags
# examples
examples/*/data/
examples/*/visualizations/
# benchmarking configs
ludwig/benchmarking/configs/
# Aim tracking
.aim/
# Comet
.comet.config
# Test-generated artifacts (image/audio features)
*.png
*.wav
generated_audio/
generated_images/
================================================
FILE: .nojekyll
================================================
================================================
FILE: .pre-commit-config.yaml
================================================
# Apply to all files without committing:
# pre-commit run --all-files
# Apply to changed files:
# pre-commit run
# Update this file:
# pre-commit autoupdate
# Run a specific hook:
# pre-commit run <hook_id>
ci:
autofix_prs: true
autoupdate_commit_msg: "[pre-commit.ci] pre-commit suggestions"
autoupdate_schedule: weekly
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: check-ast
- id: fix-byte-order-marker
- id: check-case-conflict
- id: check-executables-have-shebangs
- id: check-json
- id: check-toml
- id: check-yaml
- id: debug-statements
- id: detect-private-key
- id: end-of-file-fixer
- id: trailing-whitespace
- id: mixed-line-ending
- repo: https://github.com/asottile/pyupgrade
rev: v3.21.2
hooks:
- id: pyupgrade
args: [--py312-plus]
- repo: https://github.com/PyCQA/docformatter
rev: v1.7.7
hooks:
- id: docformatter
args: [--in-place, --wrap-summaries=115, --wrap-descriptions=120]
- repo: https://github.com/PyCQA/isort
rev: 8.0.0
hooks:
- id: isort
name: Format imports
- repo: https://github.com/pycqa/flake8
rev: 7.3.0
hooks:
- id: flake8
- repo: https://github.com/psf/black
rev: 26.1.0
hooks:
- id: black
name: Format code
- repo: https://github.com/asottile/blacken-docs
rev: 1.20.0
hooks:
- id: blacken-docs
args: [--line-length=120]
- repo: https://github.com/hukkin/mdformat
rev: 1.0.0
hooks:
- id: mdformat
additional_dependencies:
- mdformat-gfm==1.0.0
- mdformat_frontmatter==2.0.10
exclude: CHANGELOG.md
- repo: https://github.com/yoheimuta/protolint
rev: v0.56.4
hooks:
- id: protolint
================================================
FILE: .protolint.yaml
================================================
# Adapted from
# https://github.com/yoheimuta/protolint/blob/master/_example/config/.protolint.yaml
---
# Lint directives.
lint:
# Linter files to walk.
files:
# The specific files to exclude.
exclude:
# NOTE: UNIX paths will be properly accepted by both UNIX and Windows.
- ../proto/invalidFileName.proto
# Linter rules.
# Run `protolint list` to see all available rules.
rules:
# Set the default to all linters. This option works the other way around as no_default does.
# If you want to enable this option, delete the comment out below and no_default.
# all_default: true
# The specific linters to add.
add:
- FIELD_NAMES_LOWER_SNAKE_CASE
- MESSAGE_NAMES_UPPER_CAMEL_CASE
- MAX_LINE_LENGTH
- INDENT
- FIELD_NAMES_EXCLUDE_PREPOSITIONS
- FILE_NAMES_LOWER_SNAKE_CASE
- IMPORTS_SORTED
- PACKAGE_NAME_LOWER_CASE
- ORDER
- PROTO3_FIELDS_AVOID_REQUIRED
- PROTO3_GROUPS_AVOID
- REPEATED_FIELD_NAMES_PLURALIZED
- QUOTE_CONSISTENT
# Linter rules option.
rules_option:
# MAX_LINE_LENGTH rule option.
max_line_length:
# Enforces a maximum line length
max_chars: 120
# Specifies the character count for tab characters
tab_chars: 2
# FILE_NAMES_LOWER_SNAKE_CASE rule option.
file_names_lower_snake_case:
excludes:
- ../proto/invalidFileName.proto
# QUOTE_CONSISTENT rule option.
quote_consistent:
# Available quote are "double" or "single".
quote: double
================================================
FILE: .vscode/settings.json
================================================
{
"editor.rulers": [
120
],
"editor.formatOnSave": true,
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter",
"editor.formatOnSave": true
},
"black-formatter.args": [
"--line-length",
"120"
],
"flake8.args": [
"--config=setup.cfg"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": false,
"python.envFile": "${workspaceFolder}/.env"
}
================================================
FILE: CODEOWNERS
================================================
# Default code owners for the entire repository
* @w4nderlust @tgaddair @justinxzhao @arnavgarg1 @geoffreyangus @jeffkinnison @Infernaught @alexsherstinsky
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of conduct
Ludwig adopts the [Linux Foundation code of conduct](https://lfprojects.org/policies/code-of-conduct/).
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing
Everyone is welcome to contribute, and we value everybody’s contribution. Code is thus not the only
way to help the community. Answering questions, helping others, reaching out and improving the
documentation are immensely valuable contributions as well.
It also helps us if you spread the word: reference the library from blog posts on the awesome
projects it made possible, shout out on X every time it has helped you, or simply star the
repo to say "thank you".
Check out the official [ludwig docs](https://ludwig-ai.github.io/ludwig-docs/) to get oriented
around the codebase, and join the community!
## Open Issues
Issues are listed at: <https://github.com/ludwig-ai/ludwig/issues>
If you would like to work on any of them, make sure it is not already assigned to someone else.
You can self-assign it by commenting on the Issue page with one of the keywords: `#take` or
`#self-assign`.
Work on your self-assigned issue and eventually create a Pull Request.
## Creating Pull Requests
1. Fork the [repository](https://github.com/ludwig-ai/ludwig) by clicking on the "Fork" button on
the repository's page. This creates a copy of the code under your GitHub user account.
1. Clone your fork to your local disk, and add the base repository as a remote:
```bash
git clone git@github.com:<your-github-username>/ludwig.git
cd ludwig
git remote add upstream https://github.com/ludwig-ai/ludwig.git
```
1. Create a new branch to hold your development changes:
```bash
git checkout -b a-descriptive-name-for-my-changes
```
**Do not** work on the `master` branch.
1. Set up a development environment by running the following command in a virtual environment:
```bash
pip install -e .
```
The above command installs only the packages in "requirements.txt", in developer (editable) mode. If you expect to
make changes across the overall Ludwig codebase, use the following command instead:
```bash
pip install -e .[full]
```
Please note that in certain Shell environments (e.g., the `Z shell`), the dependencies in brackets have to be quoted:
```bash
pip install -e ."[full]"
```
If you do not need access to the entire Ludwig codebase, but just want to be able to run `pytest` on the essential
functionality, then you would replace the above command with:
```bash
pip install -e .[test]
```
(Please use `pip install -e ."[test]"` where your Shell environment requires quotes around the square brackets.)
For the full list of the optional dependencies available in Ludwig, please see
[Installation Guide](https://ludwig.ai/latest/getting_started/installation/) and "setup.py" in the root of the Ludwig
repository.
1. On macOS with Apple Silicon, if this installation approach runs into errors, you may need to install the following
prerequisites:
```bash
brew install cmake libomp
```
This step requires `homebrew` to be installed on your development machine.
1. Install and run `pre-commit`:
```bash
pip install pre-commit
pre-commit install
```
1. Develop features on your branch.
1. Format your code by running pre-commits so that your newly added files look nice:
```bash
pre-commit run
```
Pre-commits also run automatically when committing.
1. Once you're happy with your changes, make a commit to record your changes locally:
```bash
git add .
git commit
```
It is a good idea to sync your copy of the code with the original repository regularly. This
way you can quickly account for changes:
```bash
git fetch upstream
git rebase upstream/master
```
Push the changes to your account using:
```bash
git push -u origin a-descriptive-name-for-my-changes
```
1. Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send
your contribution to the project maintainers for review.
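The quoting note in the setup step above comes down to how shells read square brackets: `[full]` is a glob character class, so `.[full]` is a pattern rather than a literal string. The following is a minimal Python sketch of that behaviour, using the standard-library `fnmatch` and `shlex` modules as a stand-in for the shell; it illustrates the idea and is not what pip or any shell runs internally.

```python
import fnmatch
import shlex

# The unquoted argument from the install command above.
requirement = ".[full]"

# As a glob, "[full]" is a character class that matches exactly one of
# the characters f, u, or l, so the pattern matches names such as ".f".
matches_dot_f = fnmatch.fnmatchcase(".f", requirement)

# The pattern does not match its own literal text, which is why a shell
# that insists on a glob match (such as zsh by default) reports an error.
matches_itself = fnmatch.fnmatchcase(requirement, requirement)

# shlex.quote shows a form that any POSIX-style shell passes through unchanged.
safe_argument = shlex.quote(requirement)

print(matches_dot_f, matches_itself, safe_argument)
```

Quoting only the bracketed part, as in `."[full]"`, has the same effect as quoting the whole argument.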
## Other tips
- Add unit tests for any new code you write.
- Make sure tests pass. See the [Developer Guide](https://ludwig-ai.github.io/ludwig-docs/latest/developer_guide/style_guidelines_and_tests/) for more details.
## Attribution
This contributing guideline is adapted from `huggingface`.
## Code of Conduct
Please be mindful of and adhere to the Linux Foundation's
[Code of Conduct](https://lfprojects.org/policies/code-of-conduct) when contributing to Ludwig.
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------
Code in ludwig/api_annotations.py adapted from
https://github.com/ray-project/ray (Apache-2.0 License)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------
Code in ludwig/utils/structural_warnings.py adapted from
https://github.com/ray-project/ray (Apache-2.0 License)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------
Code in ludwig/utils/logging_utils.py adapted from
https://github.com/ray-project/ray (Apache-2.0 License)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MANIFEST.in
================================================
include *.txt
recursive-include ludwig/datasets *.yaml
recursive-include ludwig/automl/defaults *.yaml
recursive-include ludwig/schema/metadata/configs *.yaml
================================================
FILE: NOTICE
================================================
Ludwig includes derived work from TensorFlow(https://github.com/tensorflow/tensorflow) under the Apache License 2.0:
Copyright 2016 The prometheus-operator Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The derived work can be found in the files: ludwig/models/modules/convolutional_modules.py
------
Ludwig includes derived work from Keras(https://github.com/keras-team/keras) under the MIT License:
COPYRIGHT
All contributions by François Chollet:
Copyright (c) 2015 - 2018, François Chollet.
All rights reserved.
All contributions by Google:
Copyright (c) 2015 - 2018, Google, Inc.
All rights reserved.
All contributions by Microsoft:
Copyright (c) 2017 - 2018, Microsoft, Inc.
All rights reserved.
All other contributions:
Copyright (c) 2015 - 2018, the respective contributors.
All rights reserved.
Each contributor holds copyright over their respective contributions.
The project versioning (Git) records all such contribution source information.
LICENSE
The MIT License (MIT)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
The derived work can be found in the files: mkdocs/code_docs_autogen.py
================================================
FILE: README.md
================================================
_Declarative deep learning framework built for scale and efficiency._
[PyPI version](https://badge.fury.io/py/ludwig)
[Discord](https://discord.gg/CBgdrGnZjy)
[DockerHub](https://hub.docker.com/r/ludwigai)
[Downloads](https://pepy.tech/project/ludwig)
[License](https://github.com/ludwig-ai/ludwig/blob/main/LICENSE)
[Twitter](https://twitter.com/ludwig_ai)
# 📖 What is Ludwig?
Ludwig is a **low-code** framework for building **custom** AI models like **LLMs** and other deep neural networks.
Key features:
- 🛠 **Build custom models with ease:** a declarative YAML configuration file is all you need to train a state-of-the-art LLM on your data. Support for multi-task and multi-modality learning. Comprehensive config validation detects invalid parameter combinations and prevents runtime failures.
- ⚡ **Optimized for scale and efficiency:** automatic batch size selection, distributed training ([DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed)), parameter efficient fine-tuning ([PEFT](https://github.com/huggingface/peft)), 4-bit quantization (QLoRA), paged and 8-bit optimizers, and larger-than-memory datasets.
- 📐 **Expert level control:** retain full control of your models down to the activation functions. Support for hyperparameter optimization, explainability, and rich metric visualizations.
- 🧱 **Modular and extensible:** experiment with different model architectures, tasks, features, and modalities with just a few parameter changes in the config. Think building blocks for deep learning.
- 🚢 **Engineered for production:** prebuilt [Docker](https://hub.docker.com/u/ludwigai) containers, native support for running with [Ray](https://www.ray.io/) on [Kubernetes](https://github.com/ray-project/kuberay), export models to [Torchscript](https://pytorch.org/docs/stable/jit.html) and [Triton](https://developer.nvidia.com/triton-inference-server), upload to [HuggingFace](https://huggingface.co/models) with one command.
Ludwig is hosted by the
[Linux Foundation AI & Data](https://lfaidata.foundation/).
**Tech stack:** Python 3.12 | PyTorch 2.6 | Pydantic 2 | Transformers 5 | Ray 2.54

# 💾 Installation
Install from PyPI. Be aware that Ludwig requires Python 3.12+.
```shell
pip install ludwig
```
Or install with all optional dependencies:
```shell
pip install "ludwig[full]"
```
Please see [contributing](https://github.com/ludwig-ai/ludwig/blob/main/CONTRIBUTING.md) for more detailed installation instructions.
# 🚂 Getting Started
Want to take a quick peek at some of Ludwig's features? Check out this Colab Notebook 🚀 [Open in Colab](https://colab.research.google.com/drive/1lB4ALmEyvcMycE3Mlnsd7I3bc0zxvk39)
Looking to fine-tune LLMs? Check out these notebooks:
1. Fine-Tune Llama-2-7b: [Open in Colab](https://colab.research.google.com/drive/1r4oSEwRJpYKBPM0M0RSh0pBEYK_gBKbe)
1. Fine-Tune Llama-2-13b: [Open in Colab](https://colab.research.google.com/drive/1zmSEzqZ7v4twBrXagj1TE_C--RNyVAyu)
1. Fine-Tune Mistral-7b: [Open in Colab](https://colab.research.google.com/drive/1i_8A1n__b7ljRWHzIsAdhO7u7r49vUm4)
For a full tutorial, check out the official [getting started guide](https://ludwig.ai/latest/getting_started/), or take a look at end-to-end [Examples](https://ludwig.ai/latest/examples).
## Large Language Model Fine-Tuning
[Open in Colab](https://colab.research.google.com/drive/1c3AO8l_H6V_x37RwQ8V7M6A-RmcBf2tG?usp=sharing)
Let's fine-tune a pretrained LLM to follow instructions like a chatbot ("instruction tuning").
### Prerequisites
- [HuggingFace API Token](https://huggingface.co/docs/hub/security-tokens)
- Access approval to your chosen base model (e.g., [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B))
- GPU with at least 12 GiB of VRAM (in our tests, we used an Nvidia T4)
### Running
We'll use the [Stanford Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) dataset, which will be formatted as a table-like file that looks like this:
| instruction | input | output |
| :-----------------------------------------------: | :--------------: | :-----------------------------------------------: |
| Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... |
| Arrange the items given below in the order to ... | cake, me, eating | I eating cake. |
| Write an introductory paragraph about a famous... | Michelle Obama | Michelle Obama is an inspirational woman who r... |
| ... | ... | ... |
Create a YAML config file named `model.yaml` with the following:
```yaml
model_type: llm
base_model: meta-llama/Llama-3.1-8B
quantization:
  bits: 4
adapter:
  type: lora
prompt:
  template: |
    Below is an instruction that describes a task, paired with an input that may provide further context.
    Write a response that appropriately completes the request.
    ### Instruction:
    {instruction}
    ### Input:
    {input}
    ### Response:
input_features:
  - name: prompt
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  gradient_accumulation_steps: 16
  epochs: 3
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.01
preprocessing:
  sample_ratio: 0.1
backend:
  type: local
```
And now let's train the model:
```bash
export HUGGING_FACE_HUB_TOKEN=""
ludwig train --config model.yaml --dataset "ludwig://alpaca"
```
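To make the `prompt.template` field more concrete, the sketch below shows how the `{instruction}` and `{input}` placeholders could be filled from one dataset row. It uses plain Python `str.format` purely as an illustration of the substitution; it is not Ludwig's internal implementation, and `render_prompt` is a hypothetical helper name.

```python
# Illustrative only: mimics filling the prompt template from one dataset row.
# `render_prompt` is a hypothetical helper, not part of Ludwig's API.
TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that may provide further context.\n"
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n"
    "### Input:\n{input}\n"
    "### Response:\n"
)


def render_prompt(row: dict) -> str:
    """Substitute the row's column values into the template placeholders."""
    return TEMPLATE.format(instruction=row["instruction"], input=row["input"])


row = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",  # the `input` column can be empty, as in the first table row
}
prompt = render_prompt(row)
print(prompt)
```

The rendered text is what the `prompt` input feature would carry in this sketch, while the `output` column serves as the training target.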
## Supervised ML
Let's build a neural network that predicts whether a given movie critic's review on [Rotten Tomatoes](https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset) was positive or negative.
Our dataset will be a CSV file that looks like this:
| movie_title | content_rating | genres | runtime | top_critic | review_content | recommended |
| :------------------: | :------------: | :------------------------------: | :-----: | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
| Deliver Us from Evil | R | Action & Adventure, Horror | 117.0 | TRUE | Director Scott Derrickson and his co-writer, Paul Harris Boardman, deliver a routine procedural with unremarkable frights. | 0 |
| Barbara | PG-13 | Art House & International, Drama | 105.0 | FALSE | Somehow, in this stirring narrative, Barbara manages to keep hold of her principles, and her humanity and courage, and battles to save a dissident teenage girl whose life the Communists are trying to destroy. | 1 |
| Horrible Bosses | R | Comedy | 98.0 | FALSE | These bosses cannot justify either murder or lasting comic memories, fatally compromising a farce that could have been great but ends up merely mediocre. | 0 |
| ... | ... | ... | ... | ... | ... | ... |
Download a sample of the dataset from [here](https://ludwig.ai/latest/data/rotten_tomatoes.csv).
```bash
wget https://ludwig.ai/latest/data/rotten_tomatoes.csv
```
Next create a YAML config file named `model.yaml` with the following:
```yaml
input_features:
  - name: genres
    type: set
    preprocessing:
      tokenizer: comma
  - name: content_rating
    type: category
  - name: top_critic
    type: binary
  - name: runtime
    type: number
  - name: review_content
    type: text
    encoder:
      type: embed
output_features:
  - name: recommended
    type: binary
```
That's it! Now let's train the model:
```bash
ludwig train --config model.yaml --dataset rotten_tomatoes.csv
```
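Before training, it can help to confirm that every feature named in the config actually exists as a column in the CSV. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper (not a Ludwig function), and the header line is a stand-in built from the column names in the table above rather than read from the downloaded file.

```python
import csv
import io

# Feature names copied from the `model.yaml` config above.
CONFIG = {
    "input_features": [
        {"name": "genres", "type": "set"},
        {"name": "content_rating", "type": "category"},
        {"name": "top_critic", "type": "binary"},
        {"name": "runtime", "type": "number"},
        {"name": "review_content", "type": "text"},
    ],
    "output_features": [{"name": "recommended", "type": "binary"}],
}


def missing_columns(config: dict, header: list[str]) -> list[str]:
    """Return config feature names that are absent from the CSV header."""
    features = config["input_features"] + config["output_features"]
    return [f["name"] for f in features if f["name"] not in header]


# Stand-in for the first line of rotten_tomatoes.csv.
sample = io.StringIO(
    "movie_title,content_rating,genres,runtime,top_critic,review_content,recommended\n"
)
header = next(csv.reader(sample))
print(missing_columns(CONFIG, header))
```

To check the real file, replace the `io.StringIO` stand-in with `open("rotten_tomatoes.csv", newline="")`. Note that `movie_title` is in the data but intentionally unused by the config.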
**Happy modeling**
Try applying Ludwig to your data. [Reach out on Discord](https://discord.gg/CBgdrGnZjy)
if you have any questions.
# ❓ Why you should use Ludwig
- **Minimal machine learning boilerplate**
Ludwig takes care of the engineering complexity of machine learning out of
the box, enabling research scientists to focus on building models at the
highest level of abstraction. Data preprocessing, hyperparameter
optimization, device management, and distributed training for
`torch.nn.Module` models come completely free.
- **Easily build your benchmarks**
Creating a state-of-the-art baseline and comparing it with a new model is a
simple config change.
- **Easily apply new architectures to multiple problems and datasets**
Apply new models across the extensive set of tasks and datasets that Ludwig
supports. Ludwig includes a
[full benchmarking toolkit](https://arxiv.org/abs/2111.04260) accessible to
any user, for running experiments with multiple models across multiple
datasets with just a simple configuration.
- **Highly configurable data preprocessing, modeling, and metrics**
Any and all aspects of the model architecture, training loop, hyperparameter
search, and backend infrastructure can be modified as additional fields in
the declarative configuration to customize the pipeline to meet your
requirements. For details on what can be configured, check out
[Ludwig Configuration](https://ludwig.ai/latest/configuration/)
docs.
- **Multi-modal, multi-task learning out-of-the-box**
Mix and match tabular data, text, images, and even audio into complex model
configurations without writing code.
- **Rich model exporting and tracking**
Automatically track all trials and metrics with tools like Tensorboard,
Comet ML, Weights & Biases, MLFlow, and Aim Stack.
- **Automatically scale training to multi-GPU, multi-node clusters**
Go from training on your local machine to the cloud without code changes.
- **Low-code interface for state-of-the-art models, including pre-trained Huggingface Transformers**
Ludwig also natively integrates with pre-trained models, such as the ones
available in [Huggingface Transformers](https://huggingface.co/docs/transformers/index).
Users can choose from a vast collection of state-of-the-art pre-trained
PyTorch models to use without needing to write any code at all. For example,
training a BERT-based sentiment analysis model with Ludwig is as simple as:
```shell
ludwig train --dataset sst5 --config_str "{input_features: [{name: sentence, type: text, encoder: bert}], output_features: [{name: label, type: category}]}"
```
- **Low-code interface for AutoML**
[Ludwig AutoML](https://ludwig.ai/latest/user_guide/automl/)
allows users to obtain trained models by providing just a dataset, the
target column, and a time budget.
```python
auto_train_results = ludwig.automl.auto_train(dataset=my_dataset_df, target=target_column_name, time_limit_s=7200)
```
- **Easy productionisation**
Ludwig makes it easy to serve deep learning models, including on GPUs.
Launch a REST API for your trained Ludwig model.
```shell
ludwig serve --model_path=/path/to/model
```
Ludwig supports exporting models to efficient Torchscript bundles.
```shell
ludwig export_torchscript --model_path=/path/to/model
```
# 📚 Tutorials
- [Text Classification](https://ludwig.ai/latest/examples/text_classification)
- [Tabular Data Classification](https://ludwig.ai/latest/examples/adult_census_income)
- [Image Classification](https://ludwig.ai/latest/examples/mnist)
- [Multimodal Classification](https://ludwig.ai/latest/examples/multimodal_classification)
# 🔬 Example Use Cases
- [Named Entity Recognition Tagging](https://ludwig.ai/latest/examples/ner_tagging)
- [Natural Language Understanding](https://ludwig.ai/latest/examples/nlu)
- [Machine Translation](https://ludwig.ai/latest/examples/machine_translation)
- [Chit-Chat Dialogue Modeling through seq2seq](https://ludwig.ai/latest/examples/seq2seq)
- [Sentiment Analysis](https://ludwig.ai/latest/examples/sentiment_analysis)
- [One-shot Learning with Siamese Networks](https://ludwig.ai/latest/examples/oneshot)
- [Visual Question Answering](https://ludwig.ai/latest/examples/visual_qa)
- [Spoken Digit Speech Recognition](https://ludwig.ai/latest/examples/speech_recognition)
- [Speaker Verification](https://ludwig.ai/latest/examples/speaker_verification)
- [Binary Classification (Titanic)](https://ludwig.ai/latest/examples/titanic)
- [Timeseries forecasting](https://ludwig.ai/latest/examples/forecasting)
- [Timeseries forecasting (Weather)](https://ludwig.ai/latest/examples/weather)
- [Movie rating prediction](https://ludwig.ai/latest/examples/movie_ratings)
- [Multi-label classification](https://ludwig.ai/latest/examples/multi_label)
- [Multi-Task Learning](https://ludwig.ai/latest/examples/multi_task)
- [Simple Regression: Fuel Efficiency Prediction](https://ludwig.ai/latest/examples/fuel_efficiency)
- [Fraud Detection](https://ludwig.ai/latest/examples/fraud)
# 💡 More Information
Read our publications on [Ludwig](https://arxiv.org/pdf/1909.07930.pdf), [declarative ML](https://arxiv.org/pdf/2107.08148.pdf), and [Ludwig’s SoTA benchmarks](https://openreview.net/pdf?id=hwjnu6qW7E4).
Learn more about [how Ludwig works](https://ludwig.ai/latest/user_guide/how_ludwig_works/), [how to get started](https://ludwig.ai/latest/getting_started/), and work through more [examples](https://ludwig.ai/latest/examples).
If you are interested in [contributing](https://github.com/ludwig-ai/ludwig/blob/main/CONTRIBUTING.md), have questions, comments, or thoughts to share, or if you just want to be in the
know, please consider [joining our Community Discord](https://discord.gg/CBgdrGnZjy) and follow us on [X](https://twitter.com/ludwig_ai)!
# 🤝 Join the community to build Ludwig with us
Ludwig is an actively managed open-source project that relies on contributions from folks just like
you. Consider joining the active group of Ludwig contributors to make Ludwig an even
more accessible and feature-rich framework for everyone to use!
## Star History
[Star History Chart](https://star-history.com/#ludwig-ai/ludwig&Date)
# 👋 Getting Involved
- [Discord](https://discord.gg/CBgdrGnZjy)
- [X (Twitter)](https://twitter.com/ludwig_ai)
- [Medium](https://medium.com/ludwig-ai)
- [GitHub Issues](https://github.com/ludwig-ai/ludwig/issues)
================================================
FILE: README_KR.md
================================================
_Declarative deep learning framework built for scale and efficiency._
[PyPI version](https://badge.fury.io/py/ludwig)
[Discord](https://discord.gg/CBgdrGnZjy)
[DockerHub](https://hub.docker.com/r/ludwigai)
[Downloads](https://pepy.tech/project/ludwig)
[License](https://github.com/ludwig-ai/ludwig/blob/main/LICENSE)
[X](https://twitter.com/ludwig_ai)
# 📖 What is Ludwig?
Ludwig is a **low-code** framework for building **custom** AI models such as **LLMs** and other deep neural networks.
Key features:
- 🛠 **Build custom models with ease:** a declarative YAML configuration file is all you need to train a state-of-the-art LLM on your data. Supports multi-task and multi-modal learning. Comprehensive config validation detects invalid parameter combinations and prevents runtime failures.
- ⚡ **Optimized for scale and efficiency:** automatic batch size selection, distributed training ([DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed)), parameter-efficient fine-tuning ([PEFT](https://github.com/huggingface/peft)), 4-bit quantization (QLoRA), paged and 8-bit optimizers, and support for larger-than-memory datasets.
- 📐 **Expert-level control:** retain full control of your models down to the activation functions. Supports hyperparameter optimization, explainability, and rich metric visualizations.
- 🧱 **Modular and extensible:** experiment with different model architectures, tasks, features, and modalities by changing just a few parameters in the config. Think of it as building blocks for deep learning.
- 🚢 **Engineered for production:** prebuilt [Docker](https://hub.docker.com/u/ludwigai) containers, native support for running with [Ray](https://www.ray.io/) on [Kubernetes](https://github.com/ray-project/kuberay), export models to [Torchscript](https://pytorch.org/docs/stable/jit.html) and [Triton](https://developer.nvidia.com/triton-inference-server), upload to [HuggingFace](https://huggingface.co/models) with one command.
Ludwig is hosted by the [Linux Foundation AI & Data](https://lfaidata.foundation/).
**Tech stack:** Python 3.12 | PyTorch 2.6 | Pydantic 2 | Transformers 5 | Ray 2.54

# 💾 Installation
Install from PyPI. Ludwig requires Python 3.12 or later.
```shell
pip install ludwig
```
Or install with all optional dependencies:
```shell
pip install ludwig[full]
```
Please see the [contributing guide](https://github.com/ludwig-ai/ludwig/blob/main/CONTRIBUTING.md) for more detailed installation instructions.
# 🚂 Getting Started
Want a quick look at Ludwig's features? Check out this Colab notebook 🚀 [Open in Colab](https://colab.research.google.com/drive/1lB4ALmEyvcMycE3Mlnsd7I3bc0zxvk39)
Looking to fine-tune LLMs? Check out these notebooks:
1. Fine-Tune Llama-2-7b: [Open in Colab](https://colab.research.google.com/drive/1r4oSEwRJpYKBPM0M0RSh0pBEYK_gBKbe)
1. Fine-Tune Llama-2-13b: [Open in Colab](https://colab.research.google.com/drive/1zmSEzqZ7v4twBrXagj1TE_C--RNyVAyu)
1. Fine-Tune Mistral-7b: [Open in Colab](https://colab.research.google.com/drive/1i_8A1n__b7ljRWHzIsAdhO7u7r49vUm4)
For a full tutorial, check out the official [getting started guide](https://ludwig.ai/latest/getting_started/), or take a look at the end-to-end [examples](https://ludwig.ai/latest/examples).
## Large Language Model Fine-Tuning
[Open in Colab](https://colab.research.google.com/drive/1c3AO8l_H6V_x37RwQ8V7M6A-RmcBf2tG?usp=sharing)
Let's fine-tune a pretrained LLM to follow instructions like a chatbot ("instruction tuning").
### Prerequisites
- [HuggingFace API Token](https://huggingface.co/docs/hub/security-tokens)
- Access approval to your chosen base model (e.g., [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B))
- GPU with at least 12 GiB of VRAM (in our tests, we used an Nvidia T4)
### Running
We'll use the [Stanford Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) dataset, which is formatted as a table-like file that looks like this:
| instruction | input | output |
| :-----------------------------------------------: | :--------------: | :-----------------------------------------------: |
| Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... |
| Arrange the items given below in the order to ... | cake, me, eating | I eating cake. |
| Write an introductory paragraph about a famous... | Michelle Obama | Michelle Obama is an inspirational woman who r... |
| ... | ... | ... |
Create a YAML config file named `model.yaml` with the following:
```yaml
model_type: llm
base_model: meta-llama/Llama-3.1-8B
quantization:
  bits: 4
adapter:
  type: lora
prompt:
  template: |
    Below is an instruction that describes a task, paired with an input that may provide further context.
    Write a response that appropriately completes the request.
    ### Instruction:
    {instruction}
    ### Input:
    {input}
    ### Response:
input_features:
  - name: prompt
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  gradient_accumulation_steps: 16
  epochs: 3
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.01
preprocessing:
  sample_ratio: 0.1
backend:
  type: local
```
Now let's train the model:
```bash
export HUGGING_FACE_HUB_TOKEN=""
ludwig train --config model.yaml --dataset "ludwig://alpaca"
```
## Supervised ML
Let's build a neural network that predicts whether a given movie critic's review on [Rotten Tomatoes](https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset) was positive or negative.
The dataset is a CSV file that looks like this:
| movie_title | content_rating | genres | runtime | top_critic | review_content | recommended |
| :------------------: | :------------: | :------------------------------: | :-----: | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
| Deliver Us from Evil | R | Action & Adventure, Horror | 117.0 | TRUE | Director Scott Derrickson and his co-writer, Paul Harris Boardman, deliver a routine procedural with unremarkable frights. | 0 |
| Barbara | PG-13 | Art House & International, Drama | 105.0 | FALSE | Somehow, in this stirring narrative, Barbara manages to keep hold of her principles, and her humanity and courage, and battles to save a dissident teenage girl whose life the Communists are trying to destroy. | 1 |
| Horrible Bosses | R | Comedy | 98.0 | FALSE | These bosses cannot justify either murder or lasting comic memories, fatally compromising a farce that could have been great but ends up merely mediocre. | 0 |
| ... | ... | ... | ... | ... | ... | ... |
Download a sample of the dataset from [here](https://ludwig.ai/latest/data/rotten_tomatoes.csv).
```bash
wget https://ludwig.ai/latest/data/rotten_tomatoes.csv
```
Next, create a YAML config file named `model.yaml`:
```yaml
input_features:
  - name: genres
    type: set
    preprocessing:
      tokenizer: comma
  - name: content_rating
    type: category
  - name: top_critic
    type: binary
  - name: runtime
    type: number
  - name: review_content
    type: text
    encoder:
      type: embed
output_features:
  - name: recommended
    type: binary
```
That's it! Now let's train the model:
```bash
ludwig train --config model.yaml --dataset rotten_tomatoes.csv
```
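**Happy modeling**
Try applying Ludwig to your data. [Reach out on Discord](https://discord.gg/CBgdrGnZjy) if you have any questions.
# ❓ Why you should use Ludwig
- **Minimal machine learning boilerplate**
Ludwig takes care of the engineering complexity of machine learning out of the box, enabling researchers to focus on building models at the highest level of abstraction. Data preprocessing, hyperparameter optimization, device management, and distributed training for `torch.nn.Module` models come completely free.
- **Easily build your benchmarks**
Creating a state-of-the-art baseline and comparing it with a new model is a simple config change.
- **Easily apply new architectures to multiple problems and datasets**
Apply new models across the extensive set of tasks and datasets that Ludwig supports. Ludwig includes a [full benchmarking toolkit](https://arxiv.org/abs/2111.04260), accessible to any user, for running experiments with multiple models across multiple datasets with just a simple configuration.
- **Highly configurable data preprocessing, modeling, and metrics**
Every aspect of the model architecture, training loop, hyperparameter search, and backend infrastructure can be modified as additional fields in the declarative configuration to customize the pipeline to your requirements. For details on what can be configured, check out the [Ludwig Configuration](https://ludwig.ai/latest/configuration/) docs.
- **Multi-modal, multi-task learning out of the box**
Mix and match tabular data, text, images, and even audio into complex model configurations without writing code.
- **Rich model exporting and tracking**
Automatically track all trials and metrics with tools like Tensorboard, Comet ML, Weights & Biases, MLFlow, and Aim Stack.
- **Automatically scale training to multi-GPU, multi-node clusters**
Go from training on your local machine to the cloud without code changes.
- **Low-code interface for state-of-the-art models, including pre-trained Huggingface Transformers**
Ludwig natively integrates with pre-trained models, such as the ones available in [Huggingface Transformers](https://huggingface.co/docs/transformers/index). Users can choose from a vast collection of state-of-the-art pre-trained PyTorch models without writing any code at all. For example, training a BERT-based sentiment analysis model with Ludwig is as simple as: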
```shell
ludwig train --dataset sst5 --config_str "{input_features: [{name: sentence, type: text, encoder: bert}], output_features: [{name: label, type: category}]}"
```
- **Low-code interface for AutoML**
[Ludwig AutoML](https://ludwig.ai/latest/user_guide/automl/) allows users to obtain trained models by providing just a dataset, the target column, and a time budget.
```python
auto_train_results = ludwig.automl.auto_train(dataset=my_dataset_df, target=target_column_name, time_limit_s=7200)
```
- **Easy productionisation**
Ludwig makes it easy to serve deep learning models, including on GPUs. Launch a REST API for your trained Ludwig model.
```shell
ludwig serve --model_path=/path/to/model
```
Ludwig supports exporting models to efficient Torchscript bundles.
```shell
ludwig export_torchscript --model_path=/path/to/model
```
# 📚 Tutorials
- [Text Classification](https://ludwig.ai/latest/examples/text_classification)
- [Tabular Data Classification](https://ludwig.ai/latest/examples/adult_census_income)
- [Image Classification](https://ludwig.ai/latest/examples/mnist)
- [Multimodal Classification](https://ludwig.ai/latest/examples/multimodal_classification)
# 🔬 Example Use Cases
- [Named Entity Recognition Tagging](https://ludwig.ai/latest/examples/ner_tagging)
- [Natural Language Understanding](https://ludwig.ai/latest/examples/nlu)
- [Machine Translation](https://ludwig.ai/latest/examples/machine_translation)
- [Dialogue Modeling through seq2seq](https://ludwig.ai/latest/examples/seq2seq)
- [Sentiment Analysis](https://ludwig.ai/latest/examples/sentiment_analysis)
- [One-shot Learning with Siamese Networks](https://ludwig.ai/latest/examples/oneshot)
- [Visual Question Answering](https://ludwig.ai/latest/examples/visual_qa)
- [Spoken Digit Speech Recognition](https://ludwig.ai/latest/examples/speech_recognition)
- [Speaker Verification](https://ludwig.ai/latest/examples/speaker_verification)
- [Binary Classification (Titanic)](https://ludwig.ai/latest/examples/titanic)
- [Timeseries forecasting](https://ludwig.ai/latest/examples/forecasting)
- [Timeseries forecasting (Weather)](https://ludwig.ai/latest/examples/weather)
- [Movie rating prediction](https://ludwig.ai/latest/examples/movie_ratings)
- [Multi-label classification](https://ludwig.ai/latest/examples/multi_label)
- [Multi-Task Learning](https://ludwig.ai/latest/examples/multi_task)
- [Simple Regression: Fuel Efficiency Prediction](https://ludwig.ai/latest/examples/fuel_efficiency)
- [Fraud Detection](https://ludwig.ai/latest/examples/fraud)
# 💡 More Information
Read our publications on [Ludwig](https://arxiv.org/pdf/1909.07930.pdf), [declarative ML](https://arxiv.org/pdf/2107.08148.pdf), and [Ludwig's SoTA benchmarks](https://openreview.net/pdf?id=hwjnu6qW7E4).
Learn more about [how Ludwig works](https://ludwig.ai/latest/user_guide/how_ludwig_works/), read the [getting started guide](https://ludwig.ai/latest/getting_started/), and work through more [examples](https://ludwig.ai/latest/examples).
If you are interested in [contributing](https://github.com/ludwig-ai/ludwig/blob/main/CONTRIBUTING.md), have questions, comments, or thoughts to share, or just want to stay up to date, please [join our Community Discord](https://discord.gg/CBgdrGnZjy) and follow us on [X](https://twitter.com/ludwig_ai)!
# 🤝 Join the community to build Ludwig with us
Ludwig is an actively managed open-source project that relies on contributions from people just like you. Consider joining the active group of Ludwig contributors to make Ludwig an even more accessible and feature-rich framework for everyone to use!
## Star History
[Star History Chart](https://star-history.com/#ludwig-ai/ludwig&Date)
# 👋 Getting Involved
- [Discord](https://discord.gg/CBgdrGnZjy)
- [X (Twitter)](https://twitter.com/ludwig_ai)
- [Medium](https://medium.com/ludwig-ai)
- [GitHub Issues](https://github.com/ludwig-ai/ludwig/issues)
================================================
FILE: RELEASES.md
================================================
# Releasing
## Release procedure
1. Update version number in `ludwig/globals.py`
1. Update version number in `setup.py`
1. Commit
1. Tag the commit with the version number `vX.Y.Z` with a meaningful message
1. Push with `--tags`
1. If a non-patch release, edit the release notes
1. Create a release for PyPI: `python setup.py sdist`
1. Release on PyPI: `twine upload --repository pypi dist/ludwig-X.Y.Z.tar.gz`
## Release policy
Ludwig follows [Semantic Versioning](https://semver.org).
In general, for major and minor releases, maintainers should all agree on the release.
For patches, in particular time sensitive ones, a single maintainer can release without a full consensus, but this practice should be reserved for critical situations.
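As a rough illustration of how the policy maps onto `vX.Y.Z` tags, the sketch below classifies a release as major, minor, or patch by comparing two tags. The helper names (`parse_tag`, `release_kind`, `needs_full_consensus`) and the example version numbers are hypothetical; this is not tooling that ships with Ludwig.

```python
import re

# Matches tags of the form vX.Y.Z, as used in the release procedure above.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> tuple[int, int, int]:
    """Split a vX.Y.Z tag into its (major, minor, patch) integers."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a vX.Y.Z tag: {tag!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def release_kind(previous: str, new: str) -> str:
    """Classify the step from `previous` to `new` under Semantic Versioning."""
    old, cur = parse_tag(previous), parse_tag(new)
    if cur <= old:
        raise ValueError("new version must be greater than the previous one")
    if cur[0] != old[0]:
        return "major"
    if cur[1] != old[1]:
        return "minor"
    return "patch"


def needs_full_consensus(previous: str, new: str) -> bool:
    """Major and minor releases call for agreement from all maintainers."""
    return release_kind(previous, new) != "patch"


# Hypothetical version numbers, for illustration only.
print(release_kind("v1.2.3", "v1.2.4"), needs_full_consensus("v1.2.3", "v1.2.4"))
print(release_kind("v1.2.3", "v1.3.0"), needs_full_consensus("v1.2.3", "v1.3.0"))
```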
================================================
FILE: docker/README.md
================================================
# Ludwig Docker Images
These images provide Ludwig, a toolbox to train and evaluate deep learning models
without the need to write code. Ludwig Docker images contain the full set of prerequisite
packages to support these capabilities:
- text features
- image features
- audio features
- visualizations
- hyperparameter optimization
- distributed training
- model serving
## Repositories
These four repositories contain a version of Ludwig with full features built
from the project's `master` branch.
- `ludwigai/ludwig`: Ludwig packaged with PyTorch
- `ludwigai/ludwig-gpu`: Ludwig packaged with a GPU-enabled version of PyTorch
- `ludwigai/ludwig-ray`: Ludwig packaged with PyTorch and [Ray](https://github.com/ray-project/ray) 2.3.1
- `ludwigai/ludwig-ray-gpu`: Ludwig packaged with GPU-enabled versions of PyTorch and [Ray](https://github.com/ray-project/ray) 2.3.1
## Image Tags
- `master` - built from Ludwig's `master` branch
- `nightly` - nightly build of Ludwig's software.
- `sha-` - version of Ludwig software at designated git sha1
7-character commit point.
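For scripting, an image reference can be assembled from a repository name and one of the tags above. This is a small sketch under the assumption that a `sha-` tag is the prefix followed by the first 7 characters of the commit SHA-1; `sha_tag` and `image_ref` are hypothetical helpers, and the commit hash is a made-up placeholder.

```python
def sha_tag(commit_sha: str) -> str:
    """Build a `sha-` tag from the first 7 characters of a git commit SHA-1."""
    if len(commit_sha) < 7:
        raise ValueError("expected at least 7 characters of the commit SHA-1")
    return f"sha-{commit_sha[:7]}"


def image_ref(repository: str, tag: str) -> str:
    """Join a Docker repository and tag into a pullable image reference."""
    return f"{repository}:{tag}"


# Placeholder commit hash, not a real Ludwig commit.
commit = "0123456789abcdef0123456789abcdef01234567"
print(image_ref("ludwigai/ludwig", "master"))
print(image_ref("ludwigai/ludwig-gpu", sha_tag(commit)))
```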
## Running Containers
Examples of using the `ludwigai/ludwig:master` image to:
- run the `ludwig` CLI command,
- run a Python program that uses the Ludwig API, or
- view Ludwig results with Tensorboard.
For the purposes of these examples, assume this host directory structure:
```
/top/level/directory/path/
    data/
        train.csv
    src/
        config.yaml
        ludwig_api_program.py
```
### Run Ludwig CLI
```
# set shell variable to parent directory
parent_path=/top/level/directory/path
# invoke docker run command to execute the ludwig cli
# map host directory ${parent_path}/data to container /data directory
# map host directory ${parent_path}/src to container /src directory
docker run -v ${parent_path}/data:/data \
-v ${parent_path}/src:/src \
ludwigai/ludwig:master \
experiment --config /src/config.yaml \
--dataset /data/train.csv \
--output_directory /src/results
```
Experiment results can be found in host directory `/top/level/directory/path/src/results`
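The same invocation can be assembled from Python, which is convenient when the parent directory varies between machines. The sketch below only builds and prints the argument list; `docker_cli_command` is a hypothetical helper, and the mounts and flags simply mirror the shell example above.

```python
import shlex
from pathlib import PurePosixPath


def docker_cli_command(parent_path: str, image: str = "ludwigai/ludwig:master") -> list[str]:
    """Build the `docker run` argument list for the Ludwig CLI example."""
    parent = PurePosixPath(parent_path)
    return [
        "docker", "run",
        # map host data/ and src/ directories into the container
        "-v", f"{parent / 'data'}:/data",
        "-v", f"{parent / 'src'}:/src",
        image,
        # arguments passed to the image's `ludwig` entrypoint
        "experiment",
        "--config", "/src/config.yaml",
        "--dataset", "/data/train.csv",
        "--output_directory", "/src/results",
    ]


cmd = docker_cli_command("/top/level/directory/path")
print(shlex.join(cmd))
# To execute it (requires Docker on the host):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues if the parent path contains spaces.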
### Run Python program using Ludwig APIs
```
# set shell variable to parent directory
parent_path=/top/level/directory/path
# invoke docker run command to execute Python interpreter
# map host directory ${parent_path}/data to container /data directory
# map host directory ${parent_path}/src to container /src directory
# set current working directory to container /src directory
# change default entrypoint from ludwig to python
docker run -v ${parent_path}/data:/data \
-v ${parent_path}/src:/src \
-w /src \
--entrypoint python \
ludwigai/ludwig:master /src/ludwig_api_program.py
```
Ludwig results can be found in host
directory `/top/level/directory/path/src/results`
### View Ludwig Tensorboard results
```
# set shell variable to parent directory
parent_path=/top/level/directory/path
# invoke docker run command to execute Tensorboard
# map host directory ${parent_path}/src to container /src directory
# set up mapping from localhost port 6006 to container port 6006
# change default entrypoint from ludwig to tensorboard
# --logdir container location of Tensorboard logs /src/results/_/model/logs
# --bind_all Tensorboard serves on all public container interfaces
docker run -v ${parent_path}/src:/src \
-p 6006:6006 \
--entrypoint tensorboard \
ludwigai/ludwig:master \
--logdir /src/results/experiment_run/model/logs \
--bind_all
```
Point browser to `http://localhost:6006` to see Tensorboard dashboard.
### Devcontainer
If you want to contribute to Ludwig, you can set up a Docker container with all the dependencies
installed as a full-featured development environment. This can be done using devcontainers with VS Code:
https://code.visualstudio.com/docs/devcontainers/containers
You can find the `devcontainer.json` file within the top-level `.devcontainer` folder.
================================================
FILE: docker/ludwig/Dockerfile
================================================
#
# Ludwig Docker image with full set of prerequisite packages to support these capabilities
# text features
# image features
# audio features
# visualizations
# hyperparameter optimization
# distributed training
# model serving
#
FROM python:3.12-slim
RUN apt-get -y update && apt-get -y install \
git \
libsndfile1 \
build-essential \
g++ \
cmake \
ffmpeg \
sox \
libsox-dev
RUN pip install -U pip
WORKDIR /ludwig
RUN pip install --no-cache-dir torch==2.6.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
COPY . .
RUN pip install --no-cache-dir '.[full]'
WORKDIR /data
ENTRYPOINT ["ludwig"]
================================================
FILE: docker/ludwig-gpu/Dockerfile
================================================
#
# Ludwig Docker image with full set of prerequisite packages to support these capabilities
# text features
# image features
# audio features
# visualizations
# hyperparameter optimization
# distributed training
# model serving
#
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel
RUN apt-get -y update && DEBIAN_FRONTEND="noninteractive" apt-get -y install \
git \
libsndfile1 \
cmake \
ffmpeg \
sox \
libsox-dev
RUN pip install -U pip
WORKDIR /ludwig
COPY . .
RUN pip install --no-cache-dir '.[full]'
WORKDIR /data
ENTRYPOINT ["ludwig"]
================================================
FILE: docker/ludwig-ray/Dockerfile
================================================
#
# Ludwig Docker image with Ray support and full dependencies including:
# text features
# image features
# audio features
# visualizations
# hyperparameter optimization
# distributed training
# model serving
#
FROM rayproject/ray:2.44.1-py312
RUN sudo apt-get update && DEBIAN_FRONTEND="noninteractive" sudo apt-get install -y \
build-essential \
wget \
git \
curl \
libsndfile1 \
cmake \
tzdata \
rsync \
vim \
ffmpeg \
sox \
libsox-dev
RUN pip install -U pip
WORKDIR /ludwig
RUN pip install --no-cache-dir torch==2.6.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
COPY . .
RUN pip install --no-cache-dir '.[full]' --extra-index-url https://download.pytorch.org/whl/cpu
================================================
FILE: docker/ludwig-ray-gpu/Dockerfile
================================================
#
# Ludwig Docker image with Ray support and full dependencies including:
# text features
# image features
# audio features
# visualizations
# hyperparameter optimization
# distributed training
# model serving
#
FROM rayproject/ray:2.44.1-py312-cu124
RUN sudo apt-get update && \
DEBIAN_FRONTEND="noninteractive" sudo apt-get install -y \
build-essential \
wget \
git \
curl \
libsndfile1 \
cmake \
tzdata \
rsync \
vim \
ffmpeg \
sox \
libsox-dev
RUN pip install -U pip
WORKDIR /ludwig
RUN pip install --no-cache-dir torch==2.6.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124
COPY . .
RUN pip install --no-cache-dir '.[full]'
================================================
FILE: examples/README.md
================================================
# Examples
This directory contains example programs demonstrating Ludwig's Python APIs.
| Directory | Examples Provided |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| hyperopt | Demonstrates Ludwig's hyperparameter optimization capability. |
| kfold_cv | Provides two examples for performing a k-fold cross validation analysis. One example uses the `ludwig experiment` CLI. The other example uses the `ludwig.experiment.kfold_cross_validate()` API function. |
| mnist | Creates a model config data structure from a YAML file and trains a model. Programmatically modifies the model config data structure to evaluate several different neural network architectures. A Jupyter notebook demonstrates using a hold-out test data set to visualize model performance for alternative model architectures. |
| titanic | Trains a simple model with the model config contained in a YAML file. Trains multiple models from YAML files and generates visualizations to compare training results. A Jupyter notebook demonstrates how to programmatically create visualizations. |
| serve | Demonstrates running the Ludwig HTTP model server. A sample Python program illustrates how to invoke the REST API to get predictions from input features. |
| class_imbalance | Demonstrates using our class balancing feature to over-sample an imbalanced dataset. |
================================================
FILE: examples/calibration/README.md
================================================
# Calibration Examples
Drawing on the methods in
_On Calibration of Modern Neural Networks_ (Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger), Ludwig supports
temperature scaling for binary and category output features. Temperature scaling brings a model's output probabilities
closer to the true likelihood while preserving the same accuracy and top-k predictions.
To enable calibration, add `calibration: True` to any binary or category output feature:
```
output_features:
  - name: Cover_Type
    type: category
    calibration: True
```
With calibration enabled, Ludwig will attempt to find a scale factor (temperature) which will bring the probabilities
closer to the true likelihoods using the validation set. This calibration phase is run after training is complete. If
no validation set is provided, the training set is used for calibration.
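The idea can be sketched in a few lines of plain Python: divide the logits by a temperature before the softmax, and choose the temperature that minimizes the negative log-likelihood on held-out data. This is a simplified illustration using a grid search over made-up logits and labels; it is not Ludwig's calibration code, and `fit_temperature` is a hypothetical helper.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (1.0 means unscaled)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Grid search: return the candidate temperature with the lowest NLL."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Made-up validation data: an overconfident model that gets one of four rows wrong.
val_logits = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [5.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 0]

temperature = fit_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 3.0, 5.0])
before = softmax(val_logits[0])
after = softmax(val_logits[0], temperature)
print(temperature, max(before), max(after))
```

Because every logit in a row is divided by the same positive temperature, the ranking of classes is unchanged, which is why accuracy and top-k predictions are preserved while the confidence values shift.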
To visualize the effects of calibration in Ludwig, you can use Calibration Plots, which bin the data based on model
probability and plot the model probability (X) versus the observed rate (Y) for each bin.
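The binning behind such a plot can be sketched as follows. The function groups predictions into equal-width probability bins and reports, per non-empty bin, the mean predicted probability (X) and the observed positive rate (Y). The data is made up and `calibration_bins` is a hypothetical helper, not the function Ludwig uses to draw its plots.

```python
def calibration_bins(probabilities, outcomes, num_bins=5):
    """Return (mean probability, observed rate, count) for each non-empty bin."""
    sums = [[0.0, 0.0, 0] for _ in range(num_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        sums[index][0] += p  # running sum of predicted probabilities
        sums[index][1] += y  # running count of positive outcomes
        sums[index][2] += 1  # number of examples in the bin
    return [(ps / n, ys / n, n) for ps, ys, n in sums if n > 0]


# Made-up predictions for a binary feature and the observed 0/1 outcomes.
probs = [0.10, 0.15, 0.55, 0.58, 0.90, 0.95]
outcomes = [0, 0, 1, 0, 1, 1]

bins = calibration_bins(probs, outcomes)
for mean_prob, observed_rate, count in bins:
    print(f"mean probability={mean_prob:.3f} observed rate={observed_rate:.2f} n={count}")
```

For a well-calibrated model the two values in each bin are close, so the plotted points fall near the diagonal.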
================================================
FILE: examples/calibration/train_forest_cover_calibrated.py
================================================
#!/usr/bin/env python
import copy
import logging
import shutil
import numpy as np
import yaml
import ludwig.visualize
from ludwig.api import LudwigModel
from ludwig.datasets import forest_cover
# clean out prior results
shutil.rmtree("./results_forest_cover", ignore_errors=True)
shutil.rmtree("./visualizations_forest_cover", ignore_errors=True)
# Download and prepare the dataset
dataset = forest_cover.load()
config_yaml = """
input_features:
  - name: Elevation
    type: number
  - name: Aspect
    type: number
  - name: Slope
    type: number
  - name: Horizontal_Distance_To_Hydrology
    type: number
  - name: Vertical_Distance_To_Hydrology
    type: number
  - name: Horizontal_Distance_To_Roadways
    type: number
  - name: Hillshade_9am
    type: number
  - name: Hillshade_Noon
    type: number
  - name: Hillshade_3pm
    type: number
  - name: Horizontal_Distance_To_Fire_Points
    type: number
  - name: Wilderness_Area
    type: category
  - name: Soil_Type
    type: category
output_features:
  - name: Cover_Type
    type: category
combiner:
  type: transformer
trainer:
  batch_size: 256
  learning_rate: .001
  epochs: 1
"""
uncalibrated_config = yaml.safe_load(config_yaml)
scaled_config = copy.deepcopy(uncalibrated_config)
scaled_config["output_features"][0]["calibration"] = True
uncalibrated_model = LudwigModel(config=uncalibrated_config, logging_level=logging.INFO)
uncalibrated_model.train(
    dataset,
    model_name="uncalibrated",
    experiment_name="forest_cover_calibration",
    output_directory="results_forest_cover",
)
scaled_model = LudwigModel(config=scaled_config, logging_level=logging.INFO)
scaled_model.train(
    dataset, model_name="scaled", experiment_name="forest_cover_calibration", output_directory="results_forest_cover"
)
# Generate predictions and performance statistics. The full `dataset` is passed to evaluate() here,
# not a held-out test split alone.
uncalibrated_test_stats, uncalibrated_test_predictions, _ = uncalibrated_model.evaluate(
    dataset, collect_predictions=True, collect_overall_stats=True
)
scaled_test_stats, scaled_test_predictions, _ = scaled_model.evaluate(
    dataset, collect_predictions=True, collect_overall_stats=True
)
uncalibrated_probs = np.stack(uncalibrated_test_predictions["Cover_Type_probabilities"], axis=0)
scaled_probs = np.stack(scaled_test_predictions["Cover_Type_probabilities"], axis=0)
ludwig.visualize.calibration_1_vs_all(
    probabilities_per_model=[uncalibrated_probs, scaled_probs],
    model_names=["Uncalibrated", "Calibrated"],
    ground_truth=dataset["Cover_Type"],
    metadata=uncalibrated_model.training_set_metadata,
    output_feature_name="Cover_Type",
    top_n_classes=[7, 7],
    labels_limit=7,
    output_directory="visualizations_forest_cover",
    file_format="png",
)
================================================
FILE: examples/calibration/train_mushroom_edibility_calibrated.py
================================================
#!/usr/bin/env python
import copy
import logging
import shutil
import numpy as np
import yaml
import ludwig.visualize
from ludwig.api import LudwigModel
from ludwig.datasets import mushroom_edibility
# clean out prior results
shutil.rmtree("./results_mushroom_edibility", ignore_errors=True)
shutil.rmtree("./visualizations_mushroom_edibility", ignore_errors=True)
# Download and prepare the dataset
dataset = mushroom_edibility.load()
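# This dataset has no split, so add a split column (0 = train, 1 = validation, 2 = test).
# Item assignment is used because pandas does not create a new column through attribute assignment.
dataset["split"] = np.random.choice(3, len(dataset), p=(0.7, 0.1, 0.2))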
config_yaml = """
input_features:
  - name: cap-shape
    type: category
  - name: cap-surface
    type: category
  - name: cap-color
    type: category
  - name: bruises?
    type: category
  - name: odor
    type: category
  - name: gill-attachment
    type: category
  - name: gill-spacing
    type: category
  - name: gill-size
    type: category
  - name: gill-color
    type: category
  - name: stalk-shape
    type: category
  - name: stalk-root
    type: category
  - name: stalk-surface-above-ring
    type: category
  - name: stalk-surface-below-ring
    type: category
  - name: stalk-color-above-ring
    type: category
  - name: stalk-color-below-ring
    type: category
  - name: veil-type
    type: category
  - name: veil-color
    type: category
  - name: ring-number
    type: category
  - name: ring-type
    type: category
  - name: spore-print-color
    type: category
  - name: population
    type: category
  - name: habitat
    type: category
output_features:
  - name: class
    type: category
combiner:
  type: concat
trainer:
  batch_size: 256
  learning_rate: .0001
  epochs: 10
"""
uncalibrated_config = yaml.safe_load(config_yaml)
scaled_config = copy.deepcopy(uncalibrated_config)
scaled_config["output_features"][0]["calibration"] = True
uncalibrated_model = LudwigModel(config=uncalibrated_config, logging_level=logging.INFO)
uncalibrated_model.train(
    dataset,
    model_name="uncalibrated",
    experiment_name="mushroom_edibility_calibration",
    output_directory="results_mushroom_edibility",
)
scaled_model = LudwigModel(config=scaled_config, logging_level=logging.INFO)
scaled_model.train(
    dataset,
    model_name="scaled",
    experiment_name="mushroom_edibility_calibration",
    output_directory="results_mushroom_edibility",
)
# Generate predictions and performance statistics. The full `dataset` is passed to evaluate() here,
# not a held-out test split alone.
uncalibrated_test_stats, uncalibrated_test_predictions, _ = uncalibrated_model.evaluate(
    dataset, collect_predictions=True, collect_overall_stats=True
)
scaled_test_stats, scaled_test_predictions, _ = scaled_model.evaluate(
    dataset, collect_predictions=True, collect_overall_stats=True
)
uncalibrated_probs = np.stack(uncalibrated_test_predictions["class_probabilities"], axis=0)
scaled_probs = np.stack(scaled_test_predictions["class_probabilities"], axis=0)
ludwig.visualize.calibration_1_vs_all(
    probabilities_per_model=[uncalibrated_probs, scaled_probs],
    model_names=["Uncalibrated", "Calibrated"],
    ground_truth=dataset["class"],
    metadata=uncalibrated_model.training_set_metadata,
    output_feature_name="class",
    top_n_classes=[3, 3],
    labels_limit=3,
    output_directory="visualizations_mushroom_edibility",
    file_format="png",
)
================================================
FILE: examples/class_imbalance/README.md
================================================
# Class Imbalance Example
This API example is based on Kaggle's [Imbalanced Insurance](https://www.kaggle.com/arashnic/imbalanced-data-practice) dataset, where the task is to predict whether customers will buy vehicle insurance.
### Preparatory Steps
Create and download your [Kaggle API Credentials](https://github.com/Kaggle/kaggle-api#api-credentials).
The Imbalanced Insurance dataset is hosted by Kaggle, and as such Ludwig will need to authenticate you through the Kaggle API to download the dataset.
### Examples
| File                         | Description                                                                                 |
| ---------------------------- | ------------------------------------------------------------------------------------------- |
| model_training.py            | Demonstrates oversampling by training two models: one without oversampling and one with it. |
| model_training_results.ipynb | Example for extracting training statistics and generating visualizations.                   |
Running `python model_training.py` trains a standard model with no class balancing and a balanced model with class balancing applied. The results of model training are stored in the following locations:
```
./results/
    balance_example_standard_model/
    balance_example_balanced_model/
```
The only difference between these two models is that the balanced model applies a small amount of oversampling on top of the shared configuration parameters.
Oversampling is configured by specifying the ratio that you want the minority class to have relative to the majority class.
For instance, if you specify 0.5 for the `oversample_minority` preprocessing parameter, the minority class will be oversampled until it is 50% of the size of the majority class.
In this example, the minority class was initially 19% of the size of the majority class, and we chose to increase that to 26%.
Though this doesn't seem like much, it is enough to yield small performance improvements without the degradation that over-fitting would cause.
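To make the ratio arithmetic concrete, here is a small sketch in plain Python. The class counts are hypothetical (chosen only so that the minority class starts at 19% of the majority class), and the helper `rows_to_add` is an illustration of the calculation, not part of Ludwig's API or a description of how Ludwig samples the extra rows.

```python
def rows_to_add(majority_count, minority_count, target_ratio):
    """Minority rows needed so that minority / majority reaches target_ratio."""
    target_minority = round(majority_count * target_ratio)
    # Never return a negative number: if the ratio is already met, nothing is added.
    return max(0, target_minority - minority_count)


# Hypothetical counts: the minority class is 19% of the size of the majority class.
majority = 100_000
minority = 19_000

extra = rows_to_add(majority, minority, target_ratio=0.26)
new_ratio = (minority + extra) / majority
print(f"add {extra} minority rows -> minority/majority = {new_ratio:.2f}")
```

Note that the ratio is measured against the majority class, not against the whole dataset, so a ratio of 0.26 still leaves the minority class well under a quarter of all rows.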
Here are the performance differences between the two models, followed by plots showing different metrics over training:
| Metric | Standard Model | Balanced Model |
| :------: | :------------: | :------------: |
| Loss | 0.3649 | 0.2758 |
| Accuracy | 0.7732 | 0.8237 |
| ROC AUC | 0.8533 | 0.8660 |
Here are the learning curve plots from both models:


Here is the comparison of model performances on ROC_AUC and Accuracy:

The creation of the learning curves is demonstrated in the Jupyter notebook `model_training_results.ipynb`. The comparison plot was generated using the ludwig visualize [compare performance](https://ludwig-ai.github.io/ludwig-docs/0.4/user_guide/visualizations/#compare-performance) command.
================================================
FILE: examples/class_imbalance/balanced_model_config.yaml
================================================
input_features:
  - name: Gender
    type: category
  - name: Age
    type: number
  - name: Driving_License
    type: binary
  - name: Region_Code
    type: number
  - name: Previously_Insured
    type: binary
  - name: Vehicle_Age
    type: category
  - name: Vehicle_Damage
    type: category
  - name: Annual_Premium
    type: number
  - name: Policy_Sales_Channel
    type: number
  - name: Vintage
    type: number
output_features:
  - name: Response
    type: binary
preprocessing:
  oversample_minority: 0.26
trainer:
  learning_rate: 0.0001
  learning_rate_scheduler:
    decay: exponential
    decay_rate: 0.9
    decay_steps: 30000
    staircase: True
  epochs: 50
================================================
FILE: examples/class_imbalance/model_training.py
================================================
#!/usr/bin/env python
# Class Imbalance Model Training Example
#
# This example trains one model with a standard config, and then a second model with a config that uses oversampling.
import logging
import shutil
# Import required libraries
from ludwig.api import LudwigModel
from ludwig.datasets import imbalanced_insurance
from ludwig.visualize import compare_performance
# clean out old results
shutil.rmtree("./results", ignore_errors=True)
shutil.rmtree("./visualizations", ignore_errors=True)
# list models to train
list_of_model_ids = ["standard_model", "balanced_model"]
list_of_eval_stats = []
training_set, val_set, test_set = imbalanced_insurance.load()
# Train models
for model_id in list_of_model_ids:
    print(">>>> training: ", model_id)
    # Define the Ludwig model object that drives model training
    model = LudwigModel(config=model_id + "_config.yaml", logging_level=logging.WARN)
    # initiate model training
    train_stats, _, _ = model.train(
        training_set=training_set,
        validation_set=val_set,
        test_set=test_set,
        experiment_name="balance_example",
        model_name=model_id,
        skip_save_model=True,
    )
    # evaluate model on test_set
    eval_stats, _, _ = model.evaluate(test_set)
    # save eval stats for later use
    list_of_eval_stats.append(eval_stats)
    print(">>>>>>> completed: ", model_id, "\n")
# Compare both models' evaluation statistics for the "Response" output feature
compare_performance(
    list_of_eval_stats,
    "Response",
    model_names=list_of_model_ids,
    output_directory="./visualizations",
    file_format="png",
)
================================================
FILE: examples/class_imbalance/model_training_results.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "8c1e31e4-d8d4-4e83-8f4c-f868365d14d7",
"metadata": {},
"source": [
"# Model Analysis\n",
"\n",
"This notebook will analyze the training results of the standard and balanced model on the [imbalanced insurance](https://www.kaggle.com/arashnic/imbalanced-data-practice) dataset. In order for the cells in this notebook to run, you must first run the following command to train the models:\n",
"```\n",
"python model_training.py\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a3b3b6c1-5d11-4a03-9dfa-b070e45b2adb",
"metadata": {},
"source": [
"## Import required libraries"
]
},
{
"cell_type": "code",
"execution_count": 226,
"id": "6a6bbd43-1333-4a0e-b895-c3e393d5ee07",
"metadata": {},
"outputs": [],
"source": [
"from ludwig.utils.data_utils import load_json\n",
"from ludwig.visualize import learning_curves\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os.path\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"id": "b5e4d240-7607-4036-9573-6b452523c18f",
"metadata": {},
"source": [
"## Learning Curves"
]
},
{
"cell_type": "markdown",
"id": "725dde7d-f7a5-4836-8cf1-a3f50acafd30",
"metadata": {},
"source": [
"### Create Plotting Data Function "
]
},
{
"cell_type": "code",
"execution_count": 227,
"id": "d8c35325-cfab-4ead-8699-1e98e55a8c7b",
"metadata": {},
"outputs": [],
"source": [
"def create_plot_ready_data(list_of_train_stats, model_names, metric, target):\n",
"    # List of splits to evaluate statistics for\n",
"    list_of_splits = ['training', 'validation', 'test']\n",
"\n",
"    # Empty list to fill with one df per model's stats\n",
"    list_of_train_stats_df = []\n",
"\n",
"    # For each model's stats, create a df with columns of stats for each split listed above\n",
"    for name, stats in zip(model_names, list_of_train_stats):\n",
"        list_of_dfs = []\n",
"        for split in list_of_splits:\n",
"            df = pd.DataFrame(stats[split][target])\n",
"            df.columns = [split + '_' + c for c in df.columns]\n",
"            list_of_dfs.append(df)\n",
"\n",
"        combined_df = pd.concat(list_of_dfs, axis=1)\n",
"        combined_df.name = name\n",
"        combined_df['epoch'] = combined_df.index + 1\n",
"        list_of_train_stats_df.append(combined_df)\n",
"\n",
"    # List holding the plot-ready data\n",
"    plot_ready_list = []\n",
"\n",
"    # consolidate the multiple training statistics dataframes\n",
"    for df in list_of_train_stats_df:\n",
"        for col in ['training', 'validation']:\n",
"            df2 = df[['epoch', col + '_{}'.format(metric)]].copy()\n",
"            df2.columns = ['epoch', '{}'.format(metric)]\n",
"            df2['split'] = col\n",
"            df2['model'] = df.name\n",
"            plot_ready_list.append(df2)\n",
"\n",
"    return pd.concat(plot_ready_list, axis=0, ignore_index=True)"
]
},
{
"cell_type": "markdown",
"id": "722f5f48-7026-427e-9bb7-388fec15a24f",
"metadata": {
"tags": []
},
"source": [
"### Create Plotting Data"
]
},
{
"cell_type": "code",
"execution_count": 228,
"id": "6b48aef8-11a8-4bba-8d52-114adb9cb2f2",
"metadata": {},
"outputs": [],
"source": [
"standard_stats = load_json(os.path.join('results/balance_example_standard_model','training_statistics.json'))\n",
"balanced_stats = load_json(os.path.join('results/balance_example_balanced_model','training_statistics.json'))\n",
"\n",
"accuracy_learning_curves = create_plot_ready_data([standard_stats, balanced_stats], ['standard_model', 'balanced_model'], 'accuracy', 'Response')\n",
"roc_auc_learning_curves = create_plot_ready_data([standard_stats, balanced_stats], ['standard_model', 'balanced_model'], 'roc_auc', 'Response')\n",
"loss_learning_curves = create_plot_ready_data([standard_stats, balanced_stats], ['standard_model', 'balanced_model'], 'loss', 'Response')"
]
},
{
"cell_type": "code",
"execution_count": 229,
"id": "a06aaa41-e692-4348-aab5-6bd78711f6f1",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAmkAAAGICAYAAAAagXdoAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAACwC0lEQVR4nOzddXhb5/nw8e85OkJLMtuJw3jCaaCctiljSiutTNu6des6eLeO6TeGjte1a7uuzExpm2KKYT5hTswk1oH3jyM7dmKQSbLj53NdvhLr0K1j2br1wP1IlmUhCIIgCIIg9C9ytgMQBEEQBEEQDiWSNEEQBEEQhH5IJGmCIAiCIAj9kEjSBEEQBEEQ+iGRpAmCIAiCIPRDIkkTBEEQBEHoh5RsByAIQs+pqroCmAkcpWnaZ1kOJ2tUVf0p8G1N0/zZjqU9qqq+A4Q0TTsvQ9c7Cfg6cAyQC2wHHgH+rGlaOBMxCILQPaIlTRAGOFVVpwMzgHXAzVkOR+jcV4BvZeJCqqp+F3gbkIDbgAXAA6nrL1RVNScTcQiC0D2iJU0QBr7rgJXA/4Cfqar6TdFC0n9pmrYuE9dJtaD9GviNpmnfb7Fpkaqq7wOLgW8Cv8hEPIIgdJ1I0gRhAFNV1QFcid068jjwB+Ay4P4W+4wCfg+cBljAO8A3NE3b2dl2VVWvT52rWNO0qtT+eUAtcIOmaf9NdTGeB7wP3ACs1TTteFVVhwK/BM4CioFK4Angu5qmxVPn8gI/Bz4P5AGrUtvfV1X1aUDVNG3aQc9ZA17UNO3bPbhvnwe+D0wA9mB3/f2txfYgdvJyITAUqAdeAb6uaVpdah8L+AFwFTAE+FLqPvhT9+KbQAnwMfAVTdPWp457h1R3p6qq87Fbuk4EfgPMAfYCv9I07T8t4pkJ3AkcDZQDPwZ+CjykadpP23ma38a+5z8/eIOmaR+pqvpjYGvq/E1xHKlp2pIW161L3Zufpl4LfwB+C3wXqMb+cDBT0zT1oPu7BFinadq1qe9vA74GjAQ2Az/XNO3xFvufk4pzChACXsLutq5p57kJwqAgujsFYWA7HTuJeFjTtL3AW7To8kwlGx9gd4d+BbgemAS8qqqqo7PtXYhjJnAk8DngV6qqysBrwGzgVuBM4EHssVFfbHHcY6nvf4edEJWnrj0eu2Vwaqo7t+n5HAlMTJ2rW1RVvQ57TNa7wPnYCe6dqqr+vxa7PQJcANwBnIGdnFwJ/Oig0/0U+DtwC3ZiBnaye13quV6NnQj+t5OwHgWeBs4BlgP3qKo6JRVvKXYC5QWuwE6S/gqM6OA5StivjUWapsXa2kfTtP/TNO2RTuI6WB52In4VdoL6H2CiqqozWlx7DHay+Wjq+58Af8T+WS8A3gAeVVX10tT2UcAz2C1752B3xS4A/tHF2AThsCNa0gRhYLsWWK5p2prU9/8DHlRVdXKq5eYG7FaeiZqmbQNQVXUX8Cx2MnZaJ9vTpQC3N7XCqKo6Aru17TZN01al9lmkqupZwEnA31KtQ+cD12qa9mDquPewk5TjsROlSuzk6Hupc1wFrNE0bWUXYmuWSh5/hZ3UfjX18MJUq9iPVFX9J2AALuAWTdNeS+3zjqqqx6Vib2mhpmn/anF+gABwrqZp+1KPDQP+oqpqoaZp1e2E9ldN0/6U2n8ZcBFwNvY4w9uwP1Cf3aIVrwp4qoOnWgS4gR0d3Y9ucAA/1jTt9VQcDuzE+lLsVlCAy4Eq4I1Uq+sdwG81TWtKcBeqqhrAbjl8EjgqFetvWtyzEDCql2MXhAFHJGmCMECl3uguAH6dejMEWAREsFvTvgUch939uK3pOE3TVgBjUuf4cSfbj+xCSOtbnGMXMF9VVVlV1QnYrV8zgVJgZ2q341L/vtjiuAQwtcVzfAy79eh7qYTgCuxWme6aCJQBL6uq2vLv36vY3W1HaZr2NnbrGaqqjk4dMw27K+7gVqm2xpftaEo2Unan/s3B7i
Jsy8dN/9E0rS6VpDQN6p8PvNOUoKU8B+jtnAvsRBP6prek+TlrmmaoqvoEdpLWlIRdBjypaZququoxgIe27/eNqVa3pUAc+DT1834ZeEHTNANBGOREd6cgDFyXAj7ssVO1qa89qceuVVXVBRQAFR2co7Pt6QofPFlBVdWbsMdXbQTuwe4OjWLPNGy6dvKg5ONgDwCjVVU9FrvVrxi7ha27ClP/PgIkW3w1lS0Zmor9fFVVtwDbgIexuw4jLWJv0ta9ixz0vZn6t6O/t20d07R/EXaLYrNUAlPV3slSY7lC2GPA2qSqaomqqs4OYmrPwc/5Eft06nRVVccBs0h1dXLgfn9I6/v9ZOrxoZqmbQVOxR7f9jXsrt3dqqpe1I3YBOGwIlrSBGHguhb4FHsQd0tTscdJXYA94H3cwQeqqno2sCyN7VbqoZYJRqc1yFIzC+/BTiD/rmlaZerxT1vsVg84VVXN1TStvsWxxwK1mqZt0DRtqaqqa4FLsLsRF2matqez63eg6Tq3Yt+7g21Ltfw9iZ0gnqRp2u5UXE9gt6Zl2l7s5LRZqtu2sO3dm70BnKyqqivVQnmw++1TqRNo4+ecGtfWaYkOTdM+VlV1K/Z4xDiwC3ucIxy43xdxoEWx1eGpcywGzlNV1YedsH0HeFJV1ZGpsZaCMCiJljRBGIBUVR2JPSPwQU3T3mn5BdwF7Mfu8vwQmJYanN107GTsmYoz09jekHq4rMXlT0gjxGOw3/j/r0WCVgZM50Br1Iepf5uLuqZa/57ATkCbPIidcJ5LDyYMpGzA7nIcrmnakqYv7ITnF9jFXmdjj0n7TYsELQeYx6EtaZnwHnbXcbDFY2cDnbWC/Rl7dunBkx2aZnOeCTyiaZpF2z/nY0j/g/xj2D+fi4HHUucE+AS75azkoPs9DXuGqqSq6s2qqm5VVdWpaVpE07QXgR9ij38blub1BeGwJFrSBGFguhY7CXr64A2pcUKPY3cdfRH4BvBSqlSGgZ2MfIo9fu3TTrb7scdh/UVV1f/D7j77EXaLSUc+w/4Q+GdVVZ9MHfcD7AHivlScy1RVfQl7EkEQuzTDLditN/9uca6HsAf7x7BnAXbGqarq7W08vkrTtEWp5/mn1CD/t7DH3/0a2ITdvamk7sNvVVX9F3Z347exJ1h09rz7wl+xf5Yvq6r6W+xWtV+ltpntHaRp2nuqqv4e+KGqqpOwuyVD2En2N4GPsEukgD3ofw/wC1VVk0AQe4xe/SEnbtvDHJjc0Tx7V9O0SlVV/wr8UVXVfOzX1RGp6z6vaVpDarLI37Bbzv6JnSD/EPtnsSLN6wvCYUm0pAnCwHQ1sPigAeotPYz9+30jdovbZuwyEP/BfuNboGmanhoP1tn2y7ATg5ewuwmvwX6zb5emaYuwE4HzsAeJ/wh7NuLPgVmqqrpTu16O3Tr2E+wZpQXAqZqm7Whxrj3YScSzmqZ1eN0UF3ZNsYO/Lkudr6lkxvnYLYY/x+7ePFfTNEvTtI3YSfCM1PbfAUuwS5SMTLUIZkxqRujp2D/Pp7Dv5TdSmzv7OXwHe7JFEXbi+xx21+P/AWc21atLjXG7jAOJ8E+wuxw3pxnjOmA1sFHTtOUHbf4OduL/BeyyLF/HbuW7PnXsRuySGyWp5/cI9ozR0zVNS6ZzfUE4XEmWZXW+lyAIQpaodlHcXcBZmqa9me14Mi01Rs+nadpbLR6biD2e6wJN017IWnCCIPQp0d0pCEK/lJopeDV2kdv12F2Tg9E44D5VVb+H3Y1cit11vBFYmM3ABEHoW6K7UxCE/koCbseutH91i8Hog4qmaQ9hdxneDLyOXYl/DXBye6sJCIJweBDdnYIgCIIgCP2QaEkTBEEQBEHohzI2Ji1VfPGf2LWX4sDNmqZtbrH9KuxlbAzgvqb18FRVXc6BaeDbNE27IVMxC4IgCIIgZEsmJw5cCHg0TTs2tZ7bH7ELVDb5A3al9BCwLrWGWxRA07T56V7ENE3LMD
ruwjVNu7SQLIuGxM6Ie5UecZ/SJ+5V+sS9So+4T+kT9yo9mbxPTqejioNWFWmSySRtHnaNnKZlROYetH0VdrVvHXvAsIXd6uZTVXVhKtbva5r2MR1IJg3276/pMJBo1F4mz+v1df1ZDDLiXqVH3Kf0iXuVPnGv0iPuU/rEvUpPJu/TqFGlO9rblslUOkjr6tWGqqotk8Q1wFJgLfBSqohmBLuF7Uzs4pMPH3SMIAiCIAjCYSmTCU8D9gLJTWRN03QAVVVnYK/7Nga7u/MhVVUvBV4ANqem3m9UVbUaaCps2SZZlvH50st8091PEPcqXeI+pU/cq/SJe5UecZ/SJ+5VerJ9nzLZkrYYOAcgNSZtdYtt9djjz6Kp5UkqgHzsJW3+mDqmDLs1rr1lcARBEARBEA4bmWxJexY4XVXVD7HHnN2gquqVgF/TtLtVVf038IGqqglgC/Y6ggD/VVX1A+wxajc2tb4JgiAIgiAczjKWpGmaZmKPK2tpQ4vtdwF3tXHolX0ZlyAIgiAIQn8k5uAKgiAIgiD0QyJJEwRBEARB6IdEkiYIgiAIgtAPiSRNEARBEAShHxJJmiAIgiAIQj8kkjRBEARBEIR+SCyx1A2RaJS2lnB3OJxIsoyh61iWcch2SXLgcTuRJanvgxQEQRD6DcO0CCd0IgmDIUEPAJsrwzgdEoU5LnJcDiTx3iAcRCRpXfTy2nKOWnQxU+RD10M9J/4r1lmj+bHyP25UXjtk+8+T17Bu+JX87ZLpmQhVOMzsb4jxwdYaNlWG2VEboaIxTm00idvhoDToJp402FEbPeQ4WYJcrxPDtGiI6ZiW/RHDskCSQJYkCnxOctwKsaRJ0jBxOiRcDhmXIuNRZEbl+yjL9RDXDWqiSXxOB4osEddNYrqJz+mgyO+iIaazdl8DScMiYdjnSpoWsiRRFvQQN0y2VoUxTAunIuOUJZwOGZdDYvaIXHK9LvbVx2iM6bidMl6nA2/q35lluYwu9FHRGGNjZRjdsNDNpi+TYr+b8UU5NMSSfLStFt2yMEz7SzdN3A6Z2SPyiOsmizZVEYrrdnyGRdK0/51SGsC0LLZUhamJJDEtC9MC07JwyjLTygLMHZGH1+kgrhtMLg0wssBLid+N09G7HRO6adEY1zFNC11KIskSMiDLEhJ2dW8rFZtlgYUdKxaYqf+bpoXR4j6Ylp0sND2mp553Qjeb/6+bFrkeJ7OG5+JSRGdLE8uyiCQNTAsaYzpaRYglO+uoiyZJGCYJ3SRhmBTkuDh6ZD4Jw+CBT3cT0w2iSZO4bgLgdcr86rzJyJLE797azJ76GAAuh0y+z0lhjosfnDGB8UU5LNlZx576GIU5LgpzXBTluCjwOXv9tSb0X5JltdUmNHAlk4ZVVxfpcJ9IxN7enTW5KhrjfPri3/Emqg/ZtnvkRVg5pfj2vE9hw9pDtr9nTOOT5Fheu+WYLl83W3pyrwaT3rhP++pjLN1dx8aKMNtrIuxviFMTSRD0OIkmDarCCQAkTKZL25ghb2WWYxsjlVpqPSPY5RjBvZH5GLJCy8/jbkXmiGG5yDIs31VP3DCxsJf9MFJJzqzhQSxLYn15I5WhRHNy0sQhgdHNPxWyBE6HzITiHFwOmc1VIZLGgYShKclwKzJx3WyzlbovSRxIVocE3PhcDhpiOjHdQJFlFFlCkSXCCbt1vDaabPM8XqfMkSPyuWx2GbkehRW76xkS9FAScGNZdsLldTqYURakLprkrsXbqYkkqY0kaYzrhOI6cd1k2tAg1eEEm6vC6Gb2/j4X+JycOrGI648eSYnf3afX2lDeyOPL93LdkcMZXZjDok1VvLK2nKBHIehxkutVyPUoTB4SYHJpgLhuUpv63fA6ZaJR+8NJy98/07KTz5huJ0j2l8GEYj8AK/fUUx1OENNNokmDxphOY9zgwhlDcEgSTyzfw9ubqgglDKJJg2TqF6ApQc4ESbIT8YONzPMwJOihPpZkd10MWZJwyKT+lR
gadDNlSBDdMFm+ux6HLKE47A9FMhaXzizl9CllGXoW3ZSMoNRuQg6XI8XriY87F5yZex/K5HtfcXFgKTC3rW0iScugpx76M2Nr3mH2bc9l9Lo9IZK09HT3Phmmycvrynng093sqj20G93rMDk+WMXJ/l0UKAmWDbmcycUeLnxrHrIRw/QWYviH4ajfBkD1zetAkgi+eA1yMoyePwGjYAJ6wUSM/AmYOUPsv/xpanqjc8gSkiRRHY6zuy6WaoWy8Lkc+FwOCn1OSgMeHDIkDAunw05uunqvLMsiljQIJQwaYkkaYwahuE5jQkdGwrDsLqNo0kCRZJwOCcUhocgyPpeDgNtOUBOGidNhb7f/td+gmloGPU4HbsWOMZ0uJt+nf0QvmkZi7JlEkwabK8Os2lvPxsowO1IJdV1Ux+jk76nLYV+vqVWlre1jCnwU+d1EEjoyFg5ZQnY4aPpbrZb4CXqc7KyNsKc+1vwmbqZa0yYU5TC2KIeqUJwlu+pRZKn5TVyRJUoCbo4bXQASvLWxCkUGRZaRJcm+n7LE7BF5PLF8L5/sqAVg6pAAt88fwxHD8jq9V+nSDbtF85Gle1i7vxE5lZB4nHYsCcPEssBIJfEAE4pzmDsij8a4zktrywH7A4TX6cClyIzI83Ll3OEkdIMfvaK1ed0vHTcS3YSX1pZT3hg/ZLsMHPzT8SoyQY9Cns/JxBI/M8uC5HkcHLHhdyiBUqpn3orX6bA/+Fj269iwLEzTjt885P/2h5SE0TKBtJPIuG4SS7W8hVO/Bw0xnca4TjhhEEno+N0KiixTH01QfVCLr2XZP2eHLJE0TAxDp4Q6hkrVyJgssSaR61F489bjeu1nmTYr1dQryThqN6PsX4YjvB85XI4c3o8c3k9s8uXEpl2Lc9d75L1wYMGhxhN/SWz6dRkLtb8kaaK7M4MK9X2c6VjCjlgcn6dvP5kK/d/+hhg3PrKCynCCqUMCXDhjCH63wrR8i/n77yO/fg3O6rVI0RhEwQiMYO6x3wFJosH7AEZwFGZgWPPHbSlW25yAGQUTkcpX4N7yMvK6uuZrVl/1PmbeGNza08iRKpJDj0QfMrvdGGVJwuN0NH9fGvBQGvB0+Ly8PeiJkSQJr0vB61Io7uPWm3Q5qjeQ89md6HnjSIw9E6/TwfSyINPLgq32syyLylCCHbURtldH2FgZZktVmH0NMRSHTL7XSZHfxcg8H4U5Tgp8Tor8brsby+ci6FUOGa/a128U88cXtXwCuHYswrPmARpH/IPjRk/lg3de4IENsGY/fOGxVZT4Xdx0zEjOnz603SQ8He9vqeYnr2o0xg8sxTw818sJ4wqxsAgnDKIJg0jSIJKwE/VQwqA+FGbZmh1oCTvu2xzPMFfWGMN+vEacZZUT+OTlyTxjzAOCbV773x/uBECRwa1IKJKM1+VgWNBNWZ6HobleyoJuhgY9lOV6KA3YXdlyeD+uHYtQypcTmvY7kCTyVm7EufMxgtPOxcif0u370W2WiRypRA7tRQ7txcwZgj5kDo7azQQWfct+PFyOZNlpZ7xoGl9w/o6PdtQRSxqtfrf7lJEk7/nLUCrXUH/2PSRHzse1bSH+j34FgOnOw8wpxfQPwXLaLZ168XTqz74X0z+E3JeuRalclZlY+xmRpGWQmXrxVdXWMnLokCxHI2TTwg0V/PhVDcO0uHREmP9z/wvLzKPxxH+BkaTwk+cxCiYQnXo1evEM9NIjMHJHNydhyeHHtz6hJGF5C5q/DR//I/s/loUUrUap3YijZiNmcCQA7i2v4N72OgDR6dcTOu4HoHj7/HkPRJ71TwBQd/EzHe4nSXYrVUnAzZEj8zMRWu8wddybX8K37B8o1esx/ENx1G3DKp7OxXt/x6XspGbITB6KHM1/G2bz6zcT3P/JLi6ZOZQLZwwl1+tM6zIrdtfx4bZadtRG+WBrNQnDIt+rcN60IZw1qYQJxTl2q6ZlgiSDZeFddS+Oum046rfjqN+GnN
gNDqj4+iZipkL+Gw/jaIAGzxyikpMT6lZyRngpCy6+CTMwnJKtT+BMNhAfegxGyXQUxYnikHFIpNWCqlSswrX0DVzb38KZShKMwHCkaDWWr4j68/5HwYPHkvPZnTScfU+Pfgxd4VvyFzzrHkMO70cyD3S/RydfQWjIHCzFh+XwkBw+D8NfhukfiukvwwiO5IvrlnP+7mdZsnMK88YVdXCV3qNUrcG57zNi4xfYrflAbPLlxMedi5lT0ubfHsuTT2LsmQDoxdNQKtdkJNb+RiRpmeS2k7S6epGkDVamZfGzVzVeWV+BLMEdp4zmps1fxlG5ndjUa+ydHE6qb1plv1H1lCRh+YpI+opIDjvQvdFwzr1I0Rp8S/+Gb+U9OPd8RMMZ/8AonNTzax5OjASejU8TH3s2lrcQ19bX8Kx/goaz7wZ54P/5dO54m8B7P8TRsAM9fzwNp/yJ+MQLweECoO7CJ3Fveo6g9gy3Je7ma16F7XnHcIfj2/z9g+3844PtHDUqj9tPGsf44pxDzh9N6Pzn4528sGY/dVG71Szfq3DRjKGcMamEGXlJnFWrce58CWXpKpSq9ZiuAHWXvwaShG/p38FIYOSNIVk6C2PiRRh5Y5Cxu9rj5/4bgESqxTHk8xEJ72dYKhEIlr/X/GHEdOagD51LouxYYpMvx/IVH3pDEmFc+z4hMfJkkCT879yBUrUGfcgcQsfcQWL0aRgFavOHJcuTR3TGjeQs+QuOqnUYRZlpTYuplyI37MTyFqaSMPvLCAy3n2ugjPoLH2/z2ImuRcxzvMc173/KvHHnZCRe5/6lgP3h0fTbY+EsbyGWtzCt4yMzv4CkHzopajAY+H9lBhCH225+DzXWZjkSIRuqwglufXIVW6sjBN0Kd18xg2l7HsNZsZKG0/9uvzk26Y0ErROWt4DwvJ+QGHEiwbe+Qc5Hv6LhvP/1+XUHEteOt5Cj1cQmXwGAlAzh3r6QnA9/SXjeT7IcXfdIiUakaDVm7mgslx/Tk0fo+B+SGHPmIa87MzCM6Oxbic6+FUfVOjwbn6Wsfjt/P3sujyzZRc7Hv+a9XZO5+n/VDMvzc8vxozllYhHLd9fz1/e2sqE8ZE9SkeCYUvjy+HqmD8nBGD0eObyfwvsPDMPRc8eQLJmBUXgg0am56j0sV6Br4yhzDnwAbjjnXuRwOc69n+Dc+zHOPR/j//g3xCecjwV41j6EHK3GdAVw71iEc89HSEacmivfxcgfR+Mpf8DMGdKqlfpg0Zk34111HzlL/kLDWf9OO85uMxKYgTJCp/yxW4c3DW/Ir12FbpyFkoGZosr+Zc3JZHckR87v3YAGEJGkZZDTZydp0VBddgMRMu7DbTX87DWNUFznmNH5/P78Kfii+8j5+HfER55MfMIFWYstOepkaq54o3nciqN6PaavtMM3psFCjlSh508gMfIkAOLqJUTKV+JbeQ96yQziEy/q8xiUhh3gHA3OnnVHS9FqvCvvxbvmAfTCSdRf9DT60COpu+SltJIgo2gK4RYtRVdPUshf/i43Sy9QTR7Ph47hsVeO5yevjSNpQDH1fMP7AacF9zDR3IJSvwuWQrJ4BnWjT8H0lRKa9zP0wknoxdOx3IeOIWvrsa4yc0qJTzif+ITzm++D5bFf287dH+LZ/AJgJ4nRadeRGH0qRmpYQDotY5Ynn8hR38JKddF2JaHsKueu9wks+hb15z2IUah26xzJ/InEJTez5E28tqGC86b2fa+Oc/9SkkPmdP8Epo5bexojbyz60CN7L7ABQMzuzKBVm7fy8IsvcOr8szljVvd+wTJNzO5MT3v3KWmY/GrhJl5aV86oAi+/O38KYwvtbiHvirvJ+eQP1Hx+EWZweMZjbpNlkv/oKUjxRhpP+zPJESf0+iUG3Gvq4DdeI0nu81fgrFxJ7cXPYxRP7bNLR8JhRjw8BzkZwvAPxcgdY3/ljSU6/TpQPAfGcLVDbtiNb8VdeNY/BnqcxLizicz6CnrpET
0P0Ijbkw20Z3BufxPZTLIoeDF75v6Q4wvCjH3qBIzgKJIlM9CLp9vjK4unYXnyen7tFnrympJitUiJxubxmv2WZZH31HnIkSpqrnrX/tl3QyQSIfeFK9leWcNPi//KPVcc0btxtkEO7UXSYxh5Y7t3Asui8N5pxMcvIDT/N70bXHvWPElgw0OEzrkXy9e3Y/fE7M5+orB4KIvM2cxz5mY7FCEDdtZGue3p1eypj+FWZH50xsTmBA0gesQXiY87DzPQj+oVSTINZ/yT4MJbyXvh80Rm3UL46O80j1EaTBzV6zEDI7Bc/oM2OGk46y7ynzib4BtfpfaKN0HuxVlypk7O4p8Tm/J58Iyg+rhf4IvtxVG3FUfdNtxbX0FKhInOvBmA3Oc/j6NxVyp5G92cyCXLjsZSvOQ9exFypIqYejHRWV/GyB/fe7E63CTGnk1i7NlI8XrcW17hqGg10cklYFlU3bQay9O/J1FYnvxeiVGK1eJb8ldik6/oditXR1xbX7GHRpzyp24naE2kstlMrbqXreU1vRRdx7rbzdlMktCLpqFUru6dgNLgLl+Cq3pD1l+/IknLIK8V5WbHy4R2hmHa2dkOR+hDL63dzy8XbkI3LUbkefjXZTMpDdglJaRoDc69H5EYe07/StBSjKIp1F76Cv7FP8e3/C6cuz+k8Yy/d/9T8EBkWQRfuwUzMIz68x85dLOvmIaz/2N/05sJWjJC8PVbcO9YhBkYDhOvJjL2XDiohUiKNzRfNzH6NJSKFXYCpy1DTjQCUHPFmxiFk2g85U6MvLF9/lqz3Ll2YtkcpJT1N7hM86x7BDlcTuOZ/+zdE5s6OR//Dj1/InH1cz0+XWjC5/jznvFE9sLGihATS/ydH9RN3pX3Ijfu7vEYTr14Gt5V94ORBEd6M4p7wl21ikTRtN79/e4GkaRlkF+BHzof5r4KHyCStMNROKHzmzc28dqGSgBOV4v56Vlqq+V1/It/jnvTc9Rc9V7/7WJxegnN/zWJkScRWPRtHDXaoErSlP1LUeq20DD7K+3u09xdaOq4dr5LYvSpPbqmFK1O1YNaTeNJvyE27WqItD10o+VYregRX2ixwS654qjfhpE3BoDkiHk9iktIj+XJJzb9BrzL/kHkyNsxCib22rk9G55EqdtC/dn/6ZWkQc8dwzHzhnH/E6vYUN63SZp784u9ch69ZAaSmcBRu6nvZ9FaFsm8CSTye+9n2F1iAbAMcqUmDih6OMuRCH1hQ0WYqx9cxkKtkjnDc/l/p4zjl+dOapWgOXe+i0d7isisr/TfBK2FxNizqLnmQxJj7Q8VnjX/Q4rXZzmqvudZ/xiW4iM+7rxO9/Wu/i+5L1+Ha/NL3b6eXL+DvKcvQKneQMNZ99gJWnekSq7oQ48ER/8oBjyYRI74IihefEv+0qvnNQIjiE663J6B20uObXyVb/pe4/2thy5x2GuMOErFqp5NGkjRi+01r50VGShqK0lUn/AbGqfd2PfX6oRI0jJJVohYblxGKNuRCL3sox11fPmZ9TTEdP592Uzuunwml80a1rpgZjJC4N3voeeNJTL3tuwF20VNrTaOmk343/8x+Y+dgbL30yxH1YcSYdybXyQ2YQG4Dq39dbDotGtIls4m+NY3cVS3vRRRZ5zly5Dj9dRd8HhzAU9h4LG8BURnXI970ws4ajb12nmTI+YROvWPvTpz1LV7MVfzMu9urqI20vZ6tD2lVK5BMhMkO1jVJF1G7mgis29FL5rcC5F1TA7ttYcU9AMiScuwEF5cxuAsyne42lwV5kevb8G0YFS+lxnD2i4bkPPpH3E07CQ0/7c9HvibDUbBBOoufhZkhbznLiHwxtfwv/1dlNQnW+eu9wi88TUCC28l8NotBF/7IsFXbsKz9iEA5IZd5L5wJSWvXUdw9b3ZfCodcm95GTkZbq6N1imHm4az78Zy5hB89eYutTTK9dsBiE+8iJqr3kcf2uYEL2EAiRzxJSxnDs7dH/T4XF
K8gcAbX8NRt7UXImstOWQOBWY1Q6nm8eV7ev38AM79ywDQe6ElDUkmfOz30Etm9vxcncj56NeUPdd5K3omiCQtwyJ48Vqiu/NwURtJ8LUnV5E0LKYN8XPXZTMPWX8RaC5VEJ1yJclhx2Y+0F6il86i9vLXiU26FNeu93FtfwM5bC90LUcqcO5fhlKxCqVmA47aLTgadiBHU8WbLRMpEcIRqSB/yW9x7l6cxWfSPiN/PJEZN6IPST9hMnOGUH/Wv3E07iLw5tftshidcG94ioJH5uPe+BxAr5elELLD8hZSc+1HxGbc0ONzeVf8G8/GZ5GSvf+e0VTUdpa8mbc2VvX6+cEe22kEhrcqMNwTcmgfbu0pMPXOd+4BpXw5iaLpfXqNdImJAxn2iutMKsxg2wVRhAEloZvc/swaqiJJinOc/Prs8a3Gn7XicFN72atIffzHJRMsl5/QKX/k4E77uHoJcfWSdo8zc0dRd8kLRBtqGPrsOfjf+wG1ly/sd+U99CGzO1x0vt3jyo4idPxPcG9biJSMHFq6o4ll4V32D/wf/4bEsONJjDqlhxEL/Y3lyQfLRClf0a3XEoAUqcS34h5i489vHo/Vm/TCKViKh+OlLbxaewyGaeKQe7fdJnz8j5HD+3rtfM49iwm+eTs1xTN6dWJGS1KsFqV+O43jez6LtjeIlrQMW152FYtcJ2c7DKGHLMvi129sZF15CLci84cFEwl62v7M49r8EnL9DlA87b9xDyKW4qHmmB+j1G7Gu/q/2Q6nFbf2NK5tC7t9fGz69dQveMj+ObfVmmYa+N//Ef6Pf0NswgXUL3iwV6rqC/2Pd/m/yHvmwm53VeYs+QsYcSJHf7uXI0txOEkWz+Q411ZMC97Qer81zQyU9U5XZ4qeat1SKvtu8oBSvgKAeHHfd6umQyRpGVas72NIZEO2wxB66OGle3hpXQWfmzGEP144ldH5bS/ZI9dtI/jm18n5+LcZjrB/i42YT8OpdxKdclW2QznA1Mn56Fd41j7c/XNIEsgOlMo15D962iFv0DmLf4539X+JHPElGk//W79rRRR6T2zSZeBwdWump1y/A8/ah4lN+Xyflr6JHPVNIsf/EICFGyp69dzOXR/gf/cHSLG6XjunkT8OS/GgVK7ptXMezFmxAguJROG0PrtGV4gkLcNOrbifP1h/ynYYQg+8v6Wav7y7lfnjC/nOaRM4elQ7BTsti8A7d2A5XISP/1FmgxwA4pMuBVcOUrTGXnopy1w738URLic2+fIen8t05yFHqwi+chNS4kDHcGz6dTSe+H/266GDpZyEgc/yFROdei3ujc92uTVNqVqD5QoQOfL2vgkuJTn8eIonn8SE4hzqY707FMO14y27lI2zF5d/kxX0oql9uvKA5XCTHHkSyA4Cax/o8/FvnRF/JTIsqeTgl6JE4gN/bNJgtLkqzB0vrgNgYrG/7UkCKe4NT+Das5jwsd/H9A/NVIgDiqNyLQUPHodry8vZDgXPhscxPQUkRp/W43OZweE0nPkvHHVb7Nmub30TklGMvLHEpl/f82D7MSlcgdy4Fzm0DzlcjhSuQIpUIUVr7HUyY3VI8QY7eU1GIBkFPQZGIq0JFwNJZNYtdmva0r916bjEuHOpvu7TdgfcO6rW4dz1Hug9rBRgmXiX/5ubizewam8D1aFEz87XgnP/EnsmZi+3FuvF01Aq1/bZayU6+yvUL3iInM3PU/DpL5HDvdvC2FVi4kCGGUoOfqJsbowxxi3GJw0ktZEEX3tqFQnDYkqpn+uOGtHuvlKkEv/in5McehSxqf2oS6+fMQpVjNxR+D/4CbUj52dtzJ4Urca17Q2i06/vtTeV5PDjCR/7A/wf/gLTFUCZcUPXBoBbFkr9duRkGMXjBqzmx9v8f/P32I/JTvt6GVrWRopW43/vh3h6UGHecrhJlh1NYuR8EiNOsgeH92JtsEMvaNllUBzuPlk2y8opITr1Gryr7iM85zbM1CoQHX
FveoH46NOgnRYoz7pH8b9zB5Jl2Pdr2LEH7lf++K7dL0nGu/p+TvBPAa7nb+9v46dn98K6o3oMpXIN0Zk3NT+0dFcdk0r95Lh6lnYkRp2KJTvtxL43W+kAklEkS8dyBfDsXYyeMzTrH7BFkpZhliuASzKobggxpkgkaQNFQjf5xrNrqAonKfQ5ufPiae3P5CRVH8iyaDz5d6JbqyOyQuikX5P39AX4Pv0D4Xk/zUoYno3PIZnJXunqbCl6xBex3AGSpXPSW3Tb1HHu/QTX9jdwb1uIo2Fnj66fLJlJ6MRfHljCqo+4Nr9E4N3vIyUaicz+KkbuKLuloymJtEy7BItlpr4/8BjY+0mWhRStxLXrffyLfw6A4R9KYsSJJEaeTHL4vJ6XKTHiKBWrce5fgnPfZzj3L0WOVmHJLurPvY/kyPk9O38bIrO+jF44yV6LtRPKvs8ILvwKoeN/TPSIL7beaFn4Pv0DOUv+QmLkSUSn34Bz1/u4dr6D/4Of2k/PP4zEyJNIjDzJvl/u3E6vmRwyh9K9nyJL8MmO3llw3S5imySZKmOzdn8jtzyxis/PHsY3Tx7Xo3MnRp3SZzOi3dsXElj4VWovX4hn38dER53etx8S0iCStAyT3AEA6hvqgN6pHSP0Lcuy+OXCjazdH8LpkPjr56ZT4Ou4tSUx9kxqhn+C5QpkKMqBSx8ym9jUq/Guuo+4egl6ceYH7EanXoWROxqjcFLvnliSiE25suNdEiGcO9/BvW0hrh1vIcfrsRxuEsPnUTf1ZgxfMW63G5BavGG0eOOQJKyDt0kSjsY9+D75A3lPLSA25UrCx97R6wueS5FKAu/9APeWV0iWzKTxlD/2+B6GAblxD65d7+La+Q7uLa/iXf84liSjlxyRSkLmo5cc0WkroRSpshOy/Utw7luCUrEKybS79PTc0SRGzic5ZC6etQ+S+8pN1J/3P5LDj+9R/AezckqINyX/ltX+m75l4f/o1xi+UqJTr2m9zUgQeOe7eDY8SXTy5YRO+g04nCRGn2bfr4ZduHa+i2vXO7g3v4h33SNYkgN9yOzmVja9ZEabHxj10ll4Nj3P3LwIn9b6aIwlCXh6toC5c/9SgObloP69eDsAT6/cy1dPGNPhB9x0OOq22sMHiqf26DwHU8qXg8OFlAzhSNQTLTuuV8/frZiyHcBgI+UNZ8nuiSSTvdf3L/Sth5fu4ZX1FYwu8PKVeWM6XIxYSoTwrHmQ6MwbRYLWBeFjvot766v43/8RdRc9k/lPr4qnxwukd4UcLse17Q1c217HtXsxkpnAdOeRGH068TFnkBhxErhyiKQWWJd8Xe/WSQLx8efh+/ROvKvuxb31FcLHfs9eSaGnrbuWhXvT8/jf/xFSIkzomDuIzroF5N55SzEDw4hNudJOcE0dpXwFrp3v4Nr5Dr7P/kzOZ3diunNJDD+B5MiTiBcfjeErwVGt4dxvt5Ap+z5DSa3oYMku9JIZRGfcQHLokSSHzMHyFTdfLz7uHPKeu5Tcl6+nbsHD6GVH9crzaMn/3g/B1AnN/02b2107FuHc9ymNJ/0anAdmi0uJRoKvfhHX7vcJH/UtInNvP+T3wwyOIDbtanvNVyOJs3wZzp12kpvzye/J+eT3mJ58EiNORCo9lsiYswH7NdWUSF1Rup9Pa8fyxPK93HTsqB491/jECzByR2L5ivl0Ry0fba/F53QQSRo8snQ31x/ds3WLA298DcsVoP6Cx3p0noM5y1egl8zAtecjAGJlx5HttWEkqx/MqupNyaRh1dVFOtyn6Q+frxt/+HpqY0WIqx5cxm/Pn8IpE4oyfv2uyua96g/e31LNN59by6kTi/jluZPaLfbYdJ9KlvwKz+oHqLvkxT7vYhqo2ntNOXe+g+krwSiaktF4chb/AikZbvfNs1dYFo6ajXZr2bbXcVasAMAIjrKTsjFnkBx65CFJTm/9/jmqN+B/7we49n5CsnQWoZN+1e0CqVK4gsC738O97X
WSpbPs1rM+Kiza5vVjtbh2vW8nIbvewZFa8cJUfMi6fb9MbyHJIXPtr6Fz7efayVJsUqSSvGcvQQ6XU3/+I90uQtuenPd/jHf1A9Rc9R5m7kFJkGWS//iZSMkINVe+Aw67JUsO7SX3petw1G6i8eTf2zOiu0iKVuPa9V4qyX0POVpJaNyFRM/6u72DkaDonsnUT76aI5acyrhCH49d33vl1i++91N21cW486KpfOPZtRT7XbzypWN6dE7/29/FveUlqm9a03sf6IwkRfdMIjrtOpTKVVixBvZf8FxG3vuKiwNLoe0a96IlLcP8bgWwaIyKlrT+zp7JuR6HBFfPGd5pNW5XxXI8qx8gOv16kaB1Q/N4IFNH0mOZmUSgR/Gse7TvWtH0GL6lf8Oz8TkcDTsAe5xY+OjvEB9zBkaBmpFWQ6NwEvUXPoV74zP4F/8feU+cQ2zatYSP/n/pj/OyLNwbn7Vbz/QYoeN+SHTmFzI2MaE5DE8+8QnnE59wfir53YC0+U0cod0wbC76kDkYuWO6fF8tXzH1FzxG3rOXkPvi1dRf+HivVvqPzv4K3rUP41v6N0Kn/KHVNufuD1Cq19Nw+t+bEzRH1TpyX7oWKRGyu2FHnNit61reQuITLyI+8SJ7Nucb38C37VWiyajdYudw0XjSr7EKVQrXRdhTH8OyLKRuvi7l0D58y/5BdMaNfFSfz666GOMKfcwbW8jE4hw2VobZXBlifHH3f7/14ul41z2M3LgbM9j+BK6uUGo2IBlx9MLJeFffT8OU63vlvD0lRjRnmLt6DRvd1xJa92q2QxE6UBtJcNtTq0gaJmOLchhfnNPxAUaCwsU/wPQPJXLMdzMT5OHINMh75mL8734/I5dzb30NOdGQ/mLqXSDXbyfv6QvJWfIXjLzRNJ70a6qvX0LdpS8TmXubPXYrk926kkRc/Rw1V71DdMYNeNY+SMEjJ+Fe/0Sn5Qzk8H6Cr9xI8M3bMPLHU3v566nuzcwmaIeQJIzCyTRMv4naY39CfNKldvHXbt5X0z+Uugsex3IFyH3hShzV63stVDNnCLEpn8ejPYXcsKvVtuSIE6m9+Fk78QScu94n71l7WaK6i5/pdoJ2CEkmPPZ8ZD2Ca8dbzQ/HJ1+GXjKTG44eSUw32VHT/dIezn2f4l39X6RkmF+/uQmAH5xht7TenOpG/fv723rwJGget9qbKw9I0RqMwHCwDHsS0bDsj0cDkaRlXF4gF5dkICUasx2K0I6kYfLNZ9dQGU6S61X480XT8DjbfzOSYrUMefVqXHWbCZ3068Gz9FNfDJWQHSRGnIBn4zMZWYDds/5xjMCIXl/03rX1NfKfOAdH4y7qz/0v9QseJjbtml5baLonLHcu4RN+Tu1lr2HkjiG46JvkPfs5HFXr2tjZwr3hSfIfPRXXrvcIHf9j6i56xi71cJgyg8Opu/BxLIebvOc/j6N2c6+dOzL7VkBuVTdNStXh0oceCZKMe8NT5L50Daa/jLpLXuj17v/YkKMwvEV4Nr9wIIZoDd4Vd3N6SQMAb26s7Pb5lf1LsRQvH4ZK2V0XY3yRj+ll9tJn88cX4nXKfLqzDt3ofp0zvXASlqz06soDyZEnUXPtxyg1G7AcbmIl/WOFbZGkZZjis6dEy8lwliMR2mJZFr98fSNr9odQZIk/XzSNkoC742NcQXRfCZUn/zWjg8+zRQ7tI/fFqyj612gK75tF/mOnkfv85wks/Co5H/wU79K/41n3GK7tb6KUL7dbDbpQdDMy56sYwVH43/0eGPG+ex4Nu3Dt/oDY5Mt6r0yKkSRn8S/IffVmjLwx1F72Wq8Ux+0LRtEU6i5+hoZT/oSjbiv5T5xFzvs/Rorbb9RyaB/Bl68j+NY3MApUaq94wy4Lke3Wswwwc0dTf8HjgETuc5cj1/Ws5af5vP6hRKdfB7LT/pCTCFPw+JnkfPhLu8TGkr8QfOt2kmXHUHfxs5j+3q/dhuwgPOpMXNvfal4NQz
Li+Bf/nGHVi/G7Hfzvs93dPr1z/1KSJTP539JynLLEj848UHpGkiS+MX8cScPig609KPeheIhPvKh370/S/hvl2vU+ybKjQen4736miDFpGdbUyqLoIknrjx5euoeX11cgAT85S2Xq0HYWv7YsvMvvIll2FPqQOVSdYg/CPdynV7g3vYj/3TuQjATR6dch6THkaDVypBJnw06kaFW7H0BMZw6Wt4igO5/Y0KNJHHdH8/ibVhQvjSf+H3kvXYNv+V1E5n69T56LUrESS/ESU7s+GLstcmgvwYW34tz3GdFp1xGa92Nw9I8/9O2SZOKTLyMx5gxyPvk93lX349n0IrHJl+FZ8yCSmSA072dEZ9ww6Or9GfnjqLvgMfKeu5S856+g7qKnMYOd1zrrTPj4Hzd3x/pW3YscrSQ+6lT873wH77pHiamX2PUV+3Bd18iYcwhueBjX9jdSyc5QDP9QlPLlqMVHsnR3PVuqwowr6mSYx8H0KErVWnZPuIFPVtbypeNGMWVI61nuC6YN4Z6PdvDsqn3M78HkucZT7+z2sQeTEo0U3jud8FHfQqnRCKmX9Nq5eypjSZqqqjLwT2AmEAdu1jRtc4vtVwHfAgzgPk3T/tXZMQOS4sWwJNyGSNKyqS6aZFt1hK3VYbZWRdhaE2FrVZiaSJJTJhTx1RNGMyK/7ZRLSjQSeOubuLe+SmT6DeipKeyHMyneYFeT3/gMyZIjaDz9r+0v/JyM2olbtMr+ilTZyVvqMRr2kLvq3ySqV9Fw5r+xvAWHnmLUycTGnYdvyV+JTfxcr7w5Hiwx/jyqRp3SK1XLnbveI7jwq0h6jIYz/kF8wgW9EGHmWJ48Qif9ktjky/G/9wN8y/5BouxoGk/+Q1pV8g9XRqFK3fmPkvf8ZeQ9fzl1Fz3Z89YbSQIjiXflf8j55HfER51CzrK/49r5DuG5Xydy1Lf7fKxivHQORs4Q3JtetCcUAMnSOTj3L+XCI4ewdHc9jy7dww/P7NqsXaViNZKpc9e2YjyKzBWzhx26jywxrtDHh9tr2VEbYVQ7f2c7lVotwvSVgKuLyeQhca9CMvXmxeATvTUGsBdksiXtQsCjadqxqqoeA/wRaPmX7A/AVCAErFNV9THg5E6OGXgkiRA+3FbHZUKE3lEfTbI1lYxtqQqzpSrC9poINZFk8z6yBE6HjGlaHDUyj5+eNRFvO0uXOGo2Enz1Czjqt9tVwWd+IVNPJWucez8m8MbXkcP7CR/5DSJzbmu7Baz5AC+mc3i7iVUkEiFny/MULv4h+U+dR/0597VZADV8wk9JjpzfJ8v1SJEqLHeg5wmaaeBb8hd8n92JUTCRhrP+PaDHa+klM6j73PMoVWvRi6YOutazthjFU6lf8DC5z19B7vNXUHfhU1g5JT06pxypxP/RL+3/1+9Aqd9O48m/67Twca+RZOLjz8O7+n9I8Xosdy76kDl4trzEmcNNfiLBh9u73h1pFExgycxf8+InhYwq8aaqGRxqwfQhfLyjjr+9u40/XNi9grRK+TLyn76A+nPuIzHmjG6d48C5lgPgaNyN6S3EKJoM0ViPztlbMpmkzQNeA9A07WNVVQ8elbcKyAV07HLaVhrHHMI0zebaQu2JRrObIN1W/CDVUZjXSZz9QbbvVVcZpsV9n+3hk5317KqLE9NbD04t8bs4emQQCXhlQzV+l4PSgIsSv4tSv4thuW5isRiWfuibk3f76+S9/10sxUv5WQ8QH3IURO1xDAPtPqXFSJC37C8E1/wHPTCCinMeJVFyBMST2KVSuycajRAtO53k2aMpfutW8p66gKoTf0901EFjt6QgjD4fojGkZAjL2XsTMore/RGuqtXsvfj1brdayLEait79Ft69iwmNu4CaY3+G5fRBL/5eZ+11lTOu37xJpaPP71NgIvHT76Fk4U0En7uM8rMfwvQc2gKcNikH56QrydnyIo7QPipOu4vY8JN69bXTnqZ7VT/8DHwr/wMbXiQy4W
L0vCn4AXZ/wvDcMnbWxaisbSCnnUSrbW6+tm4CDST400kj230vPn54Dh5F5sNtNTSGwjjkrv8OSt5R5Eky1t5lRErndfn4lnL2LiEZHI1z7ydEhx5LJBrL8O9e+4XPM5mkBYH6Ft8bqqoqmqbpqe/XAEuxVwV5RtO0OlVVOztmQPJ6fEQaRXdnX9hUFeGhZfvJ9ygkDJNcj0JRjpPhuR5G5ns4ZmQu04b4SRgmt80bibeDWZuHkBUSBZOomv9njH4wS68vOWs3UfTet3HVrKdx4mXUHvU9LGfPuhQOliieyf4FT1G86FZKFn2F2tnfoGHGLYckTcHV9xJY91/2XvRqr8ycleP1+Ha8TuOES7udoLnLl1L0zu044rVUH/cLQhMvy/oaf0LfipfOoeK0f1Pyxs2Uvn495Wf9D9Odl9axcrwed8Uy3BXLcVcsw1W5CtmIoXuLqTjtbhJFvbu8UToSxTPR/cPI2fYK4QkXkyicSt2s20nmTWD+OBf/W7qPxTvqOWNiYXontCwib/+SoaGxBIqOYEIH49kkSeLEsXks3FjDi+squXBa11smLaePZO5YXNVru3xs6xNZuCpXkSicgm/3O0TLepbw9bZMJmkNtE4X5aZkS1XVGcC5wBjs7s6HVFW9tKNj2iPLctoVgrNVRf/8qruoj8Tw+f6Tlet3x0BZcWD5fnvq+F2Xz2RMoa/dgozpPhspXIFHe4rorC/DpAU0qufi7qALaKDcp3ZZJt5V95HzkV1KpKkrwdv5kV3m8/nAN5aGzz1DYNH/I3/ZnXgbttB4yh9bLYvD6BNwLPkdRav/TviEn/f4up6tTyEZCYwZV3f952VZeFfeQ85Hv8L0D6PuvBcwi6f1+YSRAf+6ypA+v0/jTqbBeR+5L9/AkDe/QP35j2K5D5pclBor1byI+74lKLUb7U2SA714GrGpV9nLUw0/HsWTn5UZfD6fj8SEBfbYODmO5csnedy3cQE3FCR5ZPl+dtQn076ncsNORu34H1PkGzj3nM93etzt8yewcOMnPLmqgiuPGt2t52CWzMC9Z3GPfu5SrA7J4cDhtCf5SONObXW+bP/uZfK1sRhYADyRGl+2usW2eiAKRDVNM1RVrQDyOzlmwBqu72C01JDtMA5LizZWATAiz9PtitlNlH2fEXztFuREPYmxZ6WKZB6+Y3Tk0D4Ci76Fa9d7xEedSuMpf2i1vmGfUbw0nv439KLJ5Hz0Gxz122g4+97msWh66RHEpl2Ld/V/iU+6tMdV4D3rH0cvnIJe1LWF3KV4PYFF38K99TXiY86k8dQ/YblzexSLMPAkR55Ew9l3E3z1ZnJfuob6c/+Lo25rKiGz1w2Vo/bfIdOdS7J0NvGJF5IcOpdkyRG9MlGlt8THn49v+V24t75KbMqVyA07cW9bCNOuY86IXN7eWMltJ45J629pxYYPKATq8md2XvwbKA64GV+Uw5aqMNXhBIU5XZ/NqhdPx7PxGaRwRbfHCVqePGqu+4zcF65Ez5+A6R/arfP0lUwmac8Cp6uq+iH2mLMbVFW9EvBrmna3qqr/Bj5QVTUBbAH+iz0+rdUxGYy3zyQcOeRRTiRh4HMd/jWHMmlPfQyvU8ap9OC+WpbdmvThLzACw6ld8GD7MxkPE67NLxF457tIRpzGk35NbOrVGa+GH519K0aBSmDhV8l/8lzqz/lP88zZ8DHfwb3lFfzv3EHd517odq0uR9U6nBUrCc37WZeen1K52k7YQ3sOTBgR3ZuDVmL0aTSc8U+Cr3+ZonsPfGgwgqNIjJxvJ2RD5tprmvbjD3Z68XSM4Cjcm14kNuVKnOUr8X/wU5JDj6TE7+GTHXW8t6WGk8Z33uW5d/0HjLLcfPmCsw/daCQIvHEbyA704hnoxdPQi6fzq/Mmc9l/l/DKunKuObLryzslS2eRLJ2FHKvF6O5kDssEI4Fz36dEMzVxowsylqRpmmYCtxz08IYW2+8C7mrj0IOPGfB0xY9firIrFG
d0Qf/5VDXQJQ2TcMJgXGEP7mkyQuDt7+DZ9Bzx0WfQeNqd/a+1xIjjqN+Bo24rjrotyJFKLFcAy52L6cnDcudhunMP/OvJbbdelxRvwP/+j/FoT5EsmUnj6X/LakKaGH0adZe8QO7LN5D37KU0zv8N8cmXYblzCc37CcE3vopz78ckhx8PRrLjWaZtkCyT+KhTiaXKDnTGUbcV32d/xr3pOUxfCXUXPmlXhhcGvcS4c6g/7wFcez4kWXIEySFzezzrM+MkidiE8/Et+wdSpIpkalF5Zf8yPjfzc7y4tpxnVu7tNEn7aFsNYxpWURGcSkneoa1oHu1pPFtewvQW49n0fPPjebmjeTAwnOWfjUYZci5G8TQsT37a4etD51J3yYtp79+W3BfsxEzSY723/FYvEsVsu0gpX4Fv2d+7vSROcthxmE4/OcSoaBRJWm9attueY6KW9mxwuaN2E6Fj7iA6+yvZ+xRsWcjhfTjqtuGo22InZLVbUOq2IjfuQmqx1qKl+JD0jmciWYo3lbgdSOQsdy7OPR8hh/YQnnu7XTS2i0lPXzAKJlJ76UsEX/8ywUXfJFK9gfBx3yc+4QJi298gOdSe5B189SaUag29aErqayp60RTM4Mh2f2568TQaznug0xgcdVvxLfkL7o3PgsNFdOYXiMy+tc2absLglRw5n+TI+dkOo0fi4xeQs/RvuLe+QmzqNRg5pTj3L2XqjBtwKzIr93Y+NOdPC1fzprSTmtFttKKZBt5l/yRZPJ26S19BilajVK7GWbkGpXIVk0PLOMH6AF54CAAjOBK9eBrJ4hnoxdPRi6d3/HtnWUjxui4ldy1jc+5fhp4/DktWSJYd0/Vz9DGRpHWRlGhEbtjd6k0yXXJor70grO9MAlKUqtDAmeI+ECxOLTNy9Khu/LI2cfrsT2Z9WO37EHoM97bXcdRsSrWObUWp29oq8bIUL3reWJIlMzEmXoiRNw4jbyxG3lh74LKp26/NWC1SvB4pXo8cq7P/jdfbg2Pj9cjxOqR4HY6GHan6SHnUXfxsvyvIa3nyqT/vQXIW/xzfyrtRajUazvgnjWf8o3mfxOgzsFwBlKr1uHa81fw7WXvJi+ils3Btfws5Um6PPytUUao3ICUj9jqd7SRxbSZns27JzNg8QcgCo3Ayev543JteIDbtWvQhc3CWLwNgcqmfFXsa2FkbYWQ7RWff2VzF/lCCPxfczo1Tz8I4aLt7yyso9duoP/MukCQsXxHJUSeTHHUyAPsbYpx2z1ucGtzLz+YkUCpW46xcjXvLKwdiDI6i4ax/Ny+s3pJ/0bdx7VlMzbUfd/m5O2o3IukRpFgdydI5/XLdZZGkdVFyxAnUXf5at47N+eCneNY9Rmz2pVy6ayRXuMV4tN5U3hhHlmDe2DSnjLdkWQQWfoX4+AUkxp3T+8G1QwpXkPvqTTjLl2NJMmZgBEbeGKJlR6eSsHEY+WPthbk7atWTFSxPPkZ3Pk32Vw4n4RN/gVE4Cf97PyTvqQU0nHM/Rv44AGLTriY27Wp7Xz2KUrPRLsJaYBfGdW94Es+WlwCwJBlL8WK5AtRc+4k9wrXlpeq24lvyV9wbnxHJmTC4SBLx8QvwffZn5PB+kqWzcW95BSlSyflTh7BiTwOPLN3DHadNaPPwPyzaQgw3J51/C8bBQ00sC++yv6PnjSMxto1WNmBI0EMgr5in6wLcOOEoSmZ77LBidShVa1EqV+Nb9g98n/6JhnPvO+R4I388jg2PI8Vqu9ya5ixfAYCjcRfxyZd36dhMEUlaBlmuAHIyRLBkNJ9Z9VxiioHHvak+lmRyaYCgp+sva6V8GZ7NL5IccUIfRNY2R9U6cl++HjlWS8MZ/yQ+5gxQPBm7/kARm3oVRv54gq9+gbynFhA+5jvopbPQCyaCkirVoXjRS2ail8xsPq7xzH8SbrgDpWqd/ce+egPxcee2mnQg120jZ+lfcWvPgMNJdMbNdnI20MYWCUIPxMefT85nd+
Le/DKJkScT1mOAxNlTSvjVm5vYXNl2Xc+3N1VR3hjnWwUfMSFkkCw8udV25853cFatpeGUP3Y42efao0bwy4Wb+Pv72/n5OfaHLMuTR3L48SSHH4+UaMS35K/I9Tswc0e1OrZptrdSubrLY8qU8uVYihdJj5LI4N/+rhBJWgY1DUD3163hG8qT7N1zHUwUn9R7g2VZrNvfyJmTuvfm6ln3KJbiIz7+/F6OrG2ubQsJLvwqpifX7m5soxlfOCBZdjS1l75C8NWbCbz3Q8BuHTPyxtnj0QonY6TGppm+UnvmpSRj5o4mkTv6kNbRVsmZrBCdcZNIzoRByyiYgF44CffmF4nOvIlIoQrYCcLpajEfbqvBMK1DVgb449tbAItbjIcxNu9v7sJs4lv6dwz/0Ob1QdtzwbQh/P6tzby9qQrLsg4p+RGbdg2+Zf/Au/oBwvN+3GqbXmwXAu5OkuZo2IXpzkWSna0+4PUnIknLINNlFz0MNm7l68qz/KJyPtDpSldCGtaXh4gmTepjXV+uSEqE8Gx6gdiE8/t+TIJl4V3+L3I++jV6yQwazrkPM6e0b695mDCDw6m77BV7rcPqdakWsvU49y9tNWPM9BSkErcpBxK4ggngcCHXbydnyV9xa0+nkrMbicz6skjOhEEvPv58cj75HXLjXqREI47aTSTGn8cJ4wp5dX0F726u4pQWjQqbq8KUN8b5/Jgkzn01xEpbj2tV9n6Ka98ndrmbTsb4SpLE+dOG8NTKfazd18C0stYz6s2cIcTHnYtn/WOEj/pWqwXVLU8+RnAkSuWaLj/n+gUPU/DgMSSHHwdy/0yH+mdUhynLbS+eEEithWYlGrMZzmHlgy3VAMwZkdflY92bX0DSI8SmfL6XozqIESfwzvfwbHiC2PgFNJ76pwPddUJ6JBkzbwyJvDEkxp174OF4PUr1ehxVqeStej3eNQ8gGXEALFnByB2Lo25LKjm7gcisr4jkTBBSYuMXkPPJ73BvfhE5tAfvukeoGnsWs4fbCdN9n+xqlaTd8+F2clwOvj6hBvZBcmjrJM237O+YnoK0a4/desIYXlpbznNryg9J0gCiM27Es+l5PNpTxKZf12pbsnQ2ktH1iXiOhu04QnuJzPlql4/NFJGkZZCVaklzOuymXCkRymY4h5Xle+3yG/PHF3X5WOeu99ELVPTS2b0dVjMpWkPw1S/g2vcJ4SO/QeTIb4piqL3IcueSLDum9RR6U8dRty3V6rYeR80GEiPnE531JdF6KQgHMfPGkCyejnvzC0SP+CLSqvtQqtdTWDydPI/Clqpwc1fkG1olizZVs2BqKbk1r2A6/Rj5E5vP5ahci3vHIsJHf6f1Em8d8LsVjhqVx0tr9vOVeaMo8LWu7aiXziZZMhPv6vuJTbum1USqxtP/1uW/p96V/8G7/N8AJIb3z/FoAP23FPJhqHmNN9OepKzoIknrLduroyiyRGmg7aKtHWk845/Unf9onyVNjpqN5D91Hs6KFTSc8Q8iR31LJGiZICsYBROIT7iA8LF30HDufwnP+7FI0AShHfHxC3BWrMTw2b8jyn67FMfckXnopsWH2+0yR396ewsANx0zEmX/EvTSWa0mBviW/QPT6Sd6UItXZ44elY9hwT/e337oRkkiOuMmlNrNOHe9d8g2LAvMDpf2bkXZtwQ5XocRGIGZO7pLcWaSSNIyqGlMmmTZLySn0faMGaHraiMJCnK6XohVijfYtXv6qNvLueNt8p6+ACkZpe7CJ4lPuKBPriMIgtBT8fELAHDu+wzDV4Jz/1IALp89DICnV+zj9fUVVIUTzB6ey7A8L9EjvkR0xo3N53DUbcW95SVi06/t8motl8wcilOWeFOraie+8zB8JXhXtS7FISVCFN4/65DHO+KsWA5m0p7V2Y8/NIskLYOaWtIkPc59jsvY7p6U5YgODxWNMQwLJpUEunagHqPgoePxLflr7wdlWXhX3kvuy9dhBkZQe+nL6EP6rj
tVEAShp8zgCJKls3BveQm9dBZKqqjtEcNycTkklu+p58537Va0n5xpzwCNqxeTGHN68zm8y/8FspPIjJu7fH1Zljl2TAGRpMHbmyoP3cHhIjb1atw7FuGo29r8sOXyYzmcKBWr0rqOFKnE0bgHydRJ9MOloFoSSVoGWS47iZD0MG+V3MgW1+QsR3R42FRpV+a/au6wLh3n3voqcqyWZG9X2zeS+N/9Pv4PfkJi9OnUXvwsZqCsd68hCILQB+Ljz8dZtZbE8OOJq5+zFyAHZg7LJRQ3qA4nmTMil7I8D85d7+Pa8nLzsXJoH54NTxGbfHm3eye+duIYAP7z0c42t0enXYMlO/Gsur/V43rRdJSq9GZ4NhWxtcBeB7gfE0laJsmKvc5ivJHxiXXkNW7MdkSHhZWpSQMTi7tWPsOz7lGM4Ch7maBeIsXqyH3pGrxrHyQy+1Yazr6n1XRxQRCE/iw+3p41LccbiBz5jeYB+jcdM9J+XDrQiuZd+R9yPvl987HeFXeDZRKZdUu3rz+6wMeQgJtNlWEiiUPHmFm+YuITzsez4QmkFhUS9JLpOGq3QKLzYURK9Tos7EK43VrzM4NEkpZhpjuIlKjnCzW/44rkM9kO57Dw2voKHJI9Oyhdct02XHs+JDb5il5bRF2p30be0+fj3PsJDafeSfjY72VvgXZBEIRuMP1lJIcehXvziyiVa1BS49JmlgUZGnRz87GjGJrrAcvCWb6MZKo+mhSrxbv2YeITLsAMjuxRDF8/aSwW8M7m6ja3R2fciJwM41n/ePNjevF0JCyU6nWdnj86/UaQHCRHnNSjODNBvINkmOUKIicaiUk+cohmO5zDQlU4Qa63a5MGvOsfx5IcxCZf2isxeHe+xZCXLkOO1VF3wePEJ/XOeQVBEDItNn4BSo1G4I3byPn4twAoDpnnbj6Km1Mtao76bcix2uaxtt5V9yHpESKzb+3x9U+ZWMTwPA/PrNzX5na9ZCbJIXPxrrq/uTtWL56GJSs4GtruJm3JufdjJMvot0tBtSSStAyz3EGkeANxRw4BKUokYWQ7pAGtJhwnaViMzO9aUVjTHSQ26VJ74fKeSEbwv3MHJW99GcNfRu0lL6KXHdWzcwqCIGRRfNy5WJKM5fTa47dSpS1kSWpesqmphS05ZC5SIoR31X3Ex5yJkVpSqidkSeKIYUFW7m1g8db2W9McDTtw7XgbANNXStUXNtjj6DrgqNuK/+1vYzlcJIf2/xV/RJKWYaYrgJRoJOHIwU+UilA82yENaO9tsev2zCgLdum46OyvEDrlDz26tlKxivwnzsKz9mHqp32Bfec9ecjiv4IgCAONlVNCsuwY5NBeJD2Co+bQ8dPO/UsxXUGMggl41j6MHK/vlVa0JlfNHQHAvz/c0eb2+NizMXKG4F11r/2AJIHi6fS8yv5lOKLV6MUzwNH1upqZJpK0DLNb0uoxnH5yiFHZKJK0nvh0Zx0AJ4wrTPsY19ZXkaI13b+oaeBd+nfynj4fSY9Sf8Fj1B35/zpdn04QBGGgiI8/H0fELoPRVC+tpcSoU4kc9U0wk3hX3E1i2PG9WmZofFEOxX4XG8pDhOJtFKl1OIlNuw7Xrveak0i39hQFDxwNevtDiVy7FwMQH3tWr8Xal0SSlmGWOxc50Uhj3iRWWuOIJkV3Z0/srosiSzB9aHotaXJoH8HXvoRvxd3dup7cuIfc5y/D//FviI85i9rL3+j3U7gFQRC6Kj7uHCxkLMWLM1UvraXEmNOJzrwZz4YncUTKicz5Wq/HcMnMMizg34u3t7k9OvVKLIcb7+r/AmA5c3CE9qBUb2j3nM69HwOQGNn/Jw2ASNIyznIFkOIN6LNu4evJr+JWHJ0fJLQrppscNzofh5xexWjPhieRLJPo5Mu7fC33pufJf+x0lMo1NJx6J41n/gvLk9fl8wiCIPR3lreA5IgT7LFbJTNbbXPUbrHroyVC+Jb9i2TJzD75sHrNkcNxSPDyuo
p2YiwkNuFC++96rA69aDoASmU79dL0GHLjbkzFh1EwMIrJiyQtw0x3EMlMEnAk8RCnLiK6O7srmtDZXh1hUmmaKw1YJp71j5EYdhxm3pi0ryPFGwi8cRvBhbdiFEyg9vLX7dmb/XgpEUEQhJ6Kj1+AHK9HLzmi1ePuzS+S+9qXcG96AUfDDiJzvtonfw+dDpkjR+bRGNfZWRtpc5/ojBuR9Cie9Y9jBoZhevJRKle3ua9StRYJC714+oD5+y2StAyzUut3Fqy/jw2eG1i7vfPpwkLbPtxWa1eMNsy09nfuXoyjYSexKVemfQ1l76fkP34G7k3PEz7ym9Rd9HS/XoxXEASht8THnoUlKXhW3Y/csKv5cWX/UvT8ifhW34eeP4HEmDP7LIYfnDGREr+Lhljbi6cbxVNJlB1td3laJnrx9HaTNEu2JwrEJl7cV+H2OpGkZVjT+p0+t13XS482drS70IGPttcCMHNYeuPRPOsexXTnpjdg1Eji++T35D13CUgO6i5+xh4kK6dfMFcQBGEgszx5JIbPw7PxaTzrHk09aOIsX4bpH4JSvcGe0dmHRbuHBD28+MWjmdbBuOPojBtxNO7Ctf0N9OJpKLWbwEgesp9r9/sAJMec2mfx9jaRpGVY0/qdfpedpJnxhmyGM6CtK7cT3CNHpresR3TmzYRO/GWn07Tlum3kPXMROUv+Qly9hNrLX0fv7fU9BUEQBoC4eiES4Nr5DmDXGZPj9cj1OzACw4lPuKDPY5A76ZpMjDkTwz8M76r7iMz6MlU3rgbHoQXOfSv+jZFT2vP6mBkkkrQMM925ACiO1IsuEcpiNAPb3voYXqeMx5ne5At9yGziEy/scB/3+scpePxMHPXbqD/zLhpP/ROWq2trggqCIBwuEmPOxJJke7kl08C5bwkASsMOe43ONpKhjJMVotOvw7XnQ+TQPnAeWtxcCu1DjlZhBHq2ZFWmiSQtw5pa0pqWsnAkRZLWHbphEk4YDA12XrwQyyLw1jdwpipTt8e56wOCi75FsvQIaq94g8T483opWkEQhIHJcgXQC6cimTqO6g3o+eMw/GWYnkJi3Zgl31diUz6PpXjwrr6fwJu34/ukdbFyj/YUYNd3G0hEkpZhTWPSJFMnYTlwmGJ2Z3es2mt3E08q6byVSylfZtfyCbe9DlwT78p7ML3F1C94ENNf1itxCoIgDHSxSfZAe8/Gp8HhxhHaS+SIL4DSteX4+pLlySc28XN4tGdw1G/HtevdVttd2xdhAbHJl2UnwG4SSVqGmanZnZbDzWVFz7Mp/+QsRzQw7a6PAXD1kcM73dez7hEsxUd8/Pnt7uOo24p7x1tEp109IJYKEQRByJTYpM/bXZ6Va/G/90NMp5/YtGuzHdYhojNuQDLiWJKEUrWuec1RAKV6PZbixcopyWKEXSeStExTvFiygpxoxO9WCMXFigPdsbEihNcpM64op8P9pEQjnk0vEJtwfodjyzyr/4slO4lOvaa3QxUEQRjY3H7iY89BKV+Gs3wZ8dGnNvcK9SdG4SQSw45HqdmEZMRx1G4CQIpUISdDGPnjsxxh14kkLdMkqXnVgTv2387xtc9kO6IB6Q2tErfi6HTWj3vT80h6tMPaaFKiEc/6J4iPXzDgPmUJgiBkQnzcOcipNTEjR/2/LEfTvujMm5DjdcCBlQdcuz8AsIvuDjAiScsCyxVESjQw2tzJMGt/tsMZcEzLojaSxO/qfFana9sb6AUqeumsdvfxrH8cORkiOvOm3gxTEAThsNGyJ8LMG529QDqRGHUqRmBEqnvWLmrr3P0+pjuXxJiBsah6SyJJywLTnYuUaCQq+fATxbKsbIc0oGwob8QCJpR03NUJ0HDOvdSf+9/2lwAxDbyr7ic5ZC76QevTCYIgCLbk0KMB0HPHZjmSTsgOe2yaZRIfdx5YFu4tr2J6C0EeeGtliyQtCyxXADnRQEz24peiRBJiXFpXvL+lBoC5I/I63lGPgqxgBk
e0u4tr59s4GnYQnXFjL0YoCIJwmHHlULfgYeoveiLbkXQqNvlyLMWHd/2jdvHdRAOY6S0f2N+IJC0LLHcQKd5AwpGDnyiV4US2QxpQVuypB+DEcYXt76RHKfzfsXhX3NPhubwr78XIGUJ87Nm9GaIgCMJhJznypAFRrd9y5xIffRruDU/hWfkfAJLDjslyVN0jkrQsMFNj0pJKDn4pSmVI1Errih01URyyxJAOCtm6t7yKHK1CL5rS7j6Oag3X7veJTr++f1TNFgRBEHpFfMIFSFh41z4EQHLEiVmOqHtEkpYFdktaIxunf49vJ2/BJXc8Q1E4wLIs4rrBqROKOtzPs/5RjOAoksOObXcf76r7sBzuDmd+CoIgCANPYvSpWJKMhD3mO9nB5LH+TCRpWWC5AsjJEL4hE9lqlRHVB2ZfeTaUN8ZpiBscMTy33X0cdVtx7fmI2OQrQGr7JS7FavFsfJrYxIuwvAV9Fa4gCIKQDbKCkTsaANMVwAx0Xvi8PxJJWhY0FQHM3/k6dyiPsLEinOWIBo5Fm6oAyOmg/IZn/WNYkoPY5Evb32fdo0h6TJTdEARBOEwlhx0PQMNZd7c/w7+fE0laFjQtDVVYu4wvOl5mZ41I0tL12c46ANQO1uy0ZCfxCRe0P8DV1PGu/i+JYcdhFE7ugygFQRCEbNNLpgNgdDDDv79TMnUhVVVl4J/ATCAO3Kxp2ubUtiHAYy12PwK4Q9O0u1RVXQ7Upx7fpmnaDZmKua80taR5XS5kySIWDWU5ooFjS1UYWYKxhb5294kc3XE1bNe213GE9hI64ee9HZ4gCILQTyRGnUrdhU8MiBmp7clYkgZcCHg0TTtWVdVjgD8CFwBomrYfmA+gquqxwC+Be1RV9aS2z0/3IqZpEolEOtwnGu14e18zLBe5gDM1YSAZqes05mzJ9r06WGUoQdCjEI1G29zuiJRjOdyY7rx2zxFYfg9J/3DqSo6HXrrv/e0+9WfiXqVP3Kv0iPuUvkF1r6QA5B8BCRMSXXvemb1PgXa3ZLK7cx7wGoCmaR8Dcw/eQVVVCfgb8GVN0wzsVjefqqoLVVVdlEruBrym7k5HKkmTEqIlLR3V4QS6aTEyr/3SG7nL/0rZ02dAO6s4OKvX4SlfQuPkqwdk9WlBEARh8MhkS1qQA92WAIaqqoqmaXqLxxYAazVN01LfR4A/AP8BJgCvqqqqHnRMK7Is4/O13xXWUrr79TY5WQyA02knCQ4jkrVY0tUf4lu42X75HDE8r914PDXrMYqn48tpe8mowEcPYyk+zJnX4HP3/nPqD/dpoBD3Kn3iXqVH3Kf0iXuVnmzfp0y2pDXQuk1PbiPZuhq4u8X3G4GHNE2zNE3bCFQDQ/s2zL7XNCbNcvq5U76OpLc0yxENDLXRJAAXTG9nfIERR6nRmgeLHkyKVOHe+DyxSZdiudsv4SEIgiAI/UEmk7TFwDkAqW7L1W3sMwf4sMX3N2KPXUNV1TLs1rh9fRtm37NcqVxVdvBe/qU0OjsuzCrYNlWGKQu6GZnf9icbpXoDkpkkWTyjze3etQ8hmQmxTqcgCIIwIGQySXsWiKmq+iFwJ/ANVVWvVFX1iwCqqhYDjZqmtRxMdC+Qp6rqB8DjwI0ddXUOGLKC6cxBitYwJ/4xcv2ObEc0IHyyvZYcd/s99EqlnffrxW20pBkJPGv+R2LkfIz8cX0VoiAIgiD0moyNSdM0zQRuOejhDS22V2KX3mh5TAI4LNfssdxBpFgtPwzdz8/064Hzsx1SvxaKJ2mI6wyT2580ABLJ4hmYwZGHbHFvfglHpILQjD/0XZCCIAiC0IsyOXFAaMFyBZF1e4qvzxpEU6K76ZPtdQBMHdL+VOXY1KuITb2qzW3eVfeh540lMXJ+H0QnCIIgCL1PrDiQJZY7iJSMkETBL0Wx2ikZIdg+3F4LwHFj8tvewdSR4g1tblL2L8VZscIei9bOWp6CIAiC0N+Id6wsMV0BpHgDUcmHny
jRpJHtkPq19eWNAMwdkdfmdqVqHUX/mYJr+5uHbPOuug/TFSCmtr+WpyAIgiD0NyJJyxLLFURKNBCTc/BLUSpD8WyH1K/trY/hdcp4XW330CuVqwDQ88e3elwO7cO95WVik68AV9u10wRBEAShPxJJWpZY7iByvIFteceyzhxFY0y0pLUnoZtEkwYnjS9sdx+lYjWmOxczOKrV4541D4JpEJ1+fR9HKQiCIAi9SyRpWWK3pDWy76ifcI9xHkjZjqj/2lIdxrRg/vj268kplavRi6aB1OJG6jG8ax8iMeYMzNxR7R4rCIIgCP2RSNKyxHQHkMwkAbOBfBqoiSSyHVK/9c6mKgCGBNxt72AkUKo3HLLSgGfjc8ixGlG8VhAEQRiQRJKWJZbLXpZowtIf8YzrJ6zc0/bMRAE+3VkHwLDctmukyeFyTP8Q9OKZBx60LLvsRoFKcthxGYhSEARBEHqXqJOWJZbbrveluNx4pSj1qXUphUPtqo3ickjk+VxtbjeDI6i55kNoUcbEufdjlOp1NJ78u9ZdoIIgCIIwQIiWtCwxXfYi626nCz8x6mMDf7WrvmCYFg0xnZL2ujoBzNS9a5GMeVfdi+nOIzbxoj6OUBAEQRD6hkjSssRyNyVpCl4pQSQWy3JE/ZNW0YgFTChqv3xG3jMXE3jz9ubv5YZduLYttFcfULx9H6QgCIIg9AGRpGWJlWpJczjsH4EVD2UznH7rvS3VAMwZmdf2DkYSpWotpvdAeQ7v6v8CEtFp1/V5fIIgCILQV0SSliVNLWlIDqqtIA5DtKS1pSaSRJElTpnQdvkNR81GJCOOXpya2ZmM4ln/GPFx52AGyjIYqSAIgiD0LjFxIEuaxqQZ+eO4IvggZcG2Zy4Odjtro6glfor9bY9JczatNFAyw/6+YgVyvJ64eknGYhQEQRCEviBa0rJF8WDJTuR4I363QighJg4czLIs1uxrpNjf9qxOsIvYmk4/Ru5o+/vy5QAkS2dlIkRBEARB6DMiScsWScJyBZAbdvCnqi8RrPg42xH1OztqI8R1k5hutruPHNqHXjwNJPul7KxYgREcheUtyFSYgiAIgtAnRHdnFpnuIFIizBh2EzDqsx1Ov/PBlhoAjhiW2+4+DefeD/qB8XxK+XKSQ4/q89gEQRAEoa+JlrQsslxBSE0Y8FnRLEfT/yzdXQfAieM6aRVT7PF8cng/jtA+dNHVKQiCIBwGRJKWRZY7iKzbyVkOEawWFfMF2FwVQZJgXDs10tzaU+Q/eipSxF7bUylfAYjxaIIgCMLhQSRpWWS5AkiJMAABKUo0aWQ5ov6lKpQgz+tEbmdZJ2f5cuTGPc3jz5zlK7BkBb1oSibDFARBEIQ+IZK0LDLdQaRkIzHJi58olaFEtkPqN6pCcXTT4pjR+e3uo1SsRi+e2jxpQKlYgV44RawyIAiCIBwWRJKWRZYriBRv4MU5/+Mf+gXopujubKJV2C2MF04f0vYOpo5SvQ692K6PhmWiVKxELz0iMwEKgiAIQh8TszuzyHIHkZNhXEXjqCVJJCG6O5t8tN2e2TmmwNfmdkftJiQ91rzSgKN2C3KiUYxHEwRBEA4boiUtiyxXAIBRG+/hC46X2Fwl1u9s8tH2WgD87rY/RyjVG4ADKw00FbHVS47o++AEQRAEIQNES1oWmW67/ldZ9WLOdMR4sS6e5Yj6j/LGOH63gtPR9ueI+MSLqB52HKavGLCL2JquAEb+uEyGKQiCIAh9RrSkZVFTS5pD8eAnSm00meWI+of6aIK4bjI8t+P1TM2c0gOTBsqXo5fMbP5eEARBEAY68Y6WRZbbXmRdcbrwS1EaYyJJA/hwm93VOWNYoO0dTJ28pxbg3vS8/b0eRaleL8ajCYIgCIcVkaRlkemyuzudTid+ojTGxSLrAO9vrQZg/viiNrc7ajfjLF8Opn2/lMq1SKYuxqMJgiAIhxWRpGWR5U51d8oyfqKEYiJJg6bxaA5mlrW9ZqdSuR
qgeWans2KF/b0ovyEIgiAcRsTEgSyyXHZ3p5E3ll9JEyjKET8Oy7LYVRtj/vgiXErbnyGUilVYig8jz54koJQvx/CX2WPUBEEQBOEwIVrSsqhp4oDlCvBh4Gxkh0jSdtfFqI0mGV/Udn00AGdlaqUB2WF/X75CtKIJgiAIhx2RpGWT7MB0+pFD+zglvoj62spsR5R172y2F0uPJs22dzANlKq1JFNdnVK0GkfDDpIlYtKAIAiCcHgRTTdZZrkDyA27+F7iUT4X/VW2w8m6T3bYMztPmdD2pAEkmZor323+1lm+AhDj0QRBEITDT1otaaqqLlBV1dHXwQxGliuIZNoLq3vNSJajyb5NlWEcEowubKe7U5IwA2WYgTLAXlTdkmSSTWt4CoIgCMJhIt3uzkeBPaqq/klVVfFu2IssdxBJt1ca8DG4kzTdMKmNJCn2u5Elqc19vCv/Q84HP23+3lm+HKNgIrhyMhSlIAiCIGRGut2dpcAlwNXAMlVVVwMPAA9rmiYGUvWA6Qoix3cDkGNFsSwLqZ0E5XC3bn8IC5hU6m93H/eWV8BKjVezLJTyFcTHnZ2ZAAVBEAQhg9JqSdM0Laxp2gOapp0OjAIeBi4Fdqqq+pyqqheI7tDusVwB5EQYgBwpRiRhZDmi7Fmxpw7oYDyaaaBUrmmeNCDXb0eO14kitoIgCMJhqTuzOxuBaqAm9f1Y4F/AJlVVj+2twAYLy52LlAyxLHgqu6wS6gfx0lA7aqPkeZ2cNbmkze2Ouq1IegS9xO5xbypiK5aDEgRBEA5HaXV3qqqqAOdid3eei52oPQL8SNO0Fantd6UeG9POOWTgn8BMIA7crGna5tS2IcBjLXY/ArgDuLu9Yw4XliuAlAyx/vjf8e4rGl/R2yk9MQgs21WPWpzTbnevUrkKOLDSgFK+HEvx2mPSBEEQBOEwk25LWjnwBOAErgSGaZr2DU3TVgBomqYDrwGeDs5xIeDRNO1Y7ATsj00bNE3br2nafE3T5gPfA5YB93R0zOHCdAeRTJ28+D6KqKc6PDhb0sIJnd31MSrCiXb3USpXYykejPzxgF1+I1kyA2RRSUYQBEE4/KT77vYL4CFN06o62OcF4OkOts/DTuTQNO1jVVXnHryDqqoS8DfgKk3TDFVVOz3mYKZpEol0PEsyGu0/syhlyYMfOOHTG/mecyJr90xlapEr22E1y9S9em+r3Xs+pdjX7s8vrl6Lc8gJxGIJMEIUVa2hYfI1nf68M6E/vab6O3Gv0ifuVXrEfUqfuFfpyex9CrS7Jd2WtL8BX1dV9ctND6iqukRV1Z+kEis0TUtommZ1cI4gUN/ieyPVTdrSAmCtpmlaF44Z0Eyn/cMxHR4CRKmLDs5F1j/cbv+Y543Ja3cfI2cIsTJ72KOrVkMyEiSKREUYQRAE4fCUbsLzK+Aa4OYWj90N/ASQgJ+mcY4GWqeLcqqbtKWrgb908ZhWZFnG52t/3ceW0t2vLzmDxQDILg854RghvX/EdbC+jml9hf2p5bjxpfhch04Ulht24lt+F9GZN2PkjcVTv95+fOQx/ep+9adY+jtxr9In7lV6xH1Kn7hX6cn2fUq3Je0q4EpN015pekDTtLuB64Eb0jzHYuAcAFVVjwFWt7HPHODDLh4zoDUtsu5QXPilKA3RwTkmbV9DjIBbaTNBA3DuW4J3zf9Aj9nfl6/A9BZj+ssyGaYgCIIgZEy6LWl5wP42Ht8JFKd5jmeB01VV/RC79e0GVVWvBPyapt2tqmox0HhQl+khx6R5rQHDcucCqSSNahrjg6+7s6IxTsKwmD28/X55pXI1lsPdPJNTqVhhl94YpIV/BUEQhMNfuknap8Dtqqp++aAk6qvYMzE7pWmaCdxy0MMbWmyvxC690dkxh5WmljTJlUMtAYyORvUdptbubwTgi8eNancfpXIVetEUkBWkeD1K7WbiEy/OVIiCIAiCkHHpJml3AIuAU1VVXZp6bDYwBDirLwIbLE
x3EIDkiHncVnkkMwPuLEeUeZ/trMMhwcSSdpaDskyUyrXE1c8BoFTY9dJEEVtBEAThcJbuslCfAtOBp4AcwAU8CUzSNO3Djo4VOuHwYMlO5HgDfpeD0CDs7nx3cxUWoMhtd1066rcjJ0PNRWyd5csBmlceEARBEITDUdrlLDRN24ZdaFboTZKE5Q6ilC/noYZH+FzjncC0bEeVMYZpURVOUOBz4WgnSTO9RTSc8S+SQ+cAoJSvQM8f3zyeTxAEQRAOR+kuC+UBvojdmtY0/U4C3MBcTdPEujw9YLoCSMkIJVItTiOU7XAyanNlCNOCCcU57e5juYPEJyxIfWPhLF9OYuRJGYpQEARBELIj3Za0fwCfx55AMA94DxgHDOcwXKop0yx3Lph26Q23ObiqQb+9uRqAY0fnt7uPd+V/MHLHkBh9KnJoL3K0UoxHEwRBEA576dZJWwBcl1pbcytwKzAWexmodkZ7C+myXAEkw16z0msNriRtyc46AE4cX9j2DpaJ79M/4trxFmAvqg6glx6RgegEQRAEIXvSTdJygU9S/18LzNE0zQB+TarYrNB9ljvYXKTVTxTLGjx1OPY1xMlxOSgLetrc7qjfjpxobDVpwHK40QsnZzJMQRAEQci4dJO0fcCw1P83Ak3T6upJv5it0A7TFUTS7RY0P1HCicExwzOWNKgOx7l8VhlSO0VplUp7kYlksf2SU8pXoBdNBUf/WYReEARBEPpCuknaM8B/VVU9FngTuE5V1QuAHwFb+iq4wcJyBZETEe6f/QJvmbMJxY1sh5QRa/c1YFgwdWiw3X2UilUHVhowdZyVq0iKrk5BEARhEEh34sD3ACcwRtO0R1RVfQF7PFojcFlfBTdYWO4gkhHFkzeUOCHCicGRpC3UqgAwO+jeVSpXoxdOAocTR9U6JD2KLiYNCIIgCINAukna9cAvNE2rANA07Quqqn4DiGmaNjj65vpQ09JQc1bewXnyFLZXT2ZcUfslKQ4Xq/Y2ADB7ePv1zqIzbgDTTlqbitgmS47o89gEQRAEIdvSTdJ+A7wNVDQ9oGna4Cro1YfMVFHWMbUfMEP2sD8Uz3JEBzywZC9jC7ycOc3X6+feXRfF53QQ9Djb3Scx9uzm/ysVKzDdeZi5o3s9FkEQBEHob9Idk7YcOL0vAxnMmlrSDIcXPxHqIoksR3TAfZ/t5Yevb+nWjFPX1tdQ9i1pc1t1OE5MNxmR3/asTgClcg1u7enmma/O8uV26Y12JhkIgiAIwuEk3Za0CuCvqqp+H7tOWrTlRk3TzujtwAYTK7XIuuVwE5Ci1EWTWY7IFmkxNm5XbZSRBem3pknRanJfvZnkkDnUfe75Q7Z/tL0WgFnD2u/qdG96Du/K+4iPXwCJMI6ajcRbtKwJgiAIwuEs3Za0KPA/YCGwGdhz0JfQA6bLTtIkxY2fKPXR/jHMb2t1uPn/z68p79Kxrh1vAxA+6tsAeNY9RuCNryHX7wBg2a56AOZPKGr3HEpF06QBF87KVUiWiS7GowmCIAiDRFotaZqm3dDXgQxmTS1pssNJjhSjMd4/krRt1XbtNpdD4rFlu7n+qOEEOhg/1pJr+5sYvhKSw48HQIrX4d7yCu7NLxKbehWJhjOYUOxvf9KAZaFUrSE+7jzAro8GiOWgBEEQhEEj3QXWr+xou6Zpj/ROOINT05g0ffhx/LtmJEU5/aNQa1O366xhAT7Z2cC/Fu/gO6eO7/xAI4lr17vEx50Dkt1YG511C/GJF+L77C941jzEneZjfFBwCZI5DRzuQ04hN+xAjtejl6RWGqhYjhEcheUt6L0nKAiCIAj9WLrdnQ+18/Uf4Kd9Etkg0pSkWb5iNvjm9JuB8X63ncPfPm8kDlnitfXpdXk693+GnGgkMfq0Vo+bOUMIzf81q855lTfN2UwIfQpyqmXObN166KywVxrQW6w0IIrYCoIgCINJWkmapmlyyy/swrZTgU+Bn/RlgIOC7MB0+nHUbuY8/X
V21PSPRdYrUqVAivwu5ozIpTFu8ElqwH9HkkOOpO7CJ0gMP7HN7QvL/dyW/BrPTLsbJBlH1ToKHjwOz9qHm5M1I280kRk3oReqyOFyHKG9ooitIAiCMKik25LWiqZphqZp64FvAr/o3ZAGJ8sdRKnZyLf1/7C3Ptr5ARnwplaJLIHLIfO1E8YAcNfi7Z0f6HCSHHYcuNouyPvpzjoA5k0ssx+wLEx/GYF3vkv+Iyfj3vQietFUwif8DBzuA+PRxKQBQRAEYRDpVpLWgg6U9UYgg53lCoBl4JQMMPpHMduGmI4i212vk0oDFPqcrCtvJKG3v2yV3LCT4EvX4qha1+4+myvDOCQYXWiX9DCKp1J38bPUn3M/OFwEF36ZggeOQg7vB+z6aJasoBdP7cVnJwiCIAj9W08mDgSBLwKf9GpEg5TlzkWKVALgNvpHd2ckYeB1Hsjjv3riGH722kbe21LDaWpxm8e4tr+Je8ciQvN+1ub2hG5SG00yNOhGbjn2TpJIjDmdxKhTcG96Fu+q+3HUbMTMGYJSsQK9cDIo3l59foIgCILQn6VbzPahNh5LAh8BX+m9cAYv0x3EEdoLgM/qH0lawjAp9B2YeXn25FLu/nAHT63c226S5t7xFnreOMy8MW1u31DeCMDkUn/bF5UdxNVLiKuX2N9bJkrFSuITL+r+ExEEQRCEASjdOmk97RYVOmG5Aki63c2ZQwzTslq3NGVY0jAxLcjzHniJOGSJ6WVBFm6oZOWeemYevFpAIoxz90dEp1/f7nk3VNhLvn7txLaTuIM5arcgJxrFeDRBEARh0Ek7+VJV9SZVVa9o8f0zqqpe1zdhDT6WO4hkxHkveD71+FotyZQNu2vtyQvFB9Vsu2TmUAD+8cH2Q45x7f4AyUyQGH1qu+dds6+BohwXw3LT67pUKlYAiJmdgiAIwqCTVpKmquq3gT/TuuVtHfB3VVVv7YO4Bh3TFURKhtFm/JDdVgmhLK86YKTWU583Jq/V47OG55HrUVi5p/6QCQSuXe9iOv0khx7Z7nnf3VKDaVlIabYSOsuXY7oCGPnjuhS/IAiCIAx06bakfQW4WtO05rFpmqb9ELgeuL33wxp8LFcQyTIoDm8gnwZqsrzIelXY7not8R+6+sF5U0sxLXjg092tHg/N+yl1n3sOHG2vmNAQSxJJGOR701taCuwitnrJzOaVCwRBEARhsEj3na8UWNvG4yuA4b0WzSBmue1VBy5cfi3nOD5lZ5YL2r63pRoAt3Joi9cXjh2FBDy9am/rDQ4XRuGkds/5yQ67EO6MsmB6QehRlOp1YlF1QRAEYVBKN0lbDVzdxuNXABt6L5zBy3IdGITvJ0p1OLstaTtq7DFpZcFD19XMcStMHRqgOpxsTia9K+4m+MpNhyzv1NL7qcTvpPGFacWgVK1DMnWxHJQgCIIwKKVbguNnwIuqqp6IvRQUwFzgJODivghssDFTLWkmMn4pSlUku0ladSQBQNDTdtfkbxdM5vx7PuX5Nfv52oljcW9+CSwT5PZfUmv22eU3Zg3PSysGZ/lyQEwaEARBEAandNfufBU4AdgPnAucAZQDR2ma9mLfhTd4WC67C9B0ePATpT6W3SStPqrjdLQ/uL8k4OH4sQU8s3IfRqgSpXx5h7M6LcuivDFOwK3gcznSikEpX47hH4qZU9rl+AVBEARhoEu3JQ3sFrTbNU2rAFBV9ThgTZ9ENQhZbjtJsxQ3gUSU+lh2Z3dGEjoepeNkqjTgJpQwWP7+M5yNRWJU+0navoY4CcPiK/NGpB2Ds3yFaEUTBEEQBq10S3BMBDYB/6/Fw88Cq1VVTa8qqdAhs6klzVdKlRXEsqysxhPXTQLujpO0m48ZBYBz25sYvhL04mmttud89CtyPvgZzp3vsGGPveTV7BF5aV1fitbgaNghitgKgiAIg1a6LWl/BZYBv27x2ATgfuz6aRf0bliDT9PszqR6IXfVzeHUnLbLWGQkFsvCIUvMHp
Hb4X4FOS4mFHqYGtKoLj0FqUWZDEfdVnzL/gmAb+U9XISTMuckxm09B4dyGkbBJOigVpoYjyYIgiAMdukmaccBczRNq2l6QNO0BlVVf4C9fqfQUw4PluxCTjTgdztozOKYtHDCIGFYjC3M6XTf648ZzUkv38lpETctl1R3bX0dgJor30Fu3M3LLz3GXGkFZct+C8t+i+ErJTnyRBIj7C/L23rGp1KxAkuSSRbP6M2nJgiCIAgDRrpJWgQow+7ybKkIyO76RYcLScJyB3FtW8hd0Ve5cc/vsxbK+tQi6On0uJ42sYhfvO5m4S6LHxomTofdmube9jrJomkY+eOJ547jh1GZIv/1vHbVSJy73se1811c297As+FJAJLF00mOOJHEyJNIDpmLUr4Co2AiuDpPFAVBEAThcJRukvY08C9VVb8EfJZ6bC7wL+D5vghsMDJdATASFNNIXDezFsf6cnsRdKWD2Z1NCp5ewD/K5nLTzlN5Z3M1p6vFSJFKlP1LiRz5DQC08kYsYGKxH9NfRnzy5cQnXw6mgVK5Gteu93DufBfvin/jW/YPLMUHpk5MFdVdBEEQhMEr3STtu8CTwLtAU/uKBDwHfKP3wxqcLHcQKVqNX4qSMLKXpO2pswvZji7wdbifXL8DZ8VKjjz+IobWuXl21T5OV4txb38DCYv42LMAeGezXcT22DH5B53AgV56BHrpETD3NqREI849H+Ha+S7K/iXEJ5zf689NEARBEAaKtJI0TdNCwNmqqqrANCCJXTPtaOADoNOBQ6qqysA/gZlAHLhZ07TNLbYfCfwJO/nbj71WaExV1eVAfWq3bZqm3ZDmcxtwLFcQOVxBDjEMI3u9yPsb7XU7xxf5gPaTRdeOtwBIjj6VE6oNnlixl+3VEWZsW4gRGIFROBmAZbvrADhhbEGH17VcARJjziAx5oyePwlBEARBGOC6tGq1pmkasA97lYG3gb+Q/pi0CwGPpmnHAncAf2zaoKqqBNwD3KBp2jzgNWCUqqqe1HXnp74O2wQNUrXSTHvCgMeMZi2OpiWpiv2HLgnVknv7W+h54zDzxjC+2G51u//9dbh2vU98zBnNszdDcZ05w3MZGvT0beCCIAiCcBhJqyVNVdVc4Frgi8CU1MMLgd9pmvZ2mtdqSr7QNO1jVVXnttg2EagGbldVdTrwsqZpmqqqRwM+VVUXpmL9vqZpH3d0EdM0iUQ6Xpw8Gs3u4uXt8cheFMNuxfIRJRQOI3dQpqKv1EbiOGSIRqPt3ispGaZoz4c0Tr6aSCTCKWOC/E6WkLe/g+SM01A2n3gkQiRhsK06yklz84lGs5d49rX++prqj8S9Sp+4V+kR9yl94l6lJ7P3KdDulg5b0lRVPV5V1QeAvditZnHge9h9YN/qQoIGEORAtyWAoapqU5JYhF3m45/AacCpqqqeij2r9A/AmcAtwMMtjjnsmK4gkqHzx+mvsp8CosnsjEsLuBWGdNKK5qzdhCUrRIfPt793yBw7KpeT5SXElCDx0jkArNjbgEWHJdEEQRAEQWhDuwmPqqprgMnAcuCXwBNNY8hUVf1lN67VQOt0UdY0rWnto2pgs6Zp61Lnfw2Yg50YbtY0zQI2qqpaDQwFdrV3EVmW8fk6HvDeJN39MkXJKUA2ouTl5gG1GLITny/zXYSmJTGxNNDq/hxyr0YfR/VNq5FlBV9qUfVbjx/BpCeX87Z5JHP99goKH+/aDcCE0tx+d7/7wmB4jr1F3Kv0iXuVHnGf0ifuVXqyfZ86akmbhF0X7SXgvZaD/LtpMXAOgKqqxwCrW2zbCvhVVR2f+v4EYC1wI6mxa6qqlmG3xu3rYRz9lplav/PEFV9nhrSFvfXxjMdgWRb7GmK4HR28NCwLjAQoHpAP5PmTkuvIk8I8FzuCXbV21+bqfQ0AHDkyry/DFgRBEITDTkdJ2jDgP8DngHdVVd2rqupfVVU9kQNlOLriWSCmquqHwJ3AN1RVvVJV1S9qmpYAbg
IeUVX1M2CXpmkvA/cCeaqqfgA8DtzYovXtsGOl1u+cEPqUYVIVFY2xjMdQFU4Q003qO1jxQKlaS+G9M3DuXtzqcde21zFkNx+YM3hutZ1L766L4nM6yPU6+zRuQRAEQTjctNvdqWlaOfZ4sD+kymNcD1wJ3Jra5RZVVX+vaVq7XY8Hnc/EHlfW0oYW2xcBRx10TCJ1zUHBSrWkAfilKNWRRMZj2FIVBqA00P6YNNeOt5CTIfSCiQcetCzc2xaijzyBI/VhvLB6PxfPGErCsBhX5O3rsAVBEAThsJNWCQ5N0z7TNO1W7PFgVwCvAl8Gtqqq+kwfxjeoWK4DQ/b8RKmNZL7RcHuNPaNlWF77Y+Fc298kWXIElq+4+TFH1TocjbtJjDmT0QU+6mI6v19k95DPGt7xQu2CIAiCIByqq3XSkpqmPalp2nnAcOD7wIQ+iWwQMt0Hkhk/UeqimV9kfU+d3cU6Kr/twZJStBqlfAWJ0ae2ety97XUsJOKjT+e4MXbR2sXbanFIcOWc4X0btCAIgiAchrpdziLVHfr71JfQC5pa0kzZSY4Uo6GDcWF9ZV9qtYFxhW0naa4dbyNhkRjVOklzbXsdfehcLF8RR3gt8rxO6qJJJpb4O+w6FQRBEAShbV1qSRP6VtOYtMiEz/G0eVKnFf/7gkOy1+Vqr7tTDu/HCI5CL5524LGG3Tir1hIfcyYAkiRxycyhAFldKF4QBEEQBjKRpPUjliuAhYQUHEqVezSG1Z1JtD2T41Io8rtwyG2/NKJzvkrNVe+BdGC7e9vrAK3W3Lx0VhkAhT5XH0YrCIIgCIevw7Z6/4AkyVguP0r5cs6wTDZWnZ7xEHbXRclvp1yGFG/AUrzgaL3dte119PyJGHljmx8r8Ll46OrZDAmKrk5BEARB6A7RktbPWK4gStVaPme+zp76zNdJW7O/kZpI22PhfEv+QuEDR4JxYLsUq8W595NWrWhN1FK/qI8mCIIgCN0kkrR+xnLbkwf8Uiwr47mShkXQ03YDq2vHW+hFU1q1pLl2vIVkGcTHnpmpEAVBEARhUBBJWj9junLBsghIURJGZpO0xrhdl63Qd2jrl1y/HaV2M4lRp7R63L1tIYavFL1kZkZiFARBEITBQiRp/YzlDoBl4idK0sjsxIEtlfZqAyVtlMxwb38LgHjL0ht6DNeOd+yuTkm8lARBEAShN4l31n7GcgWRTJ0cIhhmZpO0banVBspyDy2/4dqxCD1vLGbemAOP7V6MpEeItzEeTRAEQRCEnhFJWj9jt6QZvBG8BDAxM1iGoya1Vujo/IPW2rQsLMVDYuw5rR52bXsN0+knOfy4TIUoCIIgCIOGKMHRz5iuXCQ9xqZJX8eq2EYkYeB3Z+bHlJ+qaTZj2EFrbUoSDefce1CgBu5tb9hj1ByizIYgCIIg9DbRktbPWK4AkmUwvP4zvMSoiyYydu2qkL0kVGFO6wK0SsNOMFsv9q6UL0eOVrVZekMQBEEQhJ4TSVo/07Q01CXabYyR9rO/IZ6xay/UKnFIEk5Hi5eFZVH66pUEFn2r1b7uba9jyc5DZnsKgiAIgtA7RJLWz1iuYPP//USpCmdukfX6aBLFIbV6zFmzHiVSQWLY8QcetCxcW18jOezY5qRSEARBEITeJZK0fsZskfTkSDGqw5nr7owmTXxOR6vHfLvexkIiMerk5scctZtR6rc1L6guCIIgCELvE0laP9OyZSpAlNoMjklL6OYhqw14d79Domg6lq+4+TFX84LqmV9bVBAEQRAGC5Gk9TOtujulKPVRvYO9e09cN7CA/JarDSTCuKrWECs7vtW+7m2vkyyZiekvy0hsgiAIgjAYiSStn2nq7oznDKPeyslYnbSmQrYl/gPlNORYLbGhxxIbesyBx8LlOMuXi1mdgiAIgtDHRJLWz1gue4H15PRrWCgdS57X1ckRvUPGnjAwb2xB82NmcDgVZ95HrOzY5sdc294AEOPRBEEQBKGPiSStv1
E8WA43cqKBHKdMQywzszurUhMUWi4J5ajbekh9NNe21zGCozAK1IzEJQiCIAiDlUjS+iHLFcCz5kG+r/+T5bvrM3LN97dUA+Btmt1p6uQ9cTb5n/6qeR8pEcK1e7HdiiZJbZ1GEARBEIReIpK0fqhpXFpQjhJNGhm55sGLqyvVG5CTYeLFs5r3ce58B8lMkBgrujoFQRAEoa+JJK0fsselSfil/9/enYdHUaQPHP/2nLkmByGQACFcUiAIiqCiiIp4i8p6rLuuoiCgP1A8uQRRREUEF1ZlBUVBRREXj0XRRV3FRUS5BA3QXOFOIAnknsyR6d8fk4RADgbIMUnez/PwPElXV3dNpSEv1VX1FlLo9dXKPY/ke9CgNE+oJXUNAK7m55eeY9/1Nb6QJnjie9ZKm4QQQojGTIK0IGTY/QnOHZoTT1HtBGk5hR6sZbINWFPXUBSRQFHJNhtFHmx7/ourzVVgMldyFSGEEEJUFwnSgpBhc4Dhw4ETT1HtbMGR7yk6Nh/NMLCm/oon4YLScuvB1ZjcObL1hhBCCFFLJEgLQj57JBhewrXCWtsnze31lb7qxOukqElHPK0uLS23p3yNYQnBndi3VtojhBBCNHaWk58iapthi0Qz4M1uS/D9coAin4HZVHOrKQ3Df/1zW/pfs2INI/umD/xfFxT4E6qnLMedeBlYQ2usHUIIIYQ4RkbSgpBhj0QrKiTK7n/9WOCu2RWeea4iPEUGHeLCATDlHTxufzRbZjLmvFTZwFYIIYSoRRKkBSFfcf7OGzcNI5TCGk+yvjktF4CSdQNRX9xD5FdDS8tD936HoZlwt+lfo+0QQgghxDESpAUhw+5PDdW2cDMOnBzOddXo/ZLTcgAwmzQ0VzbmTB1vs+6l5WF7v8WT0AsjtElllxBCCCFENZMgLQgZtqjSrx1aARn5NZsa6kB2IQBtm4RhTVuHhoEnoRcAlty92I7quNteW6NtEEIIIcTxJEgLQiUjaQAROMnMr9nXnSUjde3jwrGkrsXQzHia+zMNhO79DgCXbL0hhBBC1CoJ0oJQyZw0gAjNWeNz0jIL/CN1MaFWrKm/4o3rCtYwAML2foc7RuGLSqrRNgghhBDieBKkBSGjbJCGkyynt4qzz1y204PFpKEV39udeJm/wJ2P/dB6nK0uq9H7CyGEEKI82SctCBnFCdazO93Fr7914k/hthq9X7jdjM1sAk0j54a3S49bU39FM7wUJvSWaF4IIYSoZfK7NwgZtggMNCwRceSbo2o8f6fPB6p5BJrzyPH7ox34CcNkxdW8R43eXwghhBDl1dpImlLKBMwGugMu4H5d13eUKe8FvAJoQBrwN8BdVZ0GSzNhWCOw7vuRS0zhbE+PrrFbGYZBWq6Lc1o4cPwwGnP2Ho7e+Q0A1gM/44rrjmGRLANCCCFEbavNkbRbgBBd13sDY4EZJQVKKQ14E7hP1/U+wNdAUlV1GjrDHon18G9caGziYPEWGTXhUK4Ll9fH0XwP1tS1eJt2AUBzZWNJ/53ChItq7N5CCCGEqFxtzkkrCb7QdX21UqpnmbKOQCbwiFLqHOBLXdd1pdTwKupUyOfzUVBQUOU5TmfV5cEgyhqBSTPj0JwUuL0n/UynK/lAFgCdrWmYnBnkx3ajoKCA0L0/ohk+spucWy/6q65JHwVO+ipw0leBkX4KnPRVYGq3nxyVltTmSFokkF3m+yKlVEmQ2BS4GP+rzf7AlUqpK09Sp0Hz2RwYmgmHVojLW3Nz0vZm+UfpztO2AeBqdj4AIamr8ZntFMZ2q7F7CyGEEKJytRnw5HB8uGjSdb1klnomsEPX9c0ASqmvgfNPUqdCJpOJsLCwgBoU6Hl1QQuNQUMj0uTEXWTUWFsP5/uTt3fxbcVnj8La4hysmomwQ7/iTehFSEQ0ENx9FUyknwInfRU46avANKZ+KirycvRoOl7vqe2j6fMZAHg8+TXRrAajJvrJYrERExOH2Rx46FWbQdpPwABgsV
LqIuD3MmW7gAilVIfihQGXAvOAnVXUadAMmwMwcODEW2TU2H3SirMNOEKs/v3RNBOaMxNL5hbyLxxTY/cVQghx+o4eTSckJIzw8Hg0TQu4XlGR/z/mZrO5pprWIFR3PxmGQX5+DkePptO0aULA9WozSPsUuEoptQr/Cs77lFJ/BSJ0XZ+rlBoCfFC8iGCVrutfFq8IPa5OLba3Thn2SAzNzLro6zCl19x9Sv5qe6+eQW7xX3TrgZ8BcLe6uOZuLIQQ4rR5ve5TDtBE3dE0jfDwSPLysk6pXq0Fabqu+4AHTji8tUz5f4ELAqjTKPjsUWhFhWS0GYg7bS9FPgOzqfr/MkaGWmkTXoTJ8IHm/x+D7cAqfNZwvHHdwFWzyd2FEEKcHgnQ6pfT+XnJZrZByrA50AwfSUf+B0C+u2ZSQ6VlFzLC/Amx75xXupGt9cAqPAkXgNlaI/cUQghR/7hcLm67bUCl5evXr2XSpHG12KKGT4K0IFWSGur2vU8DcLSgZka0Nh3MQbmTKYpuByYLpvw0LEd34Gl1SY3cTwghhBCBaRTbWdRHvuIk6zY8WPCSnuciqUn1rlwyDAOTz0VnduFJuB8A6/5VAHhaynw0IYRoSJYtW8rKlStwuVwcOZLJ7bf/hf/9bwUpKTsZMWIUTqeTxYs/xGq1kpjYmtGjn8LtdjN58gRyc3Np2bJV6bV27tzBzJkvYxgGUVFRjBs3qQ4/WcMlQVqQKhlJAwinkIz8U1tmHYhsp4du2i6seClI8E8HtB5Yhc8eVZp5QAghRMNRUFDAjBmv8v333/LRRx8wd+58NmxYx6JFC9mzJ4V33llIWFg4//jHDD7/fAkAbdu2Z/jwESQn/8H69WsBeOmlKYwb9zRt27bjiy8+Y+HCBfTqdWFdfrQGSYK0IOXfgsPPoTk5UgOvO3dk5NPT5N/E1hPvT+ZgO7AKT4uLwCTLs4UQoqE56ywFQESEgzZt2qJpGg6HA5erkLZt2xEWFg5A9+49WLNmNQAXXtgbgC5dumKx+MOGPXtSmDFjKuDfsy0xMam2P0qjIEFakDLsUaVfR+CskTlpu48UEKEVkBamMIc2wZSzD3POXpzdhlT7vYQQQtS9ylcYauzenYLT6SQ0NJTffltPYmJrNM3EH3/8zqWXXs62bVvxev0LzFq3TmLChMnEx8ezadNvZGZm1N6HaEQkSAtSvuKRtPyIdnhcZnxG9W9om5HvZp73TmIv7sSV+F91Arhl0YAQQjQqZrOZwYOH8/DDw9E0E61aJfLAAyMxm828+OKzPPjgEJKS2mC1+lf9P/74OKZMeRqfz5+2cOzYiWRk1OCmno2UZtTAL/+65PEUGVlZVSdGLUlWHtQpRLyFxM3pQM4Fo+n247k8eEkbBl/Uulpv8emG3Uz9726+GN6buAg7jm9HYdv7A5n3/QbF/9uqF30VBKSfAid9FTjpq8A0xn5KS9tDfPypv2KUjAOBqal+qujnFhfnWAf0rOh82YIjWFlCMMx2zK4sws1FHCmo/oUDLfd+zm/2ocQaR8AwsO7/CXfLi0sDNCGEEELUHQnSgphhDSd841wGsIJNB3Oq/fqW1DW4sWKKiMecnYI5P0223hBCCCGChARpQcxXvA2HAycF7qJqv/7ZRVtYbyjQtGP7o8l8NCGEECIoSJAWxAybf4VnpMmJ01O9QZop/xCJHOJ3c2fAv2igKLw5RVFtq/U+QgghhDg9EqQFMSMkGkMz4dAKKfT6qvXaltQ1AOy0dQHD8O+P1vISmY8mhBBCBAkJ0oKYfxsOE5GmQtxF1RukeXPSOGJEkBmhMB/ZhsmZIfPRhBBCiCAiQVoQ86eGMggx+fD5qnerlI0Jd9DT9QZNIiOwHvgJkP3RhBBCiGAiQVoQM2wOMFlY2flZzKZqfA3pK8JiGPgwcVGbGGwHVlHkSMQXmVh99xBCCNGoLFny0RlfY9iwe0lNPXjK9fbs2c3IkcPO+P4nc9NN11Ralpp6kGHD7q
[base64-encoded PNG output omitted]\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"a = plt.axes(aspect='equal')\n",
"sns.scatterplot(x=test_df['MPG'].values, y=mpg_hat_df['MPG_predictions'].values,\n",
" s=50)\n",
"plt.xlabel('True Values [MPG]')\n",
"plt.ylabel('Predictions [MPG]')\n",
"lims = [0, 50]\n",
"plt.xlim(lims)\n",
"plt.ylim(lims)\n",
"_ = plt.plot(lims, lims)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare K-fold Cross Validation metrics against hold-out test metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hold-out Test Metrics"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'loss': 8.303831,\n",
" 'error': -0.45136052,\n",
" 'mean_squared_error': 8.303831,\n",
" 'mean_absolute_error': 2.2274728,\n",
" 'r2': 0.8558148}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_stats['MPG']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### K-fold Cross Validation Metrics"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'loss_mean': 8.111681,\n",
" 'loss_std': 2.4598064,\n",
" 'error_mean': 0.0380627,\n",
" 'error_std': 0.5965346,\n",
" 'mean_squared_error_mean': 8.111682,\n",
" 'mean_squared_error_std': 2.4598064,\n",
" 'mean_absolute_error_mean': 2.0598435,\n",
" 'mean_absolute_error_std': 0.2779836,\n",
" 'r2_mean': 0.8666786,\n",
" 'r2_std': 0.03552912}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kfold_cv_stats['overall']['MPG']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: examples/lbfgs/config.yaml
================================================
input_features:
  - name: RESOURCE
    type: category
  - name: MGR_ID
    type: category
  - name: ROLE_ROLLUP_1
    type: category
  - name: ROLE_ROLLUP_2
    type: category
  - name: ROLE_DEPTNAME
    type: category
  - name: ROLE_TITLE
    type: category
  - name: ROLE_FAMILY_DESC
    type: category
  - name: ROLE_FAMILY
    type: category
  - name: ROLE_CODE
    type: category
output_features:
  - name: ACTION
    type: binary
preprocessing:
  split:
    type: fixed
defaults:
  category:
    encoder:
      type: sparse
trainer:
  batch_size: 32769 # entire training set
  train_steps: 1
  steps_per_checkpoint: 1
  learning_rate: 1
  regularization_lambda: 0.0000057
  optimizer:
    type: lbfgs
    max_iter: 100
    tolerance_grad: 0.0001
    history_size: 10
================================================
FILE: examples/lbfgs/model.py
================================================
import logging
import pandas as pd
from ludwig.api import LudwigModel
from ludwig.datasets import amazon_employee_access_challenge
df = amazon_employee_access_challenge.load()
model = LudwigModel(config="config.yaml", logging_level=logging.INFO)
training_statistics, preprocessed_data, output_directory = model.train(
    df,
    skip_save_processed_input=True,
    skip_save_log=True,
    skip_save_progress=True,
    skip_save_training_description=True,
    skip_save_training_statistics=True,
)
# Predict on unlabeled test
config = model.config
config["preprocessing"] = {}
model.config = config
unlabeled_test = df[df.split == 2].reset_index(drop=True)
preds, _ = model.predict(unlabeled_test)
# Save predictions to csv
action = preds.ACTION_probabilities_True
submission = pd.merge(unlabeled_test.reset_index(drop=True).id.astype(int), action, left_index=True, right_index=True)
submission.rename(columns={"ACTION_probabilities_True": "Action", "id": "Id"}, inplace=True)
submission.to_csv("submission.csv", index=False)
================================================
FILE: examples/llama2_7b_finetuning_4bit/README.md
================================================
# Llama2-7b Fine-Tuning 4bit (QLoRA)
[Open in Colab](https://colab.research.google.com/drive/1c3AO8l_H6V_x37RwQ8V7M6A-RmcBf2tG?usp=sharing)
This example shows how to fine-tune [Llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) to follow instructions.
Instruction tuning is the first step in adapting a general purpose Large Language Model into a chatbot.
This example uses no distributed training or big data functionality. It is designed to run locally on any machine
with GPU availability.
## Prerequisites
- [HuggingFace API Token](https://huggingface.co/docs/hub/security-tokens)
- Access approval to [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- GPU with at least 12 GiB of VRAM (in our tests, we used an Nvidia T4)
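As a rough sanity check on the VRAM figure above, the footprint of the weights alone can be estimated from the parameter count. The sketch below is back-of-the-envelope arithmetic only: the 7-billion parameter count is a round number taken from the model name, the helper name `weight_memory_gib` is made up for illustration, and the estimate ignores quantization metadata, LoRA adapter weights, activations, and optimizer state, all of which add to the real requirement.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Estimate the memory needed to hold the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Assumption: ~7e9 parameters, a round figure based on the "7b" in the model name.
NUM_PARAMS = 7e9

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit weights take a quarter of the fp16 footprint, which leaves headroom on a 12+ GiB card for the other components listed above.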
## Running
### Command Line
Set your token environment variable from the terminal, then run the training script:
```bash
export HUGGING_FACE_HUB_TOKEN=""
./run_train.sh
```
### Python API
Set your token environment variable from the terminal, then run the API script:
```bash
export HUGGING_FACE_HUB_TOKEN=""
python train_alpaca.py
```
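Because both entry points rely on `HUGGING_FACE_HUB_TOKEN` being exported, it can help to fail early with a clear message when the variable is missing. The following is a minimal sketch, not part of the example scripts; the helper name `get_hf_token` is hypothetical.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token from the environment or raise a clear error."""
    # Hypothetical helper: reads the same variable the commands above export.
    env = os.environ if env is None else env
    token = env.get("HUGGING_FACE_HUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HUGGING_FACE_HUB_TOKEN is not set; export it before starting training."
        )
    return token


if __name__ == "__main__":
    try:
        get_hf_token()
        print("Hugging Face token found.")
    except RuntimeError as err:
        print(f"Missing token: {err}")
```

Calling such a check at the top of a training script surfaces a missing token before any model download is attempted.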
## Upload to HuggingFace
You can upload to the HuggingFace Hub from the command line:
```bash
ludwig upload hf_hub -r / -m
```
================================================
FILE: examples/llama2_7b_finetuning_4bit/llama2_7b_4bit.yaml
================================================
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
quantization:
  bits: 4
adapter:
  type: lora
input_features:
  - name: instruction
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: finetune
  learning_rate: 0.0003
  batch_size: 2
  gradient_accumulation_steps: 8
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01
backend:
  type: local
================================================
FILE: examples/llama2_7b_finetuning_4bit/run_train.sh
================================================
#!/usr/bin/env bash
# Fail fast if an error occurs
set -e
# Get the directory of this script, which contains the config file
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# Train
ludwig train --config "${SCRIPT_DIR}/llama2_7b_4bit.yaml" --dataset ludwig://alpaca
================================================
FILE: examples/llama2_7b_finetuning_4bit/train_alpaca.py
================================================
import logging
import os
import yaml
from ludwig.api import LudwigModel
config = yaml.safe_load("""
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
quantization:
  bits: 4
adapter:
  type: lora
input_features:
  - name: instruction
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: finetune
  learning_rate: 0.0003
  batch_size: 2
  gradient_accumulation_steps: 8
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01
backend:
  type: local
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# Initiate model training
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset="ludwig://alpaca",
    experiment_name="alpaca_instruct_4bit",
    model_name="llama2_7b",
)
# List contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)
================================================
FILE: examples/llm_base_model_dequantization/README.md
================================================
# Convert quantized base model to fp16
Ludwig has utility functions to convert nf4-quantized bitsandbytes base models back to fp16
for more efficient inference. This is desirable because inference with bitsandbytes is slow:
every forward pass through the model requires dequantizing the model weights from nf4 to fp16 layer
by layer and then quantizing them back to nf4 to keep memory usage constant.
By dequantizing the base model to fp16 upfront, you get the same effect as the quantized weights
without sacrificing inference performance.
## Visual Illustration
### Without dequantization upfront
| **Request 1:** | **Request 2:** | **Request 3:** |
| ------------------------------------------ | ------------------------------------------ | ------------------------------------------ |
| - Quantized bitsandbytes model | - Quantized bitsandbytes model | - Quantized bitsandbytes model |
| - Dequantization of layer 1 (nf4 to fp16) | - Dequantization of layer 1 (nf4 to fp16) | - Dequantization of layer 1 (nf4 to fp16) |
| - Forward Pass (using dequantized weights) | - Forward Pass (using dequantized weights) | - Forward Pass (using dequantized weights) |
| - Quantization of layer 1 (fp16 to nf4) | - Quantization of layer 1 (fp16 to nf4) | - Quantization of layer 1 (fp16 to nf4) |
| - Dequantization of layer 2 (nf4 to fp16) | - Dequantization of layer 2 (nf4 to fp16) | - Dequantization of layer 2 (nf4 to fp16) |
| - Forward Pass (using dequantized weights) | - Forward Pass (using dequantized weights) | - Forward Pass (using dequantized weights) |
| - Quantization of layer 2 (fp16 to nf4) | - Quantization of layer 2 (fp16 to nf4) | - Quantization of layer 2 (fp16 to nf4) |
| - ... | - ... | - ... |
| - Final Output | - Final Output | - Final Output |
### With dequantization upfront
| **Request 1:** | **Request 2:** | **Request 3:** |
| -------------------------------- | -------------------------------- | -------------------------------- |
| - Dequantized base model in fp16 | - Dequantized base model in fp16 | - Dequantized base model in fp16 |
| - Forward pass through layer 1 | - Forward pass through layer 1 | - Forward pass through layer 1 |
| - Forward pass through layer 2 | - Forward pass through layer 2 | - Forward pass through layer 2 |
| - ... | - ... | - ... |
| - Final Output | - Final Output | - Final Output |
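The difference between the two tables can be mimicked with a toy, pure-Python layer. This is only an illustration under loud assumptions: it uses a simple absmax scheme with signed 4-bit integer codes rather than the nf4 format, it models a single layer, and the class and function names are invented for this sketch. It is not how bitsandbytes or Ludwig implement quantization.

```python
from typing import List, Tuple


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Toy absmax quantization to signed integer codes in [-7, 7] (not nf4)."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to floating-point weights."""
    return [c * scale for c in codes]


def dot(weights: List[float], inputs: List[float]) -> float:
    return sum(w * x for w, x in zip(weights, inputs))


class QuantizedLayer:
    """Keeps 4-bit codes and dequantizes them on every forward pass."""

    def __init__(self, weights: List[float]) -> None:
        self.codes, self.scale = quantize_4bit(weights)
        self.dequant_calls = 0

    def forward(self, inputs: List[float]) -> float:
        self.dequant_calls += 1  # paid again for every request
        return dot(dequantize(self.codes, self.scale), inputs)


class DenseLayer:
    """Holds weights that were dequantized once, upfront."""

    def __init__(self, quantized: QuantizedLayer) -> None:
        self.weights = dequantize(quantized.codes, quantized.scale)

    def forward(self, inputs: List[float]) -> float:
        return dot(self.weights, inputs)


quantized = QuantizedLayer([0.7, -0.35, 0.1, 0.0])
dense = DenseLayer(quantized)
inputs = [1.0, 2.0, 3.0, 4.0]

requests = 3
quantized_outputs = [quantized.forward(inputs) for _ in range(requests)]
dense_outputs = [dense.forward(inputs) for _ in range(requests)]

print("outputs match:", quantized_outputs == dense_outputs)
print("dequantization calls on the quantized path:", quantized.dequant_calls)
```

Both paths compute with the same dequantized values, so their outputs agree; the only difference is that the quantized path repeats the dequantization work on every request.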
## Running the example script
The example `phi_2_dequantization.py` shows how you can quantize and then dequantize Phi-2. This process
can be repeated for any other base model supported by Ludwig that is quantized using 4-bit nf4 bitsandbytes quantization. You will need a GPU to run the script successfully.
Beneath the surface, this script:
1. Loads the base model in 4-bit nf4 quantization
1. Dequantizes the model layer by layer back into fp16 in-place
1. Writes the new dequantized weights to disk at `save_path`
1. Writes the tokenizer to disk at `save_path`
Make sure you update the paths at the top of the file for base model, save path, and huggingface repo ID!
## Bonus
If desired, you can also use Ludwig to push the new dequantized model weights straight to HuggingFace hub!
```python
from ludwig.utils.hf_utils import upload_folder_to_hfhub
upload_folder_to_hfhub(repo_id=hfhub_repo_id, folder_path=save_path)
```
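The `repo_id` passed above follows the Hub's `namespace/name` form. A small helper can build and validate it before uploading. This is a sketch only: `build_repo_id` is a hypothetical name and the example values are placeholders, not something Ludwig provides.

```python
def build_repo_id(namespace: str, name: str) -> str:
    """Join a Hub namespace and repository name with a forward slash."""
    # Hypothetical helper: trims stray slashes/spaces, then rejects empty or nested parts.
    namespace, name = namespace.strip("/ "), name.strip("/ ")
    if not namespace or not name or "/" in namespace or "/" in name:
        raise ValueError("expected a non-empty namespace and name without slashes")
    return f"{namespace}/{name}"


# Placeholder values for illustration only.
print(build_repo_id("my-username", "microsoft-phi-2-dequantized"))
```

Using an explicit forward slash keeps the ID independent of the operating system's path separator.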
### Dequantized base models already on huggingface hub
- [CodeLlama 7b Instruct](https://huggingface.co/arnavgrg/codallama-7b-instruct-nf4-fp16-upscaled)
- [CodeLlama 13b Instruct](https://huggingface.co/arnavgrg/codellama-13b-instruct-nf4-fp16-upscaled)
- [CodeLlama 70b Instruct](https://huggingface.co/arnavgrg/codellama-70b-instruct-nf4-fp16-upscaled)
- [Llama 2 7b](https://huggingface.co/arnavgrg/llama-2-7b-nf4-fp16-upscaled)
- [Llama 2 7b Chat](https://huggingface.co/arnavgrg/llama-2-7b-chat-nf4-fp16-upscaled)
- [Llama 2 13b Chat](https://huggingface.co/arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled)
- [Llama 2 70b Chat](https://huggingface.co/arnavgrg/llama-2-70b-chat-nf4-fp16-upscaled)
- [Mistral 7b](https://huggingface.co/arnavgrg/mistral-7b-nf4-fp16-upscaled)
- [Mistral 7b Instruct](https://huggingface.co/arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled)
- [NousMistral Yarn 7b 128K](https://huggingface.co/arnavgrg/NousResearch-Yarn-Mistral-7b-128k-nf4-fp16-upscaled)
- [Microsoft Phi-2](https://huggingface.co/arnavgrg/phi-2-nf4-fp16-upscaled)
- [Zephyr 7b Beta](https://huggingface.co/arnavgrg/zephyr-7b-beta-nf4-fp16-upscaled)
================================================
FILE: examples/llm_base_model_dequantization/phi_2_dequantization.py
================================================
import logging
import yaml
from huggingface_hub import whoami
from ludwig.api import LudwigModel
from ludwig.utils.hf_utils import upload_folder_to_hfhub
hf_username = whoami().get("name")
base_model_name = "microsoft/phi-2"
dequantized_path = "microsoft-phi-2-dequantized"
save_path = "/home/ray/" + dequantized_path
# Hub repo IDs always use a forward slash, regardless of the OS path separator
hfhub_repo_id = f"{hf_username}/{dequantized_path}"
config = yaml.safe_load(f"""
model_type: llm
base_model: {base_model_name}
quantization:
  bits: 4
input_features:
  - name: instruction
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: none
backend:
  type: local
""")
# Define the Ludwig model object that loads the quantized base model (trainer type is none)
model = LudwigModel(config=config, logging_level=logging.INFO)
model.save_dequantized_base_model(save_path=save_path)
# Optional: Upload to Huggingface Hub
upload_folder_to_hfhub(repo_id=hfhub_repo_id, folder_path=save_path)
================================================
FILE: examples/llm_few_shot_learning/simple_model_training.py
================================================
#!/usr/bin/env python
# # Simple Model Training Example
#
# This is a simple example of how to use the LLM model type to build
# a few-shot classification model. It uses the facebook/opt-350m model
# as the base LLM model.
# Import required libraries
import logging
import shutil
import pandas as pd
import yaml
from ludwig.api import LudwigModel
# clean out prior results
shutil.rmtree("./results", ignore_errors=True)
review_label_pairs = [
{"review": "I loved this movie!", "label": "positive"},
{"review": "The food was okay, but the service was terrible.", "label": "negative"},
{"review": "I can't believe how rude the staff was.", "label": "negative"},
{"review": "This book was a real page-turner.", "label": "positive"},
{"review": "The hotel room was dirty and smelled bad.", "label": "negative"},
{"review": "I had a great experience at this restaurant.", "label": "positive"},
{"review": "The concert was amazing!", "label": "positive"},
{"review": "The traffic was terrible on my way to work this morning.", "label": "negative"},
{"review": "The customer service was excellent.", "label": "positive"},
{"review": "I was disappointed with the quality of the product.", "label": "negative"},
{"review": "The scenery on the hike was breathtaking.", "label": "positive"},
{"review": "I had a terrible experience at this hotel.", "label": "negative"},
{"review": "The coffee at this cafe was delicious.", "label": "positive"},
{"review": "The weather was perfect for a day at the beach.", "label": "positive"},
{"review": "I would definitely recommend this product.", "label": "positive"},
{"review": "The wait time at the doctor's office was ridiculous.", "label": "negative"},
{"review": "The museum was a bit underwhelming.", "label": "neutral"},
{"review": "I had a fantastic time at the amusement park.", "label": "positive"},
{"review": "The staff at this store was extremely helpful.", "label": "positive"},
{"review": "The airline lost my luggage and was very unhelpful.", "label": "negative"},
{"review": "This album is a must-listen for any music fan.", "label": "positive"},
{"review": "The food at this restaurant was just okay.", "label": "neutral"},
{"review": "I was pleasantly surprised by how great this movie was.", "label": "positive"},
{"review": "The car rental process was quick and easy.", "label": "positive"},
{"review": "The service at this hotel was top-notch.", "label": "positive"},
]
df = pd.DataFrame(review_label_pairs)
df["split"] = [0] * 15 + [2] * 10
config = yaml.safe_load("""
model_type: llm
base_model: facebook/opt-350m
generation:
  temperature: 0.1
  top_p: 0.75
  top_k: 40
  num_beams: 4
  max_new_tokens: 64
prompt:
  task: "Classify the sample input as either negative, neutral, or positive."
  retrieval:
    type: semantic
    k: 3
    model_name: paraphrase-MiniLM-L3-v2
input_features:
  -
    name: review
    type: text
output_features:
  -
    name: label
    type: category
    preprocessing:
      fallback_label: "neutral"
    decoder:
      type: category_extractor
      match:
        "negative":
          type: contains
          value: "negative"
        "neutral":
          type: contains
          value: "neutral"
        "positive":
          type: contains
          value: "positive"
preprocessing:
  split:
    type: fixed
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# Initiate model training
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=df, experiment_name="simple_experiment", model_name="simple_model", skip_save_processed_input=True
)
training_set, val_set, test_set, _ = preprocessed_data
# batch prediction
preds, _ = model.predict(test_set, skip_save_predictions=False)
print(preds)
================================================
FILE: examples/llm_finetuning/README.md
================================================
# LLM Fine-tuning
These examples show you how to fine-tune Large Language Models by taking advantage of model parallelism
with [DeepSpeed](https://www.deepspeed.ai/), allowing Ludwig to scale to very large models with billions of
parameters.
The task here is to fine-tune an LLM with over a billion parameters to classify the sentiment of [IMDB movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). To do so, we take a pretrained LLM, attach a classification head,
and fine-tune the weights to improve the LLM's performance on the task. Ludwig does this for you with no machine learning
code, just configuration.
## Prerequisites
- Installed Ludwig with `ludwig[distributed]` dependencies
- Have a CUDA-enabled version of PyTorch installed
- Have access to a machine or cluster of machines with multiple GPUs
- The IMDB dataset used in these examples comes from Kaggle, so make sure you have your credentials set (e.g., `$HOME/.kaggle/kaggle.json`)
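Before launching a long job, the Kaggle prerequisite can be checked with a few lines of Python. The sketch below assumes the two standard ways the Kaggle API picks up credentials: a `kaggle.json` file under `~/.kaggle/`, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The function name `kaggle_credentials_available` is hypothetical and not part of Ludwig.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_available(
    home: Optional[Path] = None, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True if Kaggle credentials are found in a file or in the environment."""
    home = Path.home() if home is None else home
    env = os.environ if env is None else env
    # Option 1: credentials file in the default location.
    has_file = (home / ".kaggle" / "kaggle.json").is_file()
    # Option 2: both environment variables are set to non-empty values.
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return has_file or has_env


if __name__ == "__main__":
    print("Kaggle credentials found:", kaggle_credentials_available())
```

The `home` and `env` parameters exist only so the check can be exercised without touching the real home directory or environment.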
## Running DeepSpeed on Ray
This is the recommended way to use DeepSpeed, which supports auto-batch size tuning and distributed data processing.
There is some overhead from using Ray with small datasets (\<100MB), but in most cases performance should be comparable
to using native DeepSpeed.
From the head node of your Ray cluster:
```bash
./run_train_dsz3_ray.sh
```
### Python API
If you want to run Ludwig programmatically (from a notebook or as part of a larger workflow), you can run the following
Python script using the Ray cluster launcher from your local machine.
```bash
ray submit cluster.yaml train_imdb_ray.py
```
If running directly on the Ray head node, you can omit the `ray submit` portion and run like an ordinary Python script:
```bash
python train_imdb_ray.py
```
## Running DeepSpeed Native
This mode is suitable for datasets small enough to fit in memory on a single machine, as it doesn't make use of
distributed data processing (which requires the Ray backend).
The following example assumes you have 4 GPUs available, but can easily be modified to support your preferred
setup.
From a terminal on your machine:
```bash
./run_train_dsz3.sh
```
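When changing the number of GPUs, it is worth keeping an eye on the effective batch size per optimizer step. The arithmetic below is a sketch that assumes `batch_size` applies to each GPU worker separately; that interpretation is an assumption, not something stated in this README, so check the Ludwig trainer documentation before relying on it. The values 4 and 8 are the `batch_size` and `gradient_accumulation_steps` used in the DeepSpeed configs of this example, and `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(
    batch_size: int, gradient_accumulation_steps: int, num_workers: int = 1
) -> int:
    """Number of examples contributing to one optimizer step."""
    # Assumption: batch_size is per worker; use num_workers=1 if it is global instead.
    return batch_size * gradient_accumulation_steps * num_workers


per_worker = effective_batch_size(batch_size=4, gradient_accumulation_steps=8)
across_gpus = effective_batch_size(batch_size=4, gradient_accumulation_steps=8, num_workers=4)

print("per worker:", per_worker)
print("across 4 GPUs (if batch_size is per worker):", across_gpus)
```

If you reduce the GPU count in `run_train_dsz3.sh`, raising `gradient_accumulation_steps` by the same factor keeps this product unchanged under the stated assumption.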
================================================
FILE: examples/llm_finetuning/imdb_deepspeed_zero3.yaml
================================================
input_features:
  - name: review
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: bigscience/bloom-3b
      trainable: true
      adapter: lora
output_features:
  - name: sentiment
    type: category
trainer:
  batch_size: 4
  epochs: 3
  gradient_accumulation_steps: 8
backend:
  type: deepspeed
  zero_optimization:
    stage: 3
    offload_optimizer:
      device: cpu
      pin_memory: true
================================================
FILE: examples/llm_finetuning/imdb_deepspeed_zero3_ray.yaml
================================================
input_features:
  - name: review
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: bigscience/bloom-3b
      trainable: true
      adapter: lora
output_features:
  - name: sentiment
    type: category
trainer:
  batch_size: 4
  epochs: 3
  gradient_accumulation_steps: 8
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3
        offload_optimizer:
          device: cpu
          pin_memory: true
================================================
FILE: examples/llm_finetuning/run_train_dsz3.sh
================================================
#!/usr/bin/env bash
# Fail fast if an error occurs
set -e
# Get the directory of this script, which contains the config file
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# Train
deepspeed --no_python --no_local_rank --num_gpus 4 ludwig train --config "${SCRIPT_DIR}/imdb_deepspeed_zero3.yaml" --dataset ludwig://imdb
================================================
FILE: examples/llm_finetuning/run_train_dsz3_ray.sh
================================================
#!/usr/bin/env bash
# Fail fast if an error occurs
set -e
# Get the directory of this script, which contains the config file
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# Train
ludwig train --config "${SCRIPT_DIR}/imdb_deepspeed_zero3_ray.yaml" --dataset ludwig://imdb
================================================
FILE: examples/llm_finetuning/train_imdb_ray.py
================================================
import logging
import os
import yaml
from ludwig.api import LudwigModel
config = yaml.safe_load("""
input_features:
  - name: review
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: bigscience/bloom-3b
      trainable: true
      adapter:
        type: lora
output_features:
  - name: sentiment
    type: category
trainer:
  batch_size: 4
  epochs: 3
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3
        offload_optimizer:
          device: cpu
          pin_memory: true
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# initiate model training
(
train_stats, # dictionary containing training statistics
preprocessed_data, # tuple of Ludwig Dataset objects holding the pre-processed data
output_directory, # location of training results stored on disk
) = model.train(
dataset="ludwig://imdb",
experiment_name="imdb_sentiment",
model_name="bloom3b",
)
# list contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)
================================================
FILE: examples/llm_instruction_tuning/train_alpaca_ray.py
================================================
import logging
import os
import yaml
from ludwig.api import LudwigModel
config = yaml.safe_load("""
model_type: llm
base_model: bigscience/bloomz-3b
adapter:
  type: lora
input_features:
  - name: instruction
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: finetune
  batch_size: 4
  epochs: 3
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3
        offload_optimizer:
          device: cpu
          pin_memory: true
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# initiate model training
(
train_stats, # dictionary containing training statistics
preprocessed_data, # tuple of Ludwig Dataset objects holding the pre-processed data
output_directory, # location of training results stored on disk
) = model.train(
dataset="ludwig://alpaca",
experiment_name="alpaca_instruct",
model_name="bloomz3b",
)
# list contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)
================================================
FILE: examples/llm_text_generation/simple_model_training.py
================================================
#!/usr/bin/env python
# # Simple Model Training Example
#
# This is a simple example of how to use the LLM model type to train
# a model on a simple question and answer dataset. It uses the
# facebook/opt-350m model as the base LLM model.
# Import required libraries
import logging
import shutil
import pandas as pd
import yaml
from ludwig.api import LudwigModel
# clean out prior results
shutil.rmtree("./results", ignore_errors=True)
qa_pairs = [
{"Question": "What is the capital of Uzbekistan?", "Answer": "Tashkent"},
{"Question": "Who is the founder of Microsoft?", "Answer": "Bill Gates"},
{"Question": "What is the tallest building in the world?", "Answer": "Burj Khalifa"},
{"Question": "What is the currency of Brazil?", "Answer": "Real"},
{"Question": "What is the melting point of mercury in Celsius?", "Answer": "-38.83"},
{"Question": "What is the most commonly spoken language in the world?", "Answer": "Mandarin"},
{"Question": "What is the diameter of the Earth?", "Answer": "12,742 km"},
{"Question": 'Who wrote the novel "1984"?', "Answer": "George Orwell"},
{"Question": "What is the name of the largest moon of Neptune?", "Answer": "Triton"},
{"Question": "What is the speed of light in meters per second?", "Answer": "299,792,458 m/s"},
{"Question": "What is the smallest country in Africa by land area?", "Answer": "Seychelles"},
{"Question": "What is the largest organ in the human body?", "Answer": "Skin"},
{"Question": 'Who directed the film "The Godfather"?', "Answer": "Francis Ford Coppola"},
{"Question": "What is the name of the smallest planet in our solar system?", "Answer": "Mercury"},
{"Question": "What is the largest lake in Africa?", "Answer": "Lake Victoria"},
{"Question": "What is the smallest country in Asia by land area?", "Answer": "Maldives"},
{"Question": "Who is the current president of Russia?", "Answer": "Vladimir Putin"},
{"Question": "What is the chemical symbol for gold?", "Answer": "Au"},
{"Question": "What is the name of the famous Swiss mountain known for skiing?", "Answer": "The Matterhorn"},
{"Question": "What is the largest flower in the world?", "Answer": "Rafflesia arnoldii"},
]
df = pd.DataFrame(qa_pairs)
config = yaml.safe_load("""
input_features:
  - name: Question
    type: text
output_features:
  - name: Answer
    type: text
model_type: llm
generation:
  temperature: 0.1
  top_p: 0.75
  top_k: 40
  num_beams: 4
  max_new_tokens: 5
base_model: facebook/opt-350m
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# initiate model training
(
train_stats, # dictionary containing training statistics
preprocessed_data, # tuple of Ludwig Dataset objects holding the pre-processed data
output_directory, # location of training results stored on disk
) = model.train(
dataset=df, experiment_name="simple_experiment", model_name="simple_model", skip_save_processed_input=True
)
training_set, val_set, test_set, _ = preprocessed_data
# batch prediction
preds, _ = model.predict(test_set, skip_save_predictions=False)
print(preds)
================================================
FILE: examples/llm_zero_shot_learning/simple_model_training.py
================================================
#!/usr/bin/env python
# # Simple Model Training Example
#
# This is a simple example of how to use the LLM model type to set up
# a zero-shot classification model. It uses the facebook/opt-350m model
# as the base LLM model.
# Import required libraries
import logging
import shutil
import pandas as pd
import yaml
from ludwig.api import LudwigModel
# clean out prior results
shutil.rmtree("./results", ignore_errors=True)
review_label_pairs = [
{"review": "I loved this movie!", "label": "positive"},
{"review": "The food was okay, but the service was terrible.", "label": "negative"},
{"review": "I can't believe how rude the staff was.", "label": "negative"},
{"review": "This book was a real page-turner.", "label": "positive"},
{"review": "The hotel room was dirty and smelled bad.", "label": "negative"},
{"review": "I had a great experience at this restaurant.", "label": "positive"},
{"review": "The concert was amazing!", "label": "positive"},
{"review": "The traffic was terrible on my way to work this morning.", "label": "negative"},
{"review": "The customer service was excellent.", "label": "positive"},
{"review": "I was disappointed with the quality of the product.", "label": "negative"},
{"review": "The scenery on the hike was breathtaking.", "label": "positive"},
{"review": "I had a terrible experience at this hotel.", "label": "negative"},
{"review": "The coffee at this cafe was delicious.", "label": "positive"},
{"review": "The weather was perfect for a day at the beach.", "label": "positive"},
{"review": "I would definitely recommend this product.", "label": "positive"},
{"review": "The wait time at the doctor's office was ridiculous.", "label": "negative"},
{"review": "The museum was a bit underwhelming.", "label": "neutral"},
{"review": "I had a fantastic time at the amusement park.", "label": "positive"},
{"review": "The staff at this store was extremely helpful.", "label": "positive"},
{"review": "The airline lost my luggage and was very unhelpful.", "label": "negative"},
{"review": "This album is a must-listen for any music fan.", "label": "positive"},
{"review": "The food at this restaurant was just okay.", "label": "neutral"},
{"review": "I was pleasantly surprised by how great this movie was.", "label": "positive"},
{"review": "The car rental process was quick and easy.", "label": "positive"},
{"review": "The service at this hotel was top-notch.", "label": "positive"},
]
df = pd.DataFrame(review_label_pairs)
config = yaml.safe_load("""
model_type: llm
base_model: facebook/opt-350m
generation:
  temperature: 0.1
  top_p: 0.75
  top_k: 40
  num_beams: 4
  max_new_tokens: 64
prompt:
  task: "Classify the sample input as either negative, neutral, or positive."
input_features:
  -
    name: review
    type: text
output_features:
  -
    name: label
    type: category
    preprocessing:
      fallback_label: "neutral"
    decoder:
      type: category_extractor
      match:
        "negative":
          type: contains
          value: "negative"
        "neutral":
          type: contains
          value: "neutral"
        "positive":
          type: contains
          value: "positive"
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# initiate model training
(
train_stats, # dictionary containing training statistics
preprocessed_data, # tuple of Ludwig Dataset objects holding the pre-processed data
output_directory, # location of training results stored on disk
) = model.train(
dataset=df, experiment_name="simple_experiment", model_name="simple_model", skip_save_processed_input=True
)
training_set, val_set, test_set, _ = preprocessed_data
# batch prediction
preds, _ = model.predict(test_set, skip_save_predictions=False)
print(preds)
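# --- Illustrative sketch (not part of Ludwig) ---
# The `category_extractor` decoder configured above maps free-form generated
# text to a label through `contains` rules and uses `fallback_label` when no
# rule matches. The helper below is a hypothetical, simplified stand-in for
# that idea, assuming rules are tried in the order they are listed. Ludwig's
# actual matching logic may differ (for example in rule order or case handling).
def extract_category(generated_text, match_rules, fallback_label):
    """Return the first label whose `contains` value occurs in the text."""
    for label, rule in match_rules.items():
        if rule["type"] == "contains" and rule["value"] in generated_text:
            return label
    # No rule matched, so fall back to the configured default label
    return fallback_label


# Hypothetical rule set in which each label is matched by its own name
example_rules = {
    "negative": {"type": "contains", "value": "negative"},
    "neutral": {"type": "contains", "value": "neutral"},
    "positive": {"type": "contains", "value": "positive"},
}
print(extract_category("I would call this review negative.", example_rules, "neutral"))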
================================================
FILE: examples/mnist/README.md
================================================
# MNIST Hand-written Digit Classification
This API example is based on [Ludwig's MNIST Hand-written Digit image classification example](https://ludwig-ai.github.io/ludwig-docs/examples/#image-classification-mnist).
### Examples
| File                               | Description                                                                                                               |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| simple_model_training.py           | Demonstrates using the Ludwig API to train a model.                                                                       |
| advanced_model_training.py         | Demonstrates a method for assessing alternative model architectures.                                                      |
| assess_model_performance.py        | Assesses model performance on a hold-out test data set. Shows how to load a previously trained model to make predictions. |
| visualize_model_test_results.ipynb | Example of extracting training statistics and generating custom visualizations.                                           |
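
The scripts in this directory train into `./results` and later load models from paths such as `./results/multiple_experiment_Option3/model`. The following is a minimal sketch of that naming pattern, inferred only from the paths used in these examples and assuming a freshly cleaned results directory. `model_dir` is a hypothetical helper, not a Ludwig function, and Ludwig may name directories differently when an earlier run with the same names already exists.

```python
def model_dir(experiment_name, model_name, results_root="./results"):
    """Build the model path the way the example scripts appear to expect it."""
    # Pattern assumed from the examples: <root>/<experiment>_<model>/model
    return f"{results_root}/{experiment_name}_{model_name}/model"


# Path loaded by assess_model_performance.py
print(model_dir("multiple_experiment", "Option3"))
```

A path built this way can be passed to `LudwigModel.load`, as `assess_model_performance.py` does with a hard-coded string.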
================================================
FILE: examples/mnist/advanced_model_training.py
================================================
#!/usr/bin/env python
# # Multiple Model Training Example
#
# This example trains multiple models and extracts training statistics
import glob
import logging
import os
import shutil
from collections import namedtuple
import yaml
# ## Import required libraries
from ludwig.api import LudwigModel
from ludwig.constants import TRAINER
from ludwig.datasets import mnist
from ludwig.visualize import learning_curves
# clean out old results
shutil.rmtree("./results", ignore_errors=True)
shutil.rmtree("./visualizations", ignore_errors=True)
file_list = glob.glob("./data/*.json")
file_list += glob.glob("./data/*.hdf5")
for f in file_list:
    try:
        os.remove(f)
    except FileNotFoundError:
        pass
# read in base config
with open("./config.yaml") as f:
    base_model = yaml.safe_load(f.read())
# Specify named tuple to keep track of training results
TrainingResult = namedtuple("TrainingResult", ["name", "train_stats"])
# specify alternative architectures to test
FullyConnectedLayers = namedtuple("FullyConnectedLayers", ["name", "fc_layers"])
list_of_fc_layers = [
FullyConnectedLayers(name="Option1", fc_layers=[{"output_size": 64}]),
FullyConnectedLayers(name="Option2", fc_layers=[{"output_size": 128}, {"output_size": 64}]),
FullyConnectedLayers(name="Option3", fc_layers=[{"output_size": 128}]),
]
#
list_of_train_stats = []
# load and split MNIST dataset
training_set, test_set, _ = mnist.load(split=True)
# ## Train models
for model_option in list_of_fc_layers:
    print(">>>> training: ", model_option.name)
    # set up Python dictionary to hold model training parameters
    config = base_model.copy()
    config["input_features"][0]["fc_layers"] = model_option.fc_layers
    config[TRAINER]["epochs"] = 5
    # Define the Ludwig model object that drives model training
    model = LudwigModel(config, logging_level=logging.INFO)
    # initiate model training
    train_stats, _, _ = model.train(
        training_set=training_set,
        test_set=test_set,
        experiment_name="multiple_experiment",
        model_name=model_option.name,
    )
    # save training stats for later use
    list_of_train_stats.append(TrainingResult(name=model_option.name, train_stats=train_stats))
    print(">>>>>>> completed: ", model_option.name, "\n")
# generating learning curves from training
option_names = [trs.name for trs in list_of_train_stats]
train_stats = [trs.train_stats for trs in list_of_train_stats]
learning_curves(
    train_stats, "label", model_names=option_names, output_directory="./visualizations", file_format="png"
)
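# --- Illustrative sketch (not part of the original example) ---
# `base_model.copy()` in the loop above is a shallow copy, so the nested
# `input_features` list is shared between `base_model` and every `config`.
# That is harmless here because `fc_layers` and `epochs` are overwritten on
# each pass, but it can surprise when only some options set a key. The
# hypothetical helper below contrasts a shallow copy with `copy.deepcopy`
# on a small stand-in dictionary (the keys are made up for the demonstration).
import copy


def demo_copy_semantics():
    """Return (leaked, isolated) for a shallow copy and a deep copy."""
    base = {"input_features": [{"name": "image_path"}], "trainer": {"epochs": 5}}
    shallow = base.copy()
    shallow["input_features"][0]["fc_layers"] = [{"output_size": 64}]
    # The nested feature dict is shared, so the change is visible through `base`
    leaked = "fc_layers" in base["input_features"][0]

    fresh = {"input_features": [{"name": "image_path"}], "trainer": {"epochs": 5}}
    deep = copy.deepcopy(fresh)
    deep["input_features"][0]["fc_layers"] = [{"output_size": 64}]
    # The deep copy owns its nested objects, so `fresh` is left untouched
    isolated = "fc_layers" not in fresh["input_features"][0]
    return leaked, isolated


print(demo_copy_semantics())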
================================================
FILE: examples/mnist/assess_model_performance.py
================================================
#!/usr/bin/env python
#
# Load a previously saved model and make predictions on the test data set
#
import os.path
# ## Import required libraries
import pandas as pd
from sklearn.metrics import accuracy_score
from ludwig.api import LudwigModel
from ludwig.datasets import mnist
# create data set for predictions
test_data = {"image_path": [], "label": []}
dataset = mnist.Mnist()
test_dir = os.path.join(dataset.processed_dataset_path, "testing")
for label in os.listdir(test_dir):
    files = os.listdir(os.path.join(test_dir, label))
    test_data["image_path"] += [os.path.join(test_dir, label, f) for f in files]
    test_data["label"] += len(files) * [label]
# collect data into a data frame
test_df = pd.DataFrame(test_data)
print(test_df.head())
# retrieve a trained model
model = LudwigModel.load("./results/multiple_experiment_Option3/model")
# make predictions
pred_df, _ = model.predict(dataset=test_df)
print(pred_df.head())
# print accuracy on test data set
print("predicted accuracy", accuracy_score(test_df["label"], pred_df["label_predictions"]))
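# --- Illustrative sketch (not part of the original example) ---
# The accuracy printed above is the fraction of predictions that equal the
# true labels. The hypothetical helper below spells that out in plain Python.
# It compares values as strings, which is an assumption made here because the
# labels in `test_df` come from directory names; `accuracy_score` itself
# compares the values as given.
def simple_accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label match as strings."""
    y_true = [str(v) for v in y_true]
    y_pred = [str(v) for v in y_pred]
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Three of the four predictions match the labels
print(simple_accuracy(["0", "1", "2", "3"], [0, 1, 2, 4]))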
================================================
FILE: examples/mnist/config.yaml
================================================
input_features:
  - name: image_path
    type: image
    preprocessing:
      num_processes: 4
    encoder: stacked_cnn
    conv_layers:
      - num_filters: 32
        filter_size: 3
        pool_size: 2
        pool_stride: 2
      - num_filters: 64
        filter_size: 3
        pool_size: 2
        pool_stride: 2
        dropout: 0.4
    fc_layers:
      - output_size: 128
        dropout: 0.4
output_features:
  - name: label
    type: category
trainer:
  epochs: 5
================================================
FILE: examples/mnist/simple_model_training.py
================================================
#!/usr/bin/env python
# # Simple Model Training Example
#
# This script is the Python API counterpart of the Ludwig command-line MNIST example
# (https://ludwig-ai.github.io/ludwig-docs/latest/examples/mnist/).
import logging
import shutil
import yaml
from ludwig.api import LudwigModel
from ludwig.datasets import mnist
# clean out prior results
shutil.rmtree("./results", ignore_errors=True)
# set up Python dictionary to hold model training parameters
with open("./config.yaml") as f:
    config = yaml.safe_load(f.read())
# Define the Ludwig model object that drives model training
model = LudwigModel(config, logging_level=logging.INFO)
# load and split MNIST dataset
training_set, test_set, _ = mnist.load(split=True)
# initiate model training
# train_stats: training statistics; output_directory: location of the training results saved to disk
train_stats, _, output_directory = model.train(
training_set=training_set,
test_set=test_set,
experiment_name="simple_image_experiment",
model_name="single_model",
skip_save_processed_input=True,
)
================================================
FILE: examples/mnist/visualize_model_test_results.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ludwig Visualization Demonstration"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.simplefilter('ignore')\n",
"from ludwig.api import LudwigModel\n",
"from ludwig.datasets import mnist\n",
"from ludwig.visualize import compare_performance, compare_classifiers_performance_from_pred, \\\n",
" confusion_matrix\n",
"from ludwig.utils.data_utils import load_json\n",
"import pandas as pd\n",
"import os\n",
"import os.path\n",
"import shutil\n",
"\n",
"shutil.rmtree('./viz2', ignore_errors=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare test data set for use"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" image_path label\n",
"0 /opt/project/examples/mnist/data/mnist_png/tes... 0\n",
"1 /opt/project/examples/mnist/data/mnist_png/tes... 0\n",
"2 /opt/project/examples/mnist/data/mnist_png/tes... 0\n",
"3 /opt/project/examples/mnist/data/mnist_png/tes... 0\n",
"4 /opt/project/examples/mnist/data/mnist_png/tes... 0\n"
]
}
],
"source": [
"# create test dataframe\n",
"test_data = {'image_path': [], 'label': []}\n",
"dataset = mnist.Mnist()\n",
"test_dir = os.path.join(dataset.processed_dataset_path, 'testing')\n",
"for label in os.listdir(test_dir):\n",
" files = os.listdir(os.path.join(test_dir, label))\n",
" test_data['image_path'] += [os.path.join(test_dir, label, f) for f in files]\n",
" test_data['label'] += len(files) * [label]\n",
"\n",
"# collect data into a data frame\n",
"test_df = pd.DataFrame(test_data)\n",
"print(test_df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate predictions on the test data set for the different neural network options"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# get list of models to visualize results\n",
"models_list = ['Option1', 'Option2', 'Option3']\n",
"test_stats_list = []\n",
"preds_list = []\n",
"\n",
"for m in models_list:\n",
" # retrieve a trained model\n",
" model = LudwigModel.load('./results/multiple_experiment_'+ m + '/model')\n",
"\n",
" # make predictions\n",
" test_stats, pred_df, _ = model.evaluate(dataset=test_df, collect_predictions=True, collect_overall_stats=True)\n",
" \n",
" # collect test statistics\n",
" preds_list.append(pred_df['label_predictions'].astype('int'))\n",
" test_stats_list.append(test_stats)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Show model performance on test data set"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAEYCAYAAAAJeGK1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdeVwV1f/H8dflApcrIJuCCC647wluuOKC4oZ7pSn6TSszNU3NzF+laZYtSrlWZllpaqVZ6Df3LXFJEHPDDUFBNgWR9e7z+4Ovt0jNNbnq5/l4+IA7Z5Yz5/qYNzNz5oxKURQFIYQQwsbYlXYFhBBCiBuRgBJCCGGTJKCEEELYJAkoIYQQNkkCSgghhE2SgBJCCGGTJKAecpcvX2bw4MEEBgYye/bs0q6OuImYmBjCwsJKuxp3rXbt2pw/f/6W86WkpFC7dm1MJtMdb+NelhWPJvvSrsDjqGPHjly+fBm1Wo1Wq6Vdu3a8+eabODs73/G6Vq9ejYeHB4cOHUKlUv0LtRX3Q9OmTdm0aVNpV0OIh4qcQZWSTz/9lLi4OH766SeOHTvG4sWL72h5RVGwWCykpqZSvXr1uwon+Uv1wZB2FuLuSECVMh8fH9q2bcuZM2cAOHz4MAMHDqRp06b06tWLAwcOWOeNiIggMjKSgQMH8sQTTzB58mTWrVvH0qVLCQwMZO/evRgMBmbNmkWbNm1o06YNs2bNwmAwAHDgwAHatWvH559/TuvWrXn99deZP38+L7/8MpMmTSIwMJDw8HASExP57LPPaNmyJSEhIezZs8dahzVr1tCtWzcCAwPp1KkTq1atspZdW/+XX35Jy5YtadOmDWvWrLGW63Q6Zs+eTYcOHWjSpAmDBg1Cp9Pdcr//Li0tjTFjxhAcHEyLFi2YMWMGABaLhUWLFtGhQwdatmzJ5MmTycvLA/68fLRmzRpCQkJo1qwZK1eu5MiRI4SHh9O0aVPregDWrl3LwIEDmTFjBk2aNKFr167s27fvjtrhr+18bdo1n3/+OW3btiUwMJCwsDDrum/n+7tZ+/5dRkYGL774Is2bN6dz5858//331rL58+czbtw4Jk+eTGBgID169ODo0aM3Xddf7dy5kz59+hAUFERISAjz58+/bp41a9ZY92Hp0qXW6RaLhc8//5zQ0FBatGjBuHHjyMnJua3tiseQIh64Dh06KNHR0YqiKEpqaqrSvXt3JTIyUklPT1eaN2+u7Ny5UzGbzcqePXuU5s2bK1lZWYqiKMqQIUOUkJAQ5fTp04rRaFQMBoPy2muvKXPnzrWu++OPP1aefPJJ5fLly0pWVpby9NNPK5GRkYqiKMr+/fuVunXrKh988IGi1+uVoqIiZd68eUqDBg2U3bt3K0ajUXn11VeVDh06KIsWLVIMBoOyevVqpUOHDtb179ixQzl//rxisViUAwcOKI0aNVKOHTtWYv0ff/yxYjAYlJ07dyqNGjVScnJyFEVRlOnTpytDhgxR0tPTFZPJpMTGxip6vf6W+/1XJpNJCQ8PV2bNmqUUFBQoOp1OOXjwoKIoivLDDz8ooaGhyoULF5T8/Hxl9OjRyqRJkxRFUZTk5GSlVq1ayptvvqnodDrlt99+Uxo0aKCMGjVKuXz5spKenq4EBwcrBw4cUBRFUdasWaPUrVtX+eqrrxSDwaBs2LBBCQoKUq5cuXLb7fDXdt6/f7/Stm1bRVEUJSEhQWnXrp2Snp5urdv58+dv+/u7Wfv+3TPPPKNMmzZN0el0yokTJ5QWLVooe/fuVRRFsX7vO3fuVEwmk/LRRx8pTz755E3/z9aqVUtJSkqy1uPkyZOK2WxW4uPjlZYtWypbtmwp0c6vvPKKUlBQoJw8eVJp0aKF9f/7smXLlCeffFJJS0tT9Hq98uabbyqvvPJKiWWNRuNN6yEeLxJQpaBDhw5K48aNlSZNmijt27dXpk2bphQVFSmfffaZ9Y
B6zfDhw5W1a9cqilIcUB9//HGJ8r8HVKdOnZSdO3daP+/evdsaMPv371fq16+v6HQ6a/m8efOU//znP9bP27ZtUxo3bqyYTCZFURQlLy9PqVWrlnL16tUb7suoUaOUZcuWWdffsGHDEgeY4OBgJS4uTjGbzUrDhg2V+Pj469Zxq/3+q0OHDiktWrS44UFs6NChyvLly62fExISlHr16ilGo9F68LsWCoqiKM2bN1c2bNhg/TxmzBjlq6++UhSlOKBat26tWCwWa3n//v2Vn3766bba4e/t/NeASkpKUoKDg5Xo6GjFYDCUWM+tvr+bte/fpaamKnXq1FHy8vKs0z766CPltddeUxSl+HsfNmyYtezMmTNKw4YNb7hvilIyoP7unXfeUWbNmqUoyp8hc/bsWWv5+++/r7z++uuKoihK165drSGpKIqSkZFx3XckASWukU4SpWThwoW0atWqxLTU1FQ2btzIjh07rNNMJhMtWrSwfvb19f3H9WZmZlKxYkXr54oVK5KZmWn97OHhgUajKbGMl5eX9XcnJyc8PDxQq9XWzwCFhYWULVuWXbt2sXDhQpKSkrBYLOh0OmrVqmVd3t3dHXv7P/9babVaCgsLuXLlCnq9nkqVKl1X59vZ72vS0tKoWLFiiW38dd/9/Pysn/38/DCZTGRlZd1wXzUazXWfCwsLrZ99fHxK3Nv7a1veqh1u1M7XVKlShalTpzJ//nzOnj1LmzZtmDJlCj4+Prf8/m7WvjdqCzc3N1xcXEqs69ixY9bP5cqVs/7u5OSEXq/HZDLdsG3/6o8//uCjjz7izJkzGI1GDAYDXbt2LTHPX/+f+vn5cfr0aaD4ux49ejR2dn/eXbCzsyvxHQlxjdyDsiG+vr707t2bmJgY67/Dhw/zwgsvWOe5VWcIb29vUlNTrZ/T0tLw9va+7eX/icFg4OWXX2b48OFER0cTExNDu3btUG5jQPxrB+zk5OTrym5nv/86b1pa2g07Hnh7e3Px4kXr59TUVOzt7UuE0J3IyMgosW/X2vJ22uFW7RweHs7KlSvZsWMHKpWKjz76yLoP//T93S5vb2+uXr1Kfn5+iXX5+Pjc8br+buLEiXTq1Ildu3YRGxvLwIEDr/s/kJaWZv09NTXVug8VKlRgyZIlJb7ro0eP3pd6iUePBJQN6dWrFzt27OC3337DbDaj1+s5cOAA6enpt72OHj16sHjxYrKzs8nOzmbhwoWEh4ffl/oZDAYMBgOenp7Y29uza9cuoqOjb2tZOzs7+vfvz3vvvUdGRgZms5m4uDgMBsMd7XejRo0oX748c+bMobCwEL1eT2xsLAA9e/bk66+/Jjk5mYKCAiIjI+nWrdstzwhuJjs7m2+++Qaj0civv/5KQkICISEh99QOAOfOnWPfvn0YDAYcHR3RaDTWM4r79f35+voSGBjI3Llz0ev1nDx5kh9//JFevXrd8br+rqCgADc3NzQaDUeOHGH9+vXXzbNo0SKKioo4c+YMa9eupXv37gAMGjSIjz/+2PqHRHZ2Nlu3br3nOolHk1zisyG+vr4sWrSIDz/8kIkTJ2JnZ0ejRo2YPn36ba/jpZdeoqCgwHog6tq1Ky+99NJ9qZ+LiwtvvPEG48ePx2Aw0KFDBzp27Hjby7/22mvMmTOHAQMGUFhYSJ06dVi6dOkd7bdarebTTz/lnXfeoUOHDkDx2UiTJk3o378/GRkZDBkyBL1eT5s2bXjzzTfven8bNWrE+fPnCQ4Oply5csybNw8PDw+Ae2oHg8HAnDlzSEhIwMHBgcDAQGsPwvv5/c2dO5dp06bRtm1bypYty9ixY6+7rHw3pk2bxvvvv8+MGTNo3rw53bp1Izc3t8Q813oOKorC8OHDadOmDQBDhw61TsvMzMTLy4vu3bsTGhp6z/USjx6VcjvXZ4R4zKxdu5YffviBlStXlnZVhHhsySU+IYQQNkkCSgghhE2SS3xCCCFskpxBCSGEsEmPbC++Q4cOod
Vq73g5s9lsfUj1dhmNRhwcHB7ItmR7trG9R3nfZHs3ptfrady48R1vS9y9Rzag1Go1devWvePlLl26RPny5e9omfPnz1OlSpUHsi3Znm1s71HeN9nejcXHx9/xdsS9kUt8QgghbJIElBBCCJskASWEEMImSUAJIYSwSRJQQgghbJIElBBCCJskASWEEMImSUAJIYSwSRJQQgghbNIjO1js8eMnqF+/XmlXQwjxiIiPj7+r0WnE3Xtkhzqys1NRdcqG0q6GEOIR8euwaqVdhceOXOITQghhkySghBBC2CQJKCGEEDZJAkoIIYRNkoASQghhkySghBBC2CQJKCGEEDbpkX0OSgjxcDFkniMv7ldMVy5i51QW57rt0NZsgcpOXWI+xWwk/8gWdMnHsBTl4VC+Cq6Nu+Hg6YdiMpD3xyb0KSew6PJx9A7AJbA7Du4VsBh15B/eiP5iPBZDEY4+1XEN7IZ9WW8Us4nCU3soPLUXi74Ah3KVcQ3sjoNXpVJqDQGP8EgS8fHxdPv6XGlXQwhxG3J//4krO5bi4uLCE088wfnz50lJScEpoAne/d9EpS7+W9qsyydj5esYMxOpWrUqPj4+HD58GL3RhEf7Z8n/YxPGrGSqV6+Ol5cXcXFxGC0KHh1GkBf7C6YradSsWRN3d3fi4uIwY4dXjwnkxfyC/uIJqlatSsWKFYmLi6NIp8er6xhcGnUBih/UlZEkHiy5xCeEKFXG7Itc2fkVAwYMICUlhT179pCUlMSCBQvQJcaSF/df67w5O7+CKyn88ssvJCYm8ttvv3HhwgU6hLTjyvYvUBdcYvPmzZw9e5Y9e/Zw/vx5WrVozpWtn6Ex5LJz505Onz5NdHQ0586dI7BRAy7/PBtDajzffvstiYmJREdHc+HCBcK6dCZ7y6eY86+UYus83iSghBClKu/QerROGhYuXMjly5fx9/fnm2++YfTo0bRr14682CgAFEWh8FQ0ERERhIeH88orr+Dh4YHZbGbx4sXY2dnx3HPP0blzZ1544QXKly+PRlO8XoAxY8YQEhLCkCFD8PX1xdPTk/nz5wPQt29fhgwZQmRkJAEBAZjNZhYtWoSdYibvj42l1jaPOwkoIUSpMlw6T2BgIN7e3mzdupWLFy+ydu1aAMLCwjDlpGEx6sBiwqLLp0qVKgAkJCRQUKQjIyOD2rVrU7lyZapWrWotu5qbR1ZWFo0bN6Z8+fLWsrNnz5KVlUVOTg6tWrXC2dmZsLAwAH744QeSkpL47bffqFatGjVq1MB4KelBN4n4HwkoIUSpshTmUKFCBQByc3NL/Lw2vejMAQzpZ1G7eLJlyxYAZs2axaeLFtK4cWMAfH192bx5MwAffvghXyz5nJo1a1rLNm3aBMAnn3zCsmXL8PPzs27jn7ZvLsj5F/de/BMJKCFEqbIr40ZmZiYArq6uJX5em3456kPSl7+KOT+bPXv20L17d44cOUKZMmXYt28fAMnJyWzevJk+ffpw8uRJ1Go1sbGxWCwWLl68yLp163jqqadITEzEZDJx9OhRDAYDGRkZ/7h9dRm3B9cYogQJKCFEqXIsV5m4uDiysrLo0KEDZcuWpUePHgBs27YNgKVLl7Jy5UoAHBwciI+PZ8iQIbz//vvUqVOH/fv3k5KSglar5dChQwwePJh58+ZRu3ZtduzYQVZWFi4uLkRHRzNo0CC++OILatSowaZNm8jPz7duJzw8HC8vL1q3bs2FCxc4e/YsDuUql07DCHkOSghRulwadyft0H+ZMGECn376KVevXgXg66+/ZuvWrQC0adOGsmXLFs/v4kJiYiL5+fm4uLhw+vRpRowYAYCXlxcXLlywlh0/fpwXX3wRAD8/P06ePGktO3z4MGPHjgXg+++/Z+DAgUydOpWpU6eSl5dHREQERouCyxNdH3STiP+RgBJClCrH8lVwa/MM33zzDRs2bKBZs2YkJiZy6tQpVPaOKCYD3bt3R60ufmD3ypUr1KlTh2rVqn
HlyhUOHDgADlq0NVqQcvYA9erVIyAggEuXLnHw4EHsnFzQVm/G6dMxNGjQgCpVqpCRkUFsbCx2ZdzwGfQeV7Z/QZ8+fWjQoAF+fn7s27eP3NxcPEJHYl+2XCm30ONLHtQVQtgEXcoJ8uP+i/HKRey0xSNJONcNIf/YNnRJh0FR0FSqj0rtQNG5GMx5WagcNDhVbYzrE2HYacuSf3gjRYmHMOdnoXIsgzYgEJdGXbDTOJMXtwFd0mHMBVew05RBW60pzg07o9a6opgM5B/bTuGp6D9HkgjqiaZCDWv95EHdB08CSgghboME1IMnnSSEEELYJAkoIYQQNkkCSgghhE2SgBJCCGGTJKCEEELYJAkoIYQQNum2Aio9PZ1Ro0bRpUsXQkNDeeeddzAYDDedPzc3lxUrVlg/Z2Rk8PLLL991JSMjIwkJCSEwMPCu1yGEEOLhcsuAUhSFMWPGEBoayubNm9m0aROFhYVERkbedJnc3FzruFkAPj4+zJs3764r2aFDB3744Ye7Xl4IIcTD55ZDHe3fvx+NRkP//v0BUKvVTJ06lU6dOuHv78+ePXvIz88nIyODXr16MWbMGObMmcOFCxfo3bs3rVq1YvDgwbz44ousX78evV7P9OnTOXbsGGq1milTphAcHMzatWvZvn07RUVFJCcnExoayuTJkwGsw+kLIYR4fNwyoM6cOUP9+vVLTHNxccHX1xez2czRo0eJiopCq9UyYMAAQkJCmDhxImfOnOHnn38GICUlxbrstUt/UVFRJCQkMGLECOt7WuLj41m3bh2Ojo507dqViIgIfH1972rHFMVC0uwed7WsEEIAYNSBgxNQfHwSD9Y9DxbbqlUrPDw8AOjcuTOxsbGEhobedP7Y2FiGDBkCQPXq1alYsSKJiYkAtGzZ0voelurVq3Px4sW7DiiVyg6my3tchBD3YPrV0q7BY+2W96Bq1KjB8ePHS0zLz88nLS0NtVqNSqUqUfb3z3fC0dHR+rtarcZsNt/1uoQQQjzcbhlQLVu2pKioiHXr1gFgNpuZPXs2ffv2RavVEh0dTU5ODjqdjq1btxIUFISzszMFBQU3XF/Tpk2JiooCIDExkbS0NKpVq3Yfd0kIIcSj4JYBpVKpWLhwIRs3bqRLly6EhYWh0WiYMGECAI0aNWLs2LH06tWLsLAwGjZsiIeHB0FBQfTs2ZP333+/xPqeeeYZFEUhPDycV155hffee6/EmdONfPDBB7Rr146ioiLatWvH/Pnz72GXhRBCPAzu6XUba9eu5dixY7z11lv3s073RXx8PHVXB5d2NYQQD7O/3IOKj4+X1208YDKShBBCCJt0T734+vXrR79+/e5XXYQQQggrOYMSQghhk+75OSghhLgbiqLw+0UzMakWtA7Qpbo9/mVv/DfzlSKF3edNpOQqVHFX0aW6PY7q4kda0vIs7Dpv5lKBgocW2lS2p6r7n+s5ednMngtm9Cao6q6ibRV7HOzgwEUzp7MsGMwQ4K6iQ4A9ZRzu/jEZcf9JQAkhHriLuRYGriliz4U/n3W0t4PRzRyZ00WD2u7PoFh22MC4jTpy9X8u7+eq4qveWn6/aGb6Lj0my59lKmB0Mwdeb6th+M9FbEq4vecpvZ1VfNNHS1gNOSzaCrnEJ4R4oBRF4akfiziS48yiRYtIT0/n5MmTPD9yFJ8cMBC5/883Jfx23sTwn3U0bd2R6Oho0tPT2bBhA14BDemyvJA3duh58ulBHD58mEuXLnHixAnGjB3LgoNG/ObmE52hZfbs2SQmJpKRkcG2bdsA8PPzY/369SQnJ5ORkcHWrVupWPMJ+n9fyMVcy82qLh4wCSghxAO1PdHM3mQzH374IaNGjWLhwoXEx8ezaNEiunXrxvvRBvSm4qdf5v1uoKKfH1FRUTg4OPDcc89Rp04dfvnlFxwcHHBzc+Pbb79Fq9UyZMgQrl69yrx586zdwVesWMGECRP46quvGDVqFAcOHADA29sbe3t7Zs
yYwZdffkmnTp344YcfKDDCupOmUmsbUZIElBDigdqZZEKtVjNs2DDi4+OZOXMmU6ZMAWD48OFcLlQ4can4LCb+koVmzZpRpkwZ1q1bx6b/rmfLli1UqVKFjh07oigKBoOBtLQ0du7YRlJSEgAGgwE/Pz969erF6tWr2b9/P5cvX7Y+s3n48GG6du3KkiVLeP3110lLS6NKlSqoVCouFcoZlK2QgBJCPFDJuQq+vr5oNBrS0tIArD8DAgIAuHC1OCR8XFScO3cOgG7dutGyTTvat28PQNWqVcnNzeX555+nTZs2FBbpGThwIBMnTiQhIYFmzZoBMGjQIFatWsWuXbs4cOAAZcqUQVEUfF1U1nJfX1+WLFmCoii08FM/sLYQ/0wCSgjxQDnZQ1FREQD29vYlfhYWFgLQZ3UR3h/msT3RzJEjR5gwYQJ16tRhy5Yt1vUUFRXh4+PDwoULiYmJITQ0lP/+97+8++671KtXzzoe6JEjRyhXrhwzZswgKCiIvn37ApCWrzB69GiWL1/OypUrGTduHC6OSCcJGyIBJYR4oKp72JGVlUVqaiq1a9fG3t7e+s65o0ePAsWv7hkw7EXKlCkDQGRkJOXLl8fJyYm4uDgAtmzZQsOGDXFzc2PLli0Unt5NVFQUGo2GZs2aER8fj8ViITMzE2d7i/Uszd7eHpVKxfvvv8+CBQuYM2cOgwcPxmQykW+AH0/IPShbcU9j8dkyGYtPCNt07oqFWvPzeX7kiyxevJiYmBgqVqyIm5sbTzzxBAkJCaxcuZKBAwdSsWJF0tLS2LVrFykpKQQEBNCyZUvmzp3LxIkT8fT05OTJkzg4OPDjjz/SvXt33N3dadiwIefOnWPp0qUMHTqU7777jk6dOqFSqWjUqBHVqlXj999/ByA9Pd1atwYNGvB0QC4Le2iLJ8hYfKVKzmWFEA9UNQ87Xm7hSOSnn3L69Gn69etHTk4OS5cutb68dPXq1Rw5coS8vDwAli1bRrNmzTh48CBTp05l586d1PS048LVbJo0acLQoUPx8/Pjiy++YOXKldb7ViNHjmTbtm20adOGhQsX8sUXX5CVlYVGo2Hq1KnX1a2oqIiyGnlY11bIGZQQ4oEzWxTm/25g7j4DybnFh6DWldTM7KAh6rSJz2MNGMxQw9MOb2cVMalmCozFywa4qxjdzJGXWzgSm2bm9W16diWZuXYga1rRjpkdnGhVSc2MXXqWHDKQqwe1CjoEqHmyngMf7TWQlHN9b70G3nasHqClptf/OkrIGVSpkoASQpQaRVHIKFDQ2qtwc7r5mYvZonCpUEGtgnJlVNe9udtoVrhcqOChVeFkf+MyrzIq6/BIt00CqlTJJT4hRKlRqVRUcLl1aKjt/nk+B7UKX9cbl/9TmbBt0otPCCGETZKAEkIIYZMkoIQQQtgkCSghhBA2SQJKCCGETZKAEkIIYZMkoIQQQtgkCSghhBA2SQJKCCGETXpkR5JQLJYSw5QIIcQdM+rAwam0a/HYemTPoIymu3uny6VLl+54mfPnzz+wbcn2bGN7j/K+yfb+QsKpVD2yASWEEOLhJgElhBDCJklACSGEsEkSUEIIIWySBJQQQgibJAElhBDCJklACSGEsEkSUEIIIWySBJQQQgibJAElhBDCJklACSGEsEkSUEIIIWySBJQQQgibpFIURSntSvwbjh8/Qf369Uq7GkKIR0R8fDx169Yt7Wo8Vh7Z90HZ2amoOmVDaVdDCPGI+HVYtdKuwmNHLvEJIYSwSRJQQgghbJIElBBCCJskASWEEMImSUAJIYSwSRJQQgghbJIElBBCCJv0yD4HJYR4eCgWM/lHtpAX919M2Rex07riXKctZYMHoC7jVmJeU24mOb99hy75KJaiXBzKVaZsk3DK1A3BlJPO1T0r0KWcwKLLx7F8VVyb9qJM7daYsi+Ss2cF+osnsegLcPSpRtlmfXD0rkb21s8wZqeU2I7KXkPZ5n1xqd/hQTaF+AsJKC
FEqcv6dR4Fx7bRtGlT2g/tR2JiIj///DMFp6LxHRZpDSnjlVTSv5mAk52Fp3r3xsfHh82bN3Mi6iOcjm5Dn3oKF42agb16Ua5cOTZu3Mipn2ejrdYUXfJR3Jy19O8TjoeHBxs2bCBh7TsAODs7MyA8vESdTp48yR+bF+Ncpy0qtRwqS4O0uhCiVOlSTlBwbBtvvPEGM2fOJDk5GV9fXw4fPkxwcDC5+3/Eo+MIAK5s/wI3rQMxMTF4enqSkJBAZGQko0ePZtGiRXh7exMbG4tWq+XChQt8/PHHPPvssyxbtgx/f39iYmJQqVSkp6cTGRnJoEGD+P777ylXrhwrV64sUa+PPvqIw6++imI2SkCVErkHJYQoVQVHt+Lp6cnrr79ObGwsVapUYdq0aTRt2pSnn36a/CObURQFRVHQnT9CREQE1apVY8yYMTRt2pT4+HhmzpyJk5MTI0aMwN/fn+eee46goCCSkpJ47733sLe3Z+TIkfj4+BAREUFgYCCXL1/mgw8+KFGXF154Aa1Wi1ar5fXXXwdA5aApjWYRyBmUEKKUGbMvEtSgAWXKlOH3339HURQOHDgAQPPmzfnuu++wFOVi5+SCYjbi7OwMgEqlsv709PSkWrVqJcqu/atQoQL+/v7XlQFUqVKF8uXLW+vy4Ycf8u677xIbG8v48eM5efIklqK86+6DWetuNJKSkoJOp/t3Gucx4OTkhL+/Pw4ODteVSUAJIUqVxVCIu3t1AOuB/tpPd3d3AK5Gf4e9WwUAVq1axWuvvcaCBQuYOHEiderUAcDT05MVK1Ywfvx4li5dyttvv02VKlUA8PDw4Ntvv2XUqFEsX76cjIwMvL29rWVXr17lww8/JC4ujsaNGzN58mRWr17NE088QeGpaFwDu9+w7ikpKbi6ulK1alVr6InbpygKWVlZpKSkEBAQcF25BJQQolTZu3iRnJwMQLly5Ur8TEkp7lmXd+jPNxMkJiZSv359evfuDUB4eDidO3cmPj6erKws6tevT3h4OEajkYEDB9KyZUvOnj1LXl4eDRo0oEePHhQVFTF8+HDq16/P+fPn0ev1TJ48GYCVK1fSr18/GjVqhEajwZSTftO663Q6Cad7oFKp8PLy4tKlSzzgE1AAACAASURBVDcsl3tQQohS5ehbiz/++INTp07RrVs3wsLCGDlyJADff/89AAcPHuTcuXNA8SWhkJAQdu/eTVFREZ06dWL9+vVkZWVRtmxZgoOD2blzJwBt27ZlzZo15OXl4eXlRePGjdm+fTtarZbg4GBWrVqFXq+nb9++TJo0iZYtWzJy5EgCAgI4c+YMer0eddnyN6z3NRJO9+af2k/OoIQQpco1qAe5B38iIiKCzz77jI0bN3Lp0iXGjh3LkSNHrptfrVYzf/58PD09MRgMrF69mnHjxgHg6OjIp59+iru7Ozqdjm+++YZXXnkFAK1Wy5dffknZsmUpLCzks88+Y+LEiQAUFhYybtw4PvzwQwBiYmJ44YUXUDlqKVOr1QNqCfF3ElBCiFKlLuNGufBXiYn6iKCgINzc3MjPz8dsNqOtGUzRmQM0a9bMOn9BQQFeXl54eHhQWFiIXq/HwTsAj479uLz9Czw9PXF3d6egoACDwYBjhZq4B/YlZecy3N3dcXd3Jz8/H6PRiMavLq61q7Fp0wYqVaqEq6srBoMBvV6PnZML5XtPwd7V67b3RWc04+Sgvm9tc7/X97CRgBJClLoyNZrj/9JXFBzfiTE7BRcnV5zrtsXBqxKGS0nokg6DoqCp1ACV2p6ic7GY8i6jcXDCrcoTOFV9ApXKDk2lBuiS4jDlZeHkqMWjaiCayg1RqVQ4VWqI7vwfmPKz0WrK4BkQhMa/PiqVCtfAnhSdP4zpShoaewdcPP0oU7s1dhrnO9oPJwf1fX2Td9LsHrecJyUlhRdffJH169eXmP7JJ5/QrFkzWrVqxbJly3j66afRarX3pV5bt26latWq1KhR46bzREREMH
nyZBo2bHjX25GAEkLYBDuNM65B1x+QHctXxbF81ZLTvK/v8QWgqVADTYUbHzQ1FWujqVj7hmUO5SrhUK7SnVXYxl277AnwzTff0KtXr/saUO3bt//HgLofJKCEEOIhZzabeeONN4iLi8PHx4dFixYxffp02rdvT2ZmJpmZmQwbNgx3d3eWLVvG//3f/3Hs2DFUKhX9+/fnP//5zw3X+/3337N69WqMRiNVqlThgw8+ID4+nu3bt/P777+zePFi5s+fT+XKlW9aN4vFwtSpU/Hx8bHeD7xdElBCCPGQO3/+PHPnzuWdd95h3LhxbNq0yVo2dOhQli1bxtdff42npyfHjh0jIyPDekkwNzf3puvt3LkzTz31FACRkZH8+OOPRERE0LFjR9q3b0/Xrl3/sV5ms5lJkyZRs2ZNRo0adcf7Jd3MhRDiIefv70/dunUBqF+/PhcvXrzpvJUqVSI5OZmZM2eye/duXFxcbjrvmTNneOaZZwgPDycqKoozZ87cUb3eeuutuw4nkIASQoiHnqOjo/V3tVqN2Wy+6bxubm78/PPPNG/enFWrVvF///d/N513ypQpvPXWW0RFRTFmzBgMBsMd1SswMJADBw6g1+vvaLlr5BKfEELcJzqj+bZ63t3J+u5HN3NnZ2cKCgrw9PQkOzsbR0dHwsLCCAgI4NVXX73pcgUFBZQvXx6j0UhUVBQ+Pj4l1ncrAwYMICYmhnHjxrFgwQLs7e8scuQMSggh7pP7/czS/VrfU089xXPPPUdERASZmZlERETQu3dvXn31VSZMmHDT5caNG8eTTz7JoEGDqFatmnV69+7dWbp0KX369OHChQv/uO1nn32WevXqMXnyZCwWyx3VW6UoinKrmdLT03n77bdJSEjAYrHQvn17Jk+eXOK08q9yc3OJiopi8ODBAGRkZDBr1izmzZt3R5UDKCoqYty4cVy4cAG1Wk2HDh2YNGnSLZeLj4+n29fn7nh7QghxI78Oq2a9z3NNfHz8ddPEnbtZO97yDEpRFMaMGUNoaCibN29m06ZNFBYWEhkZedNlcnNzS7z8y8fH567C6Zrhw4ezceNGfvrpJw4dOsSuXbvuel1CCCEeDre8ILh//340Gg39+/cHim/ATZ06lU6dOuHv78+ePXvIz88nIyODXr16MWbMGObMmcOFCxfo3bs3rVq1YvDgwdYnnfV6PdOnT+fYsWOo1WqmTJlCcHAwa9euZfv27RQVFZGcnExoaCiTJ0+2DuoIxTcC69WrR0ZGxr/bKkII8Rh5++23OXToUIlpQ4cOtR73/61lb+WWAXXmzBnq169fYpqLiwu+vr6YzWaOHj1KVFQUWq2WAQMGEBISwsSJEzlz5gw///wz8OeQ+QArVqwAICoqioSEBEaMGGHtsx8fH8+6detwdHSka9euRERE4Ovra102NzeXHTt2MGzYsFvumKJY7uvNSiHEY8ioAwcnoPj49KiaNm1aqSx7K/fci69Vq1Z4eHgAxQ91xcbGEhoaetP5Y2NjGTJkCADVq1enYsWKJCYmAtCyZUtcXV2tZRcvXrQGlMlkYsKECURERFCp0q2HJFGp7GD6jd+CKYQQt2X61dKuwWPtlvegatSowfHjx0tMy8/PJy0tDbVafd27PO7l3Sj/1Jf/zTffpGrVqjcdkkMIIcSj5ZYB1bJlS4qKili3bh1QPHTF7Nmz6du3L1qtlujoaHJyctDpdGzdupWgoKB/7CPftGlToqKigOI3Y6alpZXovngjkZGR5OfnM3Xq1DvdPyGEEA+pWwaUSqVi4cKFbNy4kS5duhAWFoZGo7H2nW/UqBFjx46lV69ehIWF0bBhQzw8PAgKCqJnz568//77Jdb3zDPPoCgK4eHhvPLKK7z33ns37a4OxV3cP/30U86ePUvfvn3p3bs3P/zwwz3uthBC/AuMOtte30Pmtp6Dupm1a9dy7Ngx3nrrrftZp/siPj6euquDS7saQoiH2V/uQd3oWZ0bPr9zP+
9929A9MJPJdMcjQdyumz0HJUMdCSHEQ+6ll14iPT0dvV7P0KFDefrpp9m9ezeRkZGYzWY8PDz4+uuvKSgo4J133uHYsWMAjBkzhrCwMAIDA4mLiwNg48aN7Ny5k9mzZzNlyhQcHR2Jj48nKCiIHj16MGvWLPR6PU5OTrz77rtUq1YNs9nMRx99xG+//YZKpeKpp56iRo0afPvttyxatAiA6OhovvvuOxYuXHjb+3VPAdWvXz/69et3L6sQQghxj959913c3d3R6XQMGDCATp068eabb7J8+XIqVapETk4OAIsWLcLFxcXaD+Dq1VufoWVkZLBq1SrUajX5+fmsWLECe3t79u7dS2RkJPPnz2f16tVcvHiRdevWYW9vT05ODm5ubrz99ttkZ2fj6enJ2rVr7/jZKDmDEkKIh9y3337Lli1bAEhLS2P16tU0bdrU+kiOu7s7APv27WPu3LnW5dzcbn05smvXrqjVxWMC5uXl8dprr3H+/HlUKhVGo9G63oEDB1ovAV7bXu/evfnll1/o168fcXFx1/VJuBUJKCFEqTl40cyKo0YyCyxU87BjRKAjAR7X990qMCgsP2LkUJoZowVCqqh5uoEDTvYqMvItLD9iJP6yBYMZqrqrGNTAgbrl1aTmFZedumzBpECAu4pnGjpQqawdUadN7E8xk5ZvoayjiiBfNUMaOeDsePePypSGAwcOsHfvXlavXo1WqyUiIoK6dety7tzdjUX691dj/PU18Z988gktWrRg4cKFpKSkMHTo0H9cV79+/Rg1apR18IU7vYclASWEeOAUReGlDTo+jTXi7OxMxYoV+XF/ErP35LOguxMvNv2zZ++py2a6LC/kwlWFcuXKYW9vz1eH05mxW8/MDk6M2lBEnkGFr68vjo6OrIq/yKzfChjUwIGfTxkpMKrw8/NDrVbz3fGLvLO7APP/uoZptVr8/auQk5HD54cu8e4ePTuHOd8wJG1VXl4ebm5uaLVaEhISOHz4MHq9npiYGJKTk62X+Nzd3WnVqhUrVqywvgPq6tWruLm5Ua5cORISEggICGDr1q04OzvfdFvXXrnx008/Wae3atWK1atX06JFC+slPnd3d3x8fPD29mbx4sUsW7bsjvdNAkoI8cD9cMLEp7FGJk2axLRp03BxcSE1NZUXX3yRMeuj6BigppaXGkVReGZtEXonb/ZsWEPr1q0B2LJlCwMHDmTw2mxq1KjBwQ0bqFWrFgCXL19m2LBhrPjvf6lXrx5RUVHWZy0zMjIYPHgw27ZtY/bs2UyaNMl6+WrPnj306tWLF9bnsiXixgfoWzLq7m/Pu78MtXQz7dq1Y9WqVXTr1o2AgAAaN26Mp6cnM2bMYOzYsVgsFry8vPjqq68YNWoUM2bMoGfPntjZ2TFmzBi6dOnCxIkTGTlyJJ6enjRo0IDCwsIbbuu5555jypQpLF68mJCQEOv0J598kqSkJHr16oW9vT1PPfWUdcSg8PBwsrOzqV69+h3v/j11M7dl0s1cCNvV4ot8CtzrcuzYMX799VemTp3Kjz/+iKurKwEBAfynnpGFPbScuGSm/qICPv/8c55//nlCQ0Oxs7Nj8+bNfPLJJ4wfP57Zs2fz2muvERERwcGDBzl58iS7d+8mJCSE+fPnM2bMGPr168e5c+c4fPgwv/76K927d2f8+PEkJSVx4sQJ3n77bQYOHMibb77Je7PeIX+qK072qrvrZi5KmDFjBnXr1uXJJ5+86Tx3/boNIYS4nyyKwuF0Cz179gSKB5A+fPgwUVFReHt7ExwcTFx68YvtUnKL/36uU6cOUHwzft++fUDxX+aAdSi26tWrWwe2vtaN+trPmjVrXlf28ccfs27dOk6fPs2vv/4KgL29PWYFLI/kn+0PXr9+/Th16hS9e/e+q+XlEp8Q4oHKLlIwmMHPzw+AK1eulPjp5+fHpgMWzl2xoP3fESo6Opq2bduyaNEi63if15Zfu3Ytzz//PNOnTwfgwoUL1p
5qq1atYvjw4dbeYwkJCcyfPx+A5wId+CLOSEBAANOnTyc5OZlFixYRUkVNGYeHq6OErVq7du09LS9nUEKIB8pNo8JOVXyvCKBMmTIA1hvzly9fJrNAofq8fNotK74XMn36dGbOnEn16tXRaDRkZWWRnp4OwIcffkjbtm3p0KEDFStWxNXV1Toc2rx582jevDnNmzenSpUqVKhQge+++w6AL+KMNGnShH379mE2m2nXrh2ZmZk809DhjvbnEb1L8sD8U/tJQAkhHigHtYpqHnZER0cD0L59e1QqFW3btkWn0xEbG4u7uztff/01L730EgBOTk7MmDGDtm3b8sUXX+Dl5WV939y1M6mkpCQyMzMpLCykYsWK15Wlp6ej0+msZV26dGHnzp3k5eUxaNAgFKW4l+CrW3QUGG4vdJycnMjKypKQukuKopCVlYWT0407gsglPiHEAzeyiQOvbtnGmjVrGD16NMOGDcPFxYX/+7//IzMzE19fX4YOHYqjoyOLFi2iY8eOLFu2jOzsbKpWrcqhQ4eYOXMmAEuWLKFbt27ExMSQn5+Pn5+f9XLf559/TkhICMePH0en0+Hl5WW93Dd8+HBcXFyKewEePAjA8uXLiYiIIC1foYbnrS/z+fv7k5KSwqVLl/6dhnoMODk54e/vf8My6cUnhHjgiowKIcsKOJhqoWXLltSqVYv9+/dz6tQpoPjdcK1atSIzM5MTJ07g7OxM+/bt8fHx4ezZs+zevRs/VxVP1Xcgcr8BPz8/goOD0Wg0/PHHHyXeYVe5cmWaN2+Og4MDcXFxnDx5EoB69erh7e1dol4ZGRkknoknbaIr7k637sUn/l0SUEKIUlFkVPgs1vC/kSQUqnnYMbKJIx0D1EzboefEZQuOauhWw54zWRb2ppjJLlLwdlYxoK4DI5s64qlVse2ciQUHDZy8bMFgVghwt2NwQweGNXZg01kTi2OMnM6yYLIUb2PoE8X3mL7+w4jBXLJOGjUMD3RkYIP/3YeSgCpVElBCCHEzElClSjpJCCGEsEkSUEIIIWySBJQQQgibJAElhBDCJklACSGEsEkSUEIIIWySBJQQQgibJAElhBDCJklACSGEsEmP7GCxisVyf1+9LIR4/NzGK9fFv+eRPYMymkx3tdzdjEp8/vz5B7Yt2Z5tbO9R3jfZ3l9IOJWqRzaghBBCPNwkoIQQQtgkCSghhBA2SQJKCCGETZKAEkIIYZMkoIQQQtgkCSghhBA2SQJKCCGETZKAEkIIYZMkoIQQQtgkCSghhBA2SQJKCCGETZKAEkIIYZMkoIQQQtgklaIoSmlX4t9w/PgJ6tevV9rVEEI8IuLj46lbt25pV+Ox8si+sNDOTkXVKRtKuxpCiEfEr8OqlXYVHjtyiU8IIYRNkoASQghhkySghBBC2CQJKCGEEDZJAkoIIYRNkoASQghhkx7ZbuZCiIePxVCE6WoGdpoyqF3Lo1KpbjifoihYinIxF+RgX7Y8dpoyJcsKczAX5mLv5oOdo1PJsoIczEW52Lv7YOfwZ5lFl48p7zKo7LB3LVdinaJ0SEAJIUqdRV9Izu6vyT+yFcWkB8CxQg3c2z+LtsoTJebVJR8je+tnGDMTiyeo7XGp3xH39s9iyEjgyrbPMV6+AIDK3hHnBp3waP8f9CnxZG9fgin7YnGZgwaXRl1wbtCJ7E0LMKSf/ctWVGirN8Wz8yjs3bz/9f0XN/bIjiQRHx9Pt6/PlXY1hBC3oCgKmd+/hSnlKP/5z38ICwsjNTWVBQsWcCbhHBUGf4CmYm0A9OlnSf92EtUDqvDSSy/h7+/P7t27Wbp0KTqdDlBRu3YtRo0aRYUKFdi+fTtfffUVRqMRUNGgQX1eeOEFvL292bRpE99++y0mk4myZcsydepUqlevjslk4vjx4yxcuJA8VRkqDl+ASu3Ar8OqyUgSD5jcgxJClCrd+T/QJcUxd+5clixZgoODA/379ycmJoYK3uXJ+W25dd6rvy2nYg
VvYmJiCA8PJzU1lcjISJYsWQJAQEBVYmJi6Ny5M5mZmSxatIj58+cDUKdObQ4ePEi7du24fPkyX375JR988AEAXl5etGvXjpycHAICApg5cyZffvklpuyL6JKPP+gmEf8jASWEKFWFp/fh5ubGyJEj+f333+nTpw9jx46lbNmyjBw5El1SHBZ9IQD69DP07NkTd3d3Zs+ezYTJr7Njxw6GDBmCv78/ffr0wcXFhRkzZjBu4mT279/PiBEjKFeuHAMGDMDJyYk33niDsa9MJC4ujpdeeglXV1cSExNp1aoVzz//PN27dwegatWqQPG9KVE6JKCEEKXKdDWdGjVq4OjoyOnTpwE4e7b4flC9esUDPptyMwFQ2dlTWFgcVjVr1sTRTqFy5coA1KlTp0SZ1sEOPz8/7O3tqVmzZokyZ40Dvr6+aDQaqlUrHmOvatWqrFixgl27dnHp0iXGjx8PYL28KB48CSghROmyWHBwcACK70f99ee16ZfWzebSutmY87P48ccf2bt3L1OmTKGgoMB6puPg4MDy5cuJiYlh5syZ5ObmUqFCBQAcHR358ssvOXr0KHPnziUnJwd3d/cS2zCbzeTk5JCdnU358uUZNmwYAPrUUw+mHcR1JKCEEKXKvmx5zp0r7tDk7+8PgJ+fHwAJCQnF08vaU8UuCwCdTkfbtm0JDAykZcuW/PLLL5hMJn7//XcKCgoIDg4mKCiIFi1asHXrVvR6PbGxseTk5BAUFETTpk1p2rQpe/fuJS8vj6NHjwKQnJzM6NGjCQkJITU1lWHDhqFWq//sLSgeOOlmLoQoVU5VA8k8spk1a9bQp08fZs2aRVhYGGaz2dr5Yfny5QQHB2NvX3zIWrBgAYcOHaJhw4Y89dRTfPPNN2RlZeHo6EhkZCRxcXEEBQXRs2dPFi9eTH5+PmXLlmXmzJkcPXqU4OBgOnbsyJw5c9Dr9Tz77LM0b96ckydPUrduXSpWrMjhw4cxm82oy5YvzeZ5rElACSFKVZnarXAoX5Xnn3+eCxcu0LdvX1JTU+nWrZv1ntTvv/9Obm6udRlfX18mTpxIfn4+kyZNYt68eQBYLBYqV65Mp06duHr1KuPHj2fBggUAmEwmatasSVhYGFeuXOGll17is88+AyApKYnBgwfTsWNHrl69yuLFi3nvvfewK+NGmdqtH3CLiGvkOSghRKkz5WZyecPH6C8csU6zc3LFrfUgcg+uw/y/ThI3ZKfGuW47XAN7cGndu5jzs/9SZo9z/Q64NArl0k/vYim8+meZ2gGXhqHYu3kXd2W3mEus1tGnOl7dx+HoXdyJQp6DevAkoIQQNsNwKQnjpfPYaZzRVG6AnYMTitmEISMBFAsO5SqjUjuiTz+DOfcSKrUDGr+6qF08AFBMBvRppzHnXUZlrykucy7uDGEx6jCkny0uc3BC418PtbZscZkuH31GAub8bOw0ZbB3q1C8rb8MtSQB9eDJJT4hhM1wLF8Vx/JVS0xTqe2v6+rt5F/vhsur7B1xqtTghmV2Dk43L3NyuW5IJVH6pBefEEIImyQBJYQQwiZJQAkhhLBJElBCCCFskgSUEEIImyQBJYQQwibdVkClp6czatQounTpQmhoKO+88w4Gg+Gm8+fm5rJixQrr54yMDF5++eW7ruSIESPo1asXPXr04K233sJsNt96ISGEEA+1WwaUoiiMGTOG0NBQNm/ezKZNmygsLCQyMvKmy+Tm5rJy5UrrZx8fH+tQJHfjk08+4ZdffmH9+vVcuXKFjRs33vW6hBBCPBxu+aDu/v370Wg09O/fHwC1Ws3UqVPp1KkT/v7+7Nmzh/z8fDIyMujVqxdjxoxhzpw5XLhwgd69e9OqVSsGDx7Miy++yPr169Hr9UyfPp1jx46hVquZMmUKwcHBrF27lu3bt1NUVERycjKhoaFMnjwZABcXF6B4LC2j0Vji6W4hhBCPplsG1JkzZ6hfv36JaS4uLvj6+mI2mzl69ChRUVFotV
oGDBhASEgIEydO5MyZM/z8888ApKSkWJe9dukvKiqKhIQERowYwaZNm4Di4YnWrVuHo6MjXbt2JSIiAl9fX6D4Mt+RI0do164dYWFht9wxRbGQNLvHbTaDEELchFEHDk7Ex8eXdk0eO/c81FGrVq3w8CgeB6tz587ExsYSGhp60/ljY2MZMmQIANWrV6dixYokJha/b6Vly5a4urpayy5evGgNqKVLl6LX65k0aRL79++ndet/HmFYpbKD6W73untCiMfd9Ku3nkf8K255D6pGjRocP368xLT8/HzS0tJQq9XXXW67l8tvjo6O1t/VavV1nSE0Gg2dOnVi27Ztd70NIYQQD4dbBlTLli0pKipi3bp1QPFrkWfPnk3fvn3RarVER0eTk5ODTqdj69atBAUF4ezsTEFBwQ3X17RpU6KiogBITEwkLS2NatWq3XT7BQUFZGYWD7VvMpnYuXPnP84vhBDi0XDLgFKpVCxcuJCNGzfSpUsXwsLC0Gg0TJgwAYBGjRoxduxYevXqRVhYGA0bNsTDw8P6Nsv333+/xPqeeeYZFEUhPDycV155hffee6/EmdPfFRUVMWrUKMLDw+nTpw9eXl4MHDjwHndbCCGErbun90GtXbuWY8eO8dZbb93POt0X8fHx1F0dXNrVEEI87P53Dyo+Pl7eB/WAyUgSQgghbNI99eLr168f/fr1u191EUIIIazkDEoIIYRNkle+CyFKzf4UE5H7DcSmmtE6qOhZ057xwY74uJT821lnUph/wMD6MyYu5lqo6m7Hc0GOPFXfnitFCnP2GdiWaOJSgYKnVkW7KvZMauWIq6OKufsMbD5nIrNAoaanHS82daBbDXsWHjSw4YyJxCsWAPzL2tGnjj2jmjqisZfRamzBPXWSsGXSSUII2/bFIQPPR+nw8vKiS5cuXLlyhS1btlDOyUL0cGeqexaHVJFRIWRZAQdTLbRo0YJq1aoRGxvL6dOn6VbDntNZZpJy7WjXrh1+fn5kZmayc+dO7CwG7O2gwKiidevWVKpUib1793L+/HmcHaDACE2aNKFWrVqoVCpOnTpFbGwsnQLUbI4og921Zzqlk0SpkTMoIcQDl12k8MomHV26dOGnn37Czs4OJycnjh8/TuvWrZmyrYAfniwDwEd7DcSkKfzwww/069ePc+fOUaNGDd5++22mT5+OnZ0d0dF7CA4OJj4+ntq1a3Pq1CmaNGmCzmhk69ZNtG3bluTkZKpUqcL48eNZsGABWq2WmJgYsrKycHV1xdHRkSVLlvDCCy+wJcFMWA05PJY2uQclhHjgvjtqJN8Ac+bMwWQy4efnx5AhQ6hfvz5jxozhxxMmMguKL739etZEmzZtGDBgAJ988gk1a9Zkw4YNTJ06lUqVKuHv709wcDA//fQTv4wNZOnSpdStW5e6devSu3dvOnbsyLRp06hevToxMTG8++67uLm5odfrqVy5MuXKlcPf3x+dTmcdFPvEJXmljy2QgPr/9u4+KKp6DwP4s7yJXkHFF3CCYKbx6nAlF8vCVFReJFl2gVHMEmWKtJobFpNZNtjVqXFqrBmVRhT/8I6KzSCBjQplK23YhQm7I6JoLXYFAeXFi7YIusLu9/7BsOnNal08cLDn8w+we5bnnLO759mz5+z+iGjA/XDFhtGjR2Pq1Kk4deoU2tvb8fXXXwMAZs+eDQAw/7e3oH62CsaNGwcA6OjoANA7pI+npydmzJiBxsZGfPtt7x5U67S/Izo6GtXV1Th79qzjdhaLBUDv17T5+PggLCwMdrsdbW1tWLx4MV5++WV4e3tj3759AIDJ47hpVAPuwxLRgGu/8UvpdHV13fGz7/Id33ej/prAXQOYTCY0NzcjMzMT06ZNg06nc0xrt9thMpnw5JNPYsWKFQgMDMTOnTvR3d2N4uJiWCwWbNy4EU8//TSio6MBAOPHjwcAjBo1Clu2bMG4ceNw/fp1mM1mAEA3d6BUgS8TiGjATRzphkuXLgGAYz
SEvp99l+ed7kZq0Q2cbrXj6tWreOyxx/Dhhx/ixx9/dAyIajabMWvWLGRlZSE7OxuPT9diQ63ACAAACfRJREFUw4YNyMjIgE6nQ0NDA8LDw5GdnY3q6moUFRU5bgf0jvYdGBiICRMmoLW1FVu3boWfnx++/KlnQNcH3R0LiogGXPhEN3R1dcFoNEKr1WLu3LlIT08HAMc4ctnZ2TCbzfDz8wMAhIaGYteuXSgqKsK8efPQ0NCA48ePo+9E5KCgIAwb/hcEBwcDgOPySZMmYfv27TAajYiMjER1dTVqamowffp0GAwGhISEQKvVwtfXFyICq9UKb55mrgp8i4+IBlxKqCfWHbNi9erVyM/Ph8lkgs1mw969e7Fnzx4AQEBAACZNmgR3d3cAwM6dOx0jGZw8eRIrV66EzWZDRUUFdu3ahfT0dKSkpAAADhw4gJKSEgBAQUGBY1Tu8vJyPP/88wCAhx56CIWFhY7/f+XKFaxatQqdnZ3QTRoxcCuDfhM/B0VEg6KsvgeGT7vws7V3gFKLxYK2tjY86u+G6hY7vLy84Obmhps3bwIAPD09ERISgu7ubtTV1cF3GPDPxOHYVnkLpjobfH194e/vj7a2Nly7ds2R4+3tjYcffhhdXV1obGzE+BEaBPpqcLLZjhEjRiAoKAidnZ1obm5GT08P/jHXCxvmef8yo/wc1KBhQRHRoGm/Idh98ha+v2zDCA8NEv7qAcNkD5xptePA2W5024CpE9wwZZw7in7oxn+u2uHhpkFEoDtSH/XEaG8NbHbBIXMPSi/Y0Nppx9jhGsx+2AOLQz3w78s2FJ7rQf3Pdgxz1yAy2B1Lp3pihCdQdK4H/2qwoanDjuEeGoSM1mDJ3zwROt79zplkQQ0aFhQR0e9hQQ0aniRBRESqxIIiIiJVYkEREZEqsaCIiEiVWFBERKRKLCgiIlIlFhQREakSC4qIiFSJBUVERKr0wH5ZrNjtjk+AExG5rPsm4On9x9PRfffA7kF197g2nktbW9s936a+vn7AspinjrwHedmY939YToPmgS0oIiIa2lhQRESkSiwoIiJSJRYUERGpEguKiIhUiQVFRESqxIIiIiJVYkEREZEqsaCIiEiVNCIigz0TSqiqqsKwYcMGezaI6AFhtVqh1WoHezb+VB7YgiIioqGNb/EREZEqsaCIiEiVWFBERKRKLCgiIlIlFhQREakSC4qIiFRpSBdUWVkZ4uLiEBsbi9zc3F9df+vWLbz++uuIjY1FSkoKGhsbFc07ceIEkpOTERoaii+++KJfWc7k7d69G/Hx8dDr9UhLS0NTU5OieZ9++in0ej0SExPx7LPP4vz584rm9fnyyy8xefJknD59WrGswsJCREREIDExEYmJiThw4IDLWc7kAUBxcTHi4+Oh0+nwxhtvKJq3adMmx7LFxcXh8ccfVzTv0qVLWL58OZKSkqDX6/HNN98oltXU1IS0tDTo9XosX74czc3NLmcBwLp16zBz5kwkJCTc9XoRwfvvv4/Y2Fjo9XrU1NT0K49+hwxRPT09Eh0dLRcvXhSr1Sp6vV5qa2vvmGbfvn2yfv16ERE5fPiwvPbaa4rmNTQ0yLlz5+TNN9+UkpISl7OczauoqJCuri4REcnLy1N8+To6Ohy/G41GeeGFFxTN68t87rnnJCUlRaqrqxXL+uyzz2Tjxo0u/X9X8i5cuCCJiYly7do1ERG5cuWKonm327Nnj7z99tuK5mVlZUleXp6IiNTW1sr8+fMVy8rIyJDCwkIRESkvL5c1a9a4lNWnsrJSzpw5Izqd7q7Xm0wmSU9PF7vdLidPnpTFixf3K49+25Ddg6qurkZwcDCCgoLg5eUFnU6HY8eO3TFNaWkpkpOTAQBxcXGoqKiAuPi5ZGfyAgMDMWXKFLi59X+1OpMXERGB4cOHAwC0Wm2/Xjk6kzdy5EjH7zdu3IBGo1E0DwC2bt2KlStX9utbQZzNul+cycvPz8eyZcswatQoAMDYsW
MVzbvdkSNHfnPv4H7laTQaXL9+HQDQ0dGBCRMmKJb1008/ISIiAkDvc6K/9+2MGTMc98vdHDt2DElJSdBoNNBqtbBYLGhtbe1XJt3dkC2olpYWBAQEOP729/dHS0vLr6aZOHEiAMDDwwM+Pj64evWqYnn3073mFRQUIDIyUvG8vLw8xMTEYPPmzcjKylI0r6amBs3NzZg3b57LOc5mAcDRo0eh1+uxevVqXL58WdG8uro6XLhwAUuXLsWSJUtQVlamaF6fpqYmNDY2OjboSuW9+uqrOHToECIjI7Fq1SqXHyvOZE2ZMgVHjx4FAHz11Vfo7Ox0+XnuyjwFBAQoui34MxuyBUW/+Pzzz3HmzBm8+OKLimctW7YMRqMRa9asQU5OjmI5drsdH3zwAd566y3FMm43f/58lJaW4tChQ3jqqacUz7XZbKivr8fevXvx8ccfY/369bBYLIpmAr17T3FxcXB3d1c8Jzk5GWVlZcjNzcXatWtht9sVyVq7di1OnDiBpKQkVFZWwt/fX/Hlo4ExZAvK39//jre0Wlpa4O/v/6tp+l4J9/T0oKOjA2PGjFEs735yNq+8vBw7duxATk4OvLy8FM/ro9PpYDQaFcvr7OyE2WzGihUrEBUVhaqqKrzyyisunSjhzLKNGTPGsf5SUlL6deDb2cdmVFQUPD09ERQUhJCQENTV1SmW16e4uBg6nc6lnHvJKygowMKFCwEA4eHhsFqtLu3VOLsuP/nkExw8eBCZmZkAAF9f33vOcnWempubFd0W/JkN2YIKCwtDXV0dGhoacOvWLRw5cgRRUVF3TBMVFYWioiIAvWeCRUREuHzcxJm8+8mZvLNnz+Ldd99FTk5Ov45hOJt3+wbUZDIhODhYsTwfHx989913KC0tRWlpKbRaLXJychAWFqbIst1+DKG0tBSPPPKIYssGADExMaisrAQAtLe3o66uDkFBQYrlAb3HaiwWC8LDw13KuZe8iRMnoqKiwpFrtVrh5+enSFZ7e7tj7yw3NxeLFi1yccmcExUVhYMHD0JEUFVVBR8fH5ePsdEfGOyzNPrDZDLJggULJDo6WrZv3y4iIlu2bBGj0SgiIjdv3pSMjAyJiYmRRYsWycWLFxXNO3XqlMyZM0emTZsmTzzxhMTHxyual5aWJjNnzhSDwSAGg0FeeuklRfPee+89iY+PF4PBIKmpqWI2mxXNu11qaqrLZ/E5k/XRRx9JfHy86PV6SU1NlfPnz7uc5Uye3W6XTZs2ycKFCyUhIUEOHz6saJ6IyLZt22Tz5s39ynE2r7a2Vp555hnR6/ViMBjk+PHjimWVlJRIbGysLFiwQN555x2xWq39WrbMzEyZNWuWhIaGypw5cyQ/P1/2798v+/fvF5He+27Dhg0SHR0tCQkJ/Xpc0u/jcBtERKRKQ/YtPiIierCxoIiISJVYUEREpEosKCIiUiUWFBERqRILioiIVIkFRUREqvQ/iOUlr4XY7mYAAAAASUVORK5CYII=\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot learning curves for the different models\n",
"fig = plt.figure(figsize=(10,6))\n",
"sns.set_style(style='dark')\n",
"ax = sns.lineplot(x='epoch', y='loss',\n",
" style='type',\n",
" hue='model',\n",
" data=learning_curves)\n",
"ax.set_title('Learning Curves', fontdict={'fontsize': 16})"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"fig.savefig('./visualizations/custom_learning_curve.png')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: examples/titanic/multiple_model_training.py
================================================
#!/usr/bin/env python
# # Multiple Model Training Example
#
# This example trains multiple models and extracts training statistics
import logging
import shutil
# ## Import required libraries
from ludwig.api import LudwigModel
from ludwig.datasets import titanic
from ludwig.visualize import learning_curves
# clean out old results
shutil.rmtree("./results", ignore_errors=True)
shutil.rmtree("./visualizations", ignore_errors=True)
# list models to train
list_of_model_ids = ["model1", "model2"]
list_of_train_stats = []
training_set, _, _ = titanic.load(split=True)
# ## Train models
for model_id in list_of_model_ids:
    print(">>>> training: ", model_id)
    # Define the Ludwig model object that drives model training
    model = LudwigModel(config="./" + model_id + "_config.yaml", logging_level=logging.WARN)
    # initiate model training
    train_stats, _, _ = model.train(
        dataset=training_set, experiment_name="multiple_model_experiment", model_name=model_id
    )
    # save training stats for later use
    list_of_train_stats.append(train_stats)
    print(">>>>>>> completed: ", model_id, "\n")
# generate learning curves from training
learning_curves(
    list_of_train_stats,
    "Survived",
    model_names=list_of_model_ids,
    output_directory="./visualizations",
    file_format="png",
)
================================================
FILE: examples/titanic/simple_model_training.py
================================================
#!/usr/bin/env python
# # Simple Model Training Example
#
# This is the Python API counterpart of the Ludwig command-line Titanic example
# (https://ludwig-ai.github.io/ludwig-docs/latest/examples/titanic/).
# Import required libraries
import logging
import os
import shutil
import yaml
from ludwig.api import LudwigModel
from ludwig.datasets import titanic
# clean out prior results
shutil.rmtree("./results", ignore_errors=True)
# Download and prepare the dataset
training_set, test_set, _ = titanic.load(split=True)
config = yaml.safe_load("""
input_features:
  - name: Pclass
    type: category
  - name: Sex
    type: category
  - name: Age
    type: number
    preprocessing:
      missing_value_strategy: fill_with_mean
  - name: SibSp
    type: number
  - name: Parch
    type: number
  - name: Fare
    type: number
    preprocessing:
      missing_value_strategy: fill_with_mean
  - name: Embarked
    type: category
output_features:
  - name: Survived
    type: binary
""")
# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)
# initiate model training
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=training_set, experiment_name="simple_experiment", model_name="simple_model", skip_save_processed_input=True
)
# list contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)
# batch prediction
model.predict(test_set, skip_save_predictions=False)
================================================
FILE: examples/twitter_bots/README.md
================================================
# Twitter Bots Example
We'll be using the Twitter human-bots dataset, which is composed of 37,438 rows, each corresponding to a Twitter user
account. Each row contains 20 feature columns collected via the Twitter API. These features contain multiple data
modalities, including the account description and the profile image.
The target column `account_type` has two unique values: bot or human. 25,013 user accounts were annotated as human
accounts; the remaining 12,425 are bots.
### Preparatory Steps
Create and download your [Kaggle API Credentials](https://github.com/Kaggle/kaggle-api#api-credentials).
The Twitter Bots dataset is hosted by Kaggle, so Ludwig needs to authenticate you through the Kaggle API to download
the dataset.
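Before running the training scripts, it can help to confirm that Kaggle credentials are actually in place. The sketch below is not part of Ludwig; `kaggle_credentials_present` is a hypothetical helper that assumes the two conventions documented by the Kaggle API client: a `kaggle.json` file under `~/.kaggle`, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.

```python
import os
from pathlib import Path


def kaggle_credentials_present(home=None, env=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper for illustration; `home` and `env` can be overridden for testing.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Credentials supplied through the downloaded kaggle.json token file.
    return (home / ".kaggle" / "kaggle.json").is_file()


if __name__ == "__main__":
    if not kaggle_credentials_present():
        print("Kaggle credentials not found; see the Kaggle API credentials instructions.")
```

This only checks that credentials exist, not that they are valid; authentication itself still happens when the dataset is downloaded.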
### Examples
Run `python train_twitter_bots.py` to train a single model.
For a faster, more lightweight model run `python train_twitter_bots_text_only.py`, which does not use image features.
This will download the Twitter Bots dataset into the current
directory, train a model, and write results into the following directories:
```
./outputs/results/
    api_experiment_run/
./outputs/visualizations/
    confusion_matrix__account_type_top2.png
    confusion_matrix_entropy__account_type_top2.png
    learning_curves_account_type_accuracy.png
    learning_curves_account_type_loss.png
```
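To confirm that a run produced the plots listed above, a small check along these lines may be useful. It is only a sketch: `missing_plots` is a hypothetical helper, and the file names and the default directory are copied from the listing above rather than read from Ludwig.

```python
from pathlib import Path

# File names taken from the directory listing in this README.
EXPECTED_PLOTS = [
    "confusion_matrix__account_type_top2.png",
    "confusion_matrix_entropy__account_type_top2.png",
    "learning_curves_account_type_accuracy.png",
    "learning_curves_account_type_loss.png",
]


def missing_plots(visualizations_dir="outputs/visualizations"):
    """Return the expected plot files that are not present in the directory."""
    directory = Path(visualizations_dir)
    return [name for name in EXPECTED_PLOTS if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_plots()
    print("all plots present" if not missing else f"missing plots: {missing}")
```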
After training, the script will generate the following plots:



================================================
FILE: examples/twitter_bots/train_twitter_bots.py
================================================
#!/usr/bin/env python
"""Trains model on Twitter Bots dataset using default settings."""
import logging
import os
import shutil
import yaml
from ludwig import datasets
from ludwig.api import LudwigModel
from ludwig.utils.fs_utils import rename
from ludwig.visualize import confusion_matrix, learning_curves
if __name__ == "__main__":
    # Cleans out prior results
    results_dir = os.path.join("outputs", "results")
    visualizations_dir = os.path.join("outputs", "visualizations")
    shutil.rmtree(results_dir, ignore_errors=True)
    shutil.rmtree(visualizations_dir, ignore_errors=True)
    # Loads the dataset
    twitter_bots_dataset = datasets.get_dataset("twitter_bots", cache_dir="downloads")
    training_set, val_set, test_set = twitter_bots_dataset.load(split=True)
    # Moves profile images into local directory, so relative paths in the dataset will be resolved.
    if not os.path.exists("profile_images"):
        rename(os.path.join(twitter_bots_dataset.processed_dataset_dir, "profile_images"), "profile_images")
    config = yaml.safe_load("""
        input_features:
          - name: default_profile
            type: binary
          - name: default_profile_image
            type: binary
          - name: description
            type: text
          - name: favourites_count
            type: number
          - name: followers_count
            type: number
          - name: friends_count
            type: number
          - name: geo_enabled
            type: binary
          - name: lang
            type: category
          - name: location
            type: category
          - name: profile_background_image_path
            type: category
          - name: profile_image_path
            type: image
            preprocessing:
              num_channels: 3
          - name: statuses_count
            type: number
          - name: verified
            type: binary
          - name: average_tweets_per_day
            type: number
          - name: account_age_days
            type: number
        output_features:
          - name: account_type
            type: binary
        """)
    model = LudwigModel(config, logging_level=logging.INFO)
    train_stats, preprocessed_data, output_directory = model.train(dataset=training_set, output_directory=results_dir)
    # Generates predictions and performance statistics for the test set.
    test_stats, predictions, output_directory = model.evaluate(
        test_set, collect_predictions=True, collect_overall_stats=True, output_directory=results_dir
    )
    confusion_matrix(
        [test_stats],
        model.training_set_metadata,
        "account_type",
        top_n_classes=[2],
        model_names=[""],
        normalize=True,
        output_directory=visualizations_dir,
        file_format="png",
    )
    # Visualizes learning curves, which show how performance metrics changed over time during training.
    learning_curves(
        train_stats, output_feature_name="account_type", output_directory=visualizations_dir, file_format="png"
    )
================================================
FILE: examples/twitter_bots/train_twitter_bots_text_only.py
================================================
#!/usr/bin/env python
"""Trains a model on the Twitter Bots dataset using tabular and text features only, no images."""
import logging
import os
import shutil
import yaml
from ludwig.api import LudwigModel
from ludwig.datasets import twitter_bots
from ludwig.visualize import confusion_matrix, learning_curves
if __name__ == "__main__":
    # Cleans out prior results
    results_dir = os.path.join("outputs", "results")
    visualizations_dir = os.path.join("outputs", "visualizations")
    shutil.rmtree(results_dir, ignore_errors=True)
    shutil.rmtree(visualizations_dir, ignore_errors=True)
    # Loads the dataset
    training_set, val_set, test_set = twitter_bots.load(split=True)
    config = yaml.safe_load("""
        input_features:
          - name: created_at
            type: date
            column: created_at
          - name: default_profile
            type: binary
            column: default_profile
          - name: description
            type: text
            column: description
          - name: favourites_count
            type: number
            column: favourites_count
          - name: followers_count
            type: number
            column: followers_count
          - name: friends_count
            type: number
            column: friends_count
          - name: geo_enabled
            type: binary
            column: geo_enabled
          - name: lang
            type: category
            column: lang
          - name: location
            type: text
            column: location
          - name: screen_name
            type: text
            column: screen_name
          - name: statuses_count
            type: number
            column: statuses_count
          - name: verified
            type: binary
            column: verified
          - name: average_tweets_per_day
            type: number
            column: average_tweets_per_day
          - name: account_age_days
            type: number
            column: account_age_days
        output_features:
          - name: account_type
            type: category
            column: account_type
        trainer:
          batch_size: 16
        defaults:
          text:
            preprocessing:
              tokenizer: space_punct
              max_sequence_length: 16
        model_type: ecd
        """)
    model = LudwigModel(config, logging_level=logging.INFO)
    train_stats, preprocessed_data, output_directory = model.train(dataset=training_set, output_directory=results_dir)
    # Generates predictions and performance statistics for the test set.
    test_stats, predictions, output_directory = model.evaluate(
        test_set, collect_predictions=True, collect_overall_stats=True, output_directory=results_dir
    )
    confusion_matrix(
        [test_stats],
        model.training_set_metadata,
        "account_type",
        top_n_classes=[2],
        model_names=[""],
        normalize=True,
        output_directory=visualizations_dir,
        file_format="png",
    )
    # Visualizes learning curves, which show how performance metrics changed over time during training.
    learning_curves(
        train_stats, output_feature_name="account_type", output_directory=visualizations_dir, file_format="png"
    )
================================================
FILE: examples/wine_quality/README.md
================================================
# Ludwig Defaults Config Section Example
Demonstrates how to use Ludwig's defaults section introduced in v0.6.
### Preparatory Steps
- Create `data` directory
- Download the [Kaggle wine quality data set](https://www.kaggle.com/rajyellow46/wine-quality) into the `data` directory. The directory should
  look as follows:
```
wine_quality/
    data/
        winequalityN.csv
```
### Description
Jupyter notebook `model_defaults_example.ipynb` demonstrates how to use the defaults section of Ludwig.
Key features demonstrated in the notebook:
- Preparing the training data for use
- Programmatically creating a Ludwig config dictionary from the training dataframe
- Defining preprocessing, encoder, decoder and loss sub-sections under the defaults section
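
The approach listed above can be sketched without the notebook. This is a simplified illustration, not the notebook's code: `build_config` is a hypothetical helper, and a plain `{column: dtype name}` mapping stands in for the pandas dataframe so that the example has no dependencies. The `defaults` values mirror a subset of those used in the notebook.

```python
def build_config(column_dtypes, target):
    """Build a Ludwig-style config dict from a {column: dtype-name} mapping.

    Columns with dtype "object" become category features, all others number
    features; the target column becomes a category output feature.
    """
    config = {"input_features": [], "output_features": []}
    for column, dtype in column_dtypes.items():
        name = column.replace(" ", "_")
        if column == target:
            config["output_features"].append({"name": name, "type": "category"})
        else:
            feature_type = "category" if dtype == "object" else "number"
            config["input_features"].append({"name": name, "type": feature_type})
    # Type-level defaults apply to every feature of that type unless overridden.
    config["defaults"] = {
        "number": {
            "preprocessing": {"missing_value_strategy": "fill_with_mean", "normalization": "zscore"},
            "encoder": {"type": "dense", "num_layers": 2},
        },
        "category": {"encoder": {"type": "sparse"}},
    }
    return config


if __name__ == "__main__":
    from pprint import pprint

    pprint(build_config({"type": "object", "fixed acidity": "float64", "quality": "object"}, target="quality"))
```

Setting the preprocessing and encoder once under `defaults` avoids repeating them on each of the numerical input features.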
================================================
FILE: examples/wine_quality/model_defaults_example.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd \n",
"import numpy as np\n",
"\n",
"import os\n",
"\n",
"import shutil\n",
"from pprint import pprint\n",
"import logging\n",
"\n",
"from ludwig.api import LudwigModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Receive data for training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.read_csv('./data/winequalityN.csv')\n",
"train_df['quality'] = train_df['quality'].apply(str)\n",
"train_df.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace white space in column names with underscore\n",
"new_col = []\n",
"for i in range(len(train_df.columns)):\n",
" new_col.append(train_df.columns[i].replace(' ', '_'))\n",
" \n",
"train_df.columns = new_col"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.describe().T"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df['quality'].value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cols = list(set(train_df.columns) - set(['quality']))\n",
"features = train_df[cols]\n",
"\n",
"#extract categorical features\n",
"categorical_features = []\n",
"for p in features:\n",
" if train_df[p].dtype == 'object':\n",
" categorical_features.append(p)\n",
" \n",
"print(\"categorical features:\", categorical_features, '\\n')\n",
"\n",
"# get numerical features\n",
"numerical_features = list(set(features) - set(categorical_features))\n",
"\n",
"print(\"numerical features:\", numerical_features, \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for feature in categorical_features:\n",
" print(f\"# of distinct values in categorical feature '{feature}' : {train_df[feature].nunique()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Ludwig Config"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# template for config\n",
"config = {'input_features':[], 'output_features': [], 'trainer':{}}\n",
"\n",
"# setup input features for categorical features\n",
"for p in categorical_features:\n",
" a_feature = {\n",
" 'name': p.replace(' ','_'), \n",
" 'type': 'category'\n",
" }\n",
" config['input_features'].append(a_feature)\n",
"\n",
"# setup input features for numerical features\n",
"for p in numerical_features:\n",
" a_feature = {\n",
" 'name': p.replace(' ', '_'), \n",
" 'type': 'number'\n",
" }\n",
" config['input_features'].append(a_feature)\n",
"\n",
"# set up output variable\n",
"config['output_features'].append({'name': 'quality', 'type':'category'})\n",
"\n",
"# set default preprocessing and encoder for numerical features\n",
"config['defaults'] = {\n",
" 'number': {\n",
" 'preprocessing': {\n",
" 'missing_value_strategy': 'fill_with_mean', \n",
" 'normalization': 'zscore'\n",
" },\n",
" 'encoder': {\n",
" 'type': 'dense',\n",
" 'num_layers': 2\n",
" },\n",
" },\n",
" 'category': {\n",
" 'encoder': {\n",
" 'type': 'sparse'\n",
" },\n",
" 'decoder': {\n",
" 'top_k': 2\n",
" },\n",
" 'loss': {\n",
" 'confidence_penalty': 0.1 \n",
" }\n",
" }\n",
"}\n",
"\n",
"# set up trainer\n",
"config['trainer'] = {'epochs': 5}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(config, indent=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize and Train LudwigModel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LudwigModel(config, backend = 'local', logging_level = logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspecting Config After Model Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(model.config['input_features'], indent=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(model.config['output_features'], indent=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_stats, train_stats, _, _ = model.experiment(\n",
" dataset = train_df,\n",
" experiment_name = 'wine_quality'\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" shutil.rmtree('./results')\n",
" items = os.listdir('./')\n",
" for item in items:\n",
" if item.endswith(\".hdf5\") or item.endswith(\".json\") or item == '.lock_preprocessing':\n",
" os.remove(os.path.join('./', item))\n",
"except Exception as e:\n",
" pass "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.13 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: examples/wmt15/config_large.yaml
================================================
input_features:
  - name: en
    type: text
    encoder: bert
    pretrained_model_name_or_path: bert-base-uncased
output_features:
  - name: fr
    type: text
    tokenizer: french_tokenize
================================================
FILE: examples/wmt15/config_small.yaml
================================================
input_features:
  - name: en
    type: text
    encoder: embed
output_features:
  - name: fr
    type: text
================================================
FILE: examples/wmt15/train_nmt.py
================================================
"""Sample ludwig training code for training an NMT model (en -> fr) on WMT15 (https://www.statmt.org/wmt15/).
The dataset is rather large (8GB), which can take several minutes to preprocess.
"""
import logging
import shutil
from ludwig.api import LudwigModel
from ludwig.datasets import wmt15
# clean out prior results
shutil.rmtree("./results", ignore_errors=True)
# Download and prepare the dataset
training_set = wmt15.load()
model = LudwigModel(config="./config_small.yaml", logging_level=logging.INFO)
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(dataset=training_set, experiment_name="simple_experiment", model_name="simple_model")
================================================
FILE: ludwig/__init__.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import sys
from ludwig.globals import LUDWIG_VERSION as __version__ # noqa
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format="%(message)s")
# Disable annoying message about NUMEXPR_MAX_THREADS
logging.getLogger("numexpr").setLevel(logging.WARNING)
================================================
FILE: ludwig/accounting/__init__.py
================================================
================================================
FILE: ludwig/accounting/used_tokens.py
================================================
import torch
def get_used_tokens_for_ecd(inputs: dict[str, torch.Tensor], targets: dict[str, torch.Tensor]) -> int:
"""Returns the number of used tokens for an ECD model.
The number of used tokens is the total size of the input and output tensors, which corresponds to 1 token for
binary, category, and number features, and variable number of tokens for text features, for each example in the
batch.
Args:
inputs: The input tensors for one forward pass through ecd.
targets: The target tensors for one forward pass through ecd.
"""
used_tokens = 0
for input_feature_tensor in inputs.values():
used_tokens += torch.flatten(input_feature_tensor).shape[0]
if targets is not None:
# targets may be None for evaluation.
for output_feature_tensor in targets.values():
used_tokens += torch.flatten(output_feature_tensor).shape[0]
return used_tokens
def get_used_tokens_for_llm(model_inputs: torch.Tensor, tokenizer) -> int:
"""Returns the number of used tokens for an LLM model.
Args:
model_inputs: torch.Tensor with the merged input and target IDs.
tokenizer: The tokenizer used to encode the inputs.
Returns:
The total number of non-pad tokens, for all examples in the batch.
"""
return torch.sum(model_inputs != tokenizer.pad_token_id).item()
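
The two counting rules above can be illustrated without PyTorch. This is a minimal sketch, assuming plain nested Python lists stand in for tensors. The helper names `count_elements`, `count_used_tokens_ecd`, and `count_non_pad_tokens` are hypothetical and not part of Ludwig.

```python
from typing import Optional


def count_elements(nested) -> int:
    """Count scalar leaves in a nested list (the analogue of flattening a tensor)."""
    if isinstance(nested, list):
        return sum(count_elements(item) for item in nested)
    return 1


def count_used_tokens_ecd(inputs: dict, targets: Optional[dict] = None) -> int:
    """ECD-style rule: total size of all input and target values, padding included."""
    total = sum(count_elements(value) for value in inputs.values())
    if targets is not None:
        # Targets may be absent, for example during evaluation.
        total += sum(count_elements(value) for value in targets.values())
    return total


def count_non_pad_tokens(batch: list, pad_token_id: int) -> int:
    """LLM-style rule: count every token ID that is not the pad token."""
    return sum(1 for row in batch for token in row if token != pad_token_id)


# Two examples: one number feature and one text feature padded to length 3.
inputs = {"age": [31, 45], "review": [[5, 9, 2], [7, 0, 0]]}
targets = {"label": [1, 0]}

ecd_tokens = count_used_tokens_ecd(inputs, targets)
llm_tokens = count_non_pad_tokens([[5, 9, 2], [7, 0, 0]], pad_token_id=0)
```

Note the difference between the two rules in this sketch: the ECD-style count includes padded positions because it measures total tensor size, while the LLM-style count skips them.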
================================================
FILE: ludwig/api.py
================================================
# !/usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
File name: LudwigModel.py
Author: Piero Molino
Date created: 5/21/2019
Python Version: 3+
"""
import copy
import dataclasses
import logging
import os
import sys
import tempfile
import time
import traceback
from collections import OrderedDict
from dataclasses import dataclass
from pprint import pformat
from typing import Any, ClassVar
import numpy as np
import pandas as pd
import torch
from tabulate import tabulate
from ludwig.api_annotations import PublicAPI
from ludwig.backend import Backend, initialize_backend, provision_preprocessing_workers
from ludwig.callbacks import Callback
from ludwig.constants import (
AUTO,
BATCH_SIZE,
EVAL_BATCH_SIZE,
FALLBACK_BATCH_SIZE,
FULL,
HYPEROPT,
HYPEROPT_WARNING,
MIN_DATASET_SPLIT_ROWS,
MODEL_ECD,
MODEL_LLM,
TEST,
TIMESERIES,
TRAINING,
VALIDATION,
)
from ludwig.data.cache.types import CacheableDataset
from ludwig.data.dataset.base import Dataset
from ludwig.data.postprocessing import convert_predictions, postprocess
from ludwig.data.preprocessing import load_metadata, preprocess_for_prediction, preprocess_for_training
from ludwig.datasets import load_dataset_uris
from ludwig.features.feature_registries import update_config_with_metadata, update_config_with_model
from ludwig.globals import (
LUDWIG_VERSION,
MODEL_FILE_NAME,
MODEL_HYPERPARAMETERS_FILE_NAME,
MODEL_WEIGHTS_FILE_NAME,
set_disable_progressbar,
TRAIN_SET_METADATA_FILE_NAME,
TRAINING_CHECKPOINTS_DIR_PATH,
)
from ludwig.models.base import BaseModel
from ludwig.models.calibrator import Calibrator
from ludwig.models.inference import InferenceModule, save_ludwig_model_for_inference
from ludwig.models.predictor import (
calculate_overall_stats,
print_evaluation_stats,
save_evaluation_stats,
save_prediction_outputs,
)
from ludwig.models.registry import model_type_registry
from ludwig.schema.model_config import ModelConfig
from ludwig.types import ModelConfigDict, TrainingSetMetadataDict
from ludwig.upload import get_upload_registry
from ludwig.utils import metric_utils
from ludwig.utils.backward_compatibility import upgrade_config_dict_to_latest_version
from ludwig.utils.config_utils import get_preprocessing_params
from ludwig.utils.data_utils import (
figure_data_format,
generate_kfold_splits,
load_dataset,
load_json,
load_yaml,
save_json,
)
from ludwig.utils.dataset_utils import generate_dataset_statistics
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import makedirs, path_exists, upload_output_directory
from ludwig.utils.heuristics import get_auto_learning_rate
from ludwig.utils.llm_utils import create_text_streamer, TextStreamer
from ludwig.utils.misc_utils import (
get_commit_hash,
get_file_names,
get_from_registry,
get_output_directory,
set_saved_weights_in_checkpoint_flag,
)
from ludwig.utils.print_utils import print_boxed
from ludwig.utils.tokenizers import HFTokenizer
from ludwig.utils.torch_utils import DEVICE
from ludwig.utils.trainer_utils import get_training_report
from ludwig.utils.types import DataFrame, TorchDevice
from ludwig.utils.upload_utils import HuggingFaceHub
logger = logging.getLogger(__name__)
@PublicAPI
@dataclass
class EvaluationFrequency: # noqa F821
"""Represents the frequency of periodic evaluation of a metric during training. For example:
"every epoch"
frequency: 1, period: EPOCH
"every 50 steps"
frequency: 50, period: STEP
"""
frequency: float = 1.0
period: str = "epoch" # One of "epoch" or "step".
EPOCH: ClassVar[str] = "epoch" # One epoch is a single pass through the training set.
STEP: ClassVar[str] = "step" # One step is training on one mini-batch.
@PublicAPI
@dataclass
class TrainingStats: # noqa F821
"""Training stats were previously represented as a tuple or a dict.
This class replaces those while preserving dict and tuple-like behavior (unpacking, [] access).
"""
training: dict[str, Any]
validation: dict[str, Any]
test: dict[str, Any]
evaluation_frequency: EvaluationFrequency = dataclasses.field(default_factory=EvaluationFrequency)
# TODO(daniel): deprecate multiple return value unpacking and dictionary-style element access
def __iter__(self):
return iter((self.training, self.test, self.validation))
def __contains__(self, key):
return (
(key == TRAINING and self.training)
or (key == VALIDATION and self.validation)
or (key == TEST and self.test)
)
def __getitem__(self, key):
# Supports dict-style [] element access for compatibility.
return {TRAINING: self.training, VALIDATION: self.validation, TEST: self.test}[key]
@PublicAPI
@dataclass
class PreprocessedDataset: # noqa F821
training_set: Dataset
validation_set: Dataset
test_set: Dataset
training_set_metadata: TrainingSetMetadataDict
# TODO(daniel): deprecate multiple return value unpacking and indexed access
def __iter__(self):
return iter((self.training_set, self.validation_set, self.test_set, self.training_set_metadata))
def __getitem__(self, index):
return (self.training_set, self.validation_set, self.test_set, self.training_set_metadata)[index]
@PublicAPI
@dataclass
class TrainingResults: # noqa F821
train_stats: TrainingStats
preprocessed_data: PreprocessedDataset
output_directory: str
def __iter__(self):
"""Supports tuple-style return value unpacking ex.
train_stats, training_set, output_dir = model.train(...)
"""
return iter((self.train_stats, self.preprocessed_data, self.output_directory))
def __getitem__(self, index):
"""Provides indexed getter ex.
train_stats = model.train(...)[0]
"""
return (self.train_stats, self.preprocessed_data, self.output_directory)[index]
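
The result classes above pair named fields with `__iter__` and `__getitem__` so that older tuple-style call sites keep working. This is a minimal standalone sketch of that pattern. The `Results` class and its field values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Results:
    """Hypothetical stand-in for a result object with three named fields."""

    stats: dict
    data: tuple
    output_directory: str

    def _as_tuple(self):
        # Single source of truth for the positional order of the fields.
        return (self.stats, self.data, self.output_directory)

    def __iter__(self):
        # Enables: stats, data, out_dir = results
        return iter(self._as_tuple())

    def __getitem__(self, index):
        # Enables: results[0]
        return self._as_tuple()[index]


results = Results({"loss": [0.9, 0.4]}, ("train", "val", "test"), "results/run_0")

stats, data, out_dir = results  # tuple-style unpacking
first = results[0]  # indexed access
by_name = results.output_directory  # attribute access
```

Routing both dunder methods through one helper keeps the positional order defined in a single place, which avoids the two orderings drifting apart.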
@PublicAPI
class LudwigModel:
"""Class that allows access to high level Ludwig functionalities.
# Inputs
:param config: (Union[str, dict]) in-memory representation of
config or string path to a YAML config file.
:param logging_level: (int) Log level that will be sent to stderr.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param gpus: (Union[str, int, List[int]], default: `None`) GPUs
to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow Torch
to use multithreading parallelism to improve performance at the
cost of determinism.
# Example usage:
```python
from ludwig.api import LudwigModel
```
Train a model:
```python
config = {...}
ludwig_model = LudwigModel(config)
train_stats, _, _ = ludwig_model.train(dataset=file_path)
```
or
```python
train_stats, _, _ = ludwig_model.train(dataset=dataframe)
```
If you have already trained a model you can load it and use it to predict
```python
ludwig_model = LudwigModel.load(model_dir)
```
Predict:
```python
predictions, _ = ludwig_model.predict(dataset=file_path)
```
or
```python
predictions, _ = ludwig_model.predict(dataset=dataframe)
```
Evaluation:
```python
eval_stats, _, _ = ludwig_model.evaluate(dataset=file_path)
```
or
```python
eval_stats, _, _ = ludwig_model.evaluate(dataset=dataframe)
```
"""
def __init__(
self,
config: str | dict,
logging_level: int = logging.ERROR,
backend: Backend | str | None = None,
gpus: str | int | list[int] | None = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] | None = None,
) -> None:
"""Constructor for the Ludwig Model class.
# Inputs
:param config: (Union[str, dict]) in-memory representation of
config or string path to a YAML config file.
:param logging_level: (int) Log level that will be sent to stderr.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param gpus: (Union[str, int, List[int]], default: `None`) GPUs
to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow Torch
to use multithreading parallelism to improve performance at the
cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
# Return
:return: (None) `None`
"""
# check if config is a path or a dict
if isinstance(config, str): # assume path
config_dict = load_yaml(config)
self.config_fp = config
else:
config_dict = copy.deepcopy(config)
self.config_fp = None # type: ignore [assignment]
self._user_config = upgrade_config_dict_to_latest_version(config_dict)
# Initialize the config object
self.config_obj = ModelConfig.from_dict(self._user_config)
# setup logging
self.set_logging_level(logging_level)
# setup Backend
self.backend = initialize_backend(backend or self._user_config.get("backend"))
self.callbacks = callbacks if callbacks is not None else []
# setup PyTorch env (GPU allocation, etc.)
self.backend.initialize_pytorch(
gpus=gpus, gpu_memory_limit=gpu_memory_limit, allow_parallel_threads=allow_parallel_threads
)
# setup model
self.model = None
self.training_set_metadata: dict[str, dict] | None = None
# online training state
self._online_trainer = None
# Zero-shot LLM usage.
if (
self.config_obj.model_type == MODEL_LLM
and self.config_obj.trainer.type == "none"
# Category output features require a vocabulary. The LLM LudwigModel should be initialized with
# model.train(dataset).
and self.config_obj.output_features[0].type == "text"
):
self._initialize_llm()
def _initialize_llm(self, random_seed: int = default_random_seed):
"""Initialize the LLM model.
Should only be used in a zero-shot (NoneTrainer) setting.
"""
self.model = LudwigModel.create_model(self.config_obj, random_seed=random_seed)
if self.model.model.device.type == "cpu" and torch.cuda.is_available():
logger.warning(f"LLM was initialized on {self.model.model.device}. Moving to GPU for inference.")
self.model.model.to(torch.device("cuda"))
def train(
self,
dataset: str | dict | pd.DataFrame | None = None,
training_set: str | dict | pd.DataFrame | Dataset | None = None,
validation_set: str | dict | pd.DataFrame | Dataset | None = None,
test_set: str | dict | pd.DataFrame | Dataset | None = None,
training_set_metadata: str | dict | None = None,
data_format: str | None = None,
experiment_name: str = "api_experiment",
model_name: str = "run",
model_resume_path: str | None = None,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = False,
output_directory: str | None = "results",
random_seed: int = default_random_seed,
**kwargs,
) -> TrainingResults:
"""This function is used to perform a full training of the model on the specified dataset.
During training if the skip parameters are False
the model and statistics will be saved in a directory
`[output_dir]/[experiment_name]_[model_name]_n` where all variables are
resolved to user specified ones and `n` is an increasing number
starting from 0 used to differentiate among repeated runs.
# Inputs
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the experiment.
If it has a split column, it will be used for splitting
(0 for train, 1 for validation, 2 for test),
otherwise the dataset will be randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input dataset created the
first time an input file is used in the same directory with the
same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`,
`'feather'`, `'fwf'`,
`'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`),
`'json'`, `'jsonl'`, `'parquet'`,
`'pickle'` (pickled Pandas DataFrame),
`'sas'`, `'spss'`, `'stata'`, `'tsv'`.
:param experiment_name: (str, default: `'api_experiment'`) name for
the experiment.
:param model_name: (str, default: `'run'`) name of the model that is
being used.
:param model_resume_path: (str, default: `None`) resumes training of
the model from the path specified. The config is restored.
In addition to config, training statistics, loss for each
epoch and the state of the optimizer are restored such that
training can be effectively continued from a previously interrupted
training process.
:param skip_save_training_description: (bool, default: `False`)
disables saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`)
disables saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch in which
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch for enabling resuming of training, but if
the model is really big that can be time consuming and will use
twice as much space, use this parameter to skip it, but training
cannot be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving
TensorBoard logs. By default Ludwig saves logs for the TensorBoard,
but if it is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `False`) if input
dataset is provided it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param random_seed: (int, default: `42`) a random seed that will be
used anywhere there is a call to a random number generator: data
splitting, parameter initialization and training set shuffling
:param kwargs: (dict, default: {}) a dictionary of optional parameters.
# Return
:return: (TrainingResults) object that unpacks like a tuple into
`(training_statistics, preprocessed_data, output_directory)`.
`training_statistics` is a nested dictionary of dataset -> feature_name -> metric_name -> List of metrics.
Each metric corresponds to each training checkpoint.
`preprocessed_data` unpacks into
`(training_set, validation_set, test_set, training_set_metadata)`.
`output_directory` filepath to where training results are stored.
"""
# Only reset the metadata if the model has not been trained before
if self.training_set_metadata:
logger.warning(
"This model has been trained before. Its architecture has been defined by the original training set "
"(for example, the number of possible categorical outputs). The current training data will be mapped "
"to this architecture. If you want to change the architecture of the model, please concatenate your "
"new training data with the original and train a new model from scratch."
)
training_set_metadata = self.training_set_metadata
if self._user_config.get(HYPEROPT):
print_boxed("WARNING")
logger.warning(HYPEROPT_WARNING)
# setup directories and file names
if model_resume_path is not None:
if path_exists(model_resume_path):
output_directory = model_resume_path
if self.backend.is_coordinator():
logger.info(f"Model resume path '{model_resume_path}' exists, trying to resume training.")
else:
if self.backend.is_coordinator():
logger.info(
f"Model resume path '{model_resume_path}' does not exist, starting training from scratch"
)
model_resume_path = None
if model_resume_path is None:
if self.backend.is_coordinator():
output_directory = get_output_directory(output_directory, experiment_name, model_name)
else:
output_directory = None
# if we are skipping all saving,
# there is no need to create a directory that will remain empty
should_create_output_directory = not (
skip_save_training_description
and skip_save_training_statistics
and skip_save_model
and skip_save_progress
and skip_save_log
and skip_save_processed_input
)
output_url = output_directory
with upload_output_directory(output_directory) as (output_directory, upload_fn):
train_callbacks = self.callbacks
if upload_fn is not None:
# Upload output files (checkpoints, etc.) to remote storage at the end of
# each epoch and evaluation, in case of failure in the middle of training.
class UploadOnEpochEndCallback(Callback):
def on_eval_end(self, trainer, progress_tracker, save_path):
upload_fn()
def on_epoch_end(self, trainer, progress_tracker, save_path):
upload_fn()
train_callbacks = train_callbacks + [UploadOnEpochEndCallback()]
description_fn = training_stats_fn = model_dir = None
if self.backend.is_coordinator():
if should_create_output_directory:
makedirs(output_directory, exist_ok=True)
description_fn, training_stats_fn, model_dir = get_file_names(output_directory)
if isinstance(training_set, Dataset) and training_set_metadata is not None:
preprocessed_data = (training_set, validation_set, test_set, training_set_metadata)
else:
# save description
if self.backend.is_coordinator():
description = get_experiment_description(
self.config_obj.to_dict(),
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
backend=self.backend,
random_seed=random_seed,
)
if not skip_save_training_description:
save_json(description_fn, description)
# print description
experiment_description = [
["Experiment name", experiment_name],
["Model name", model_name],
["Output directory", output_directory],
]
for key, value in description.items():
if key != "config": # Config is printed separately.
experiment_description.append([key, pformat(value, indent=4)])
if self.backend.is_coordinator():
print_boxed("EXPERIMENT DESCRIPTION")
logger.info(tabulate(experiment_description, tablefmt="fancy_grid"))
print_boxed("LUDWIG CONFIG")
logger.info("User-specified config (with upgrades):\n")
logger.info(pformat(self._user_config, indent=4))
logger.info(
"\nFull config saved to:\n"
f"{output_directory}/{experiment_name}/model/model_hyperparameters.json"
)
preprocessed_data = self.preprocess( # type: ignore[assignment]
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
model_resume_path=model_resume_path,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
output_directory=output_directory,
random_seed=random_seed,
**kwargs,
)
training_set, validation_set, test_set, training_set_metadata = preprocessed_data
self.training_set_metadata = training_set_metadata
if self.backend.is_coordinator():
dataset_statistics = generate_dataset_statistics(training_set, validation_set, test_set)
if not skip_save_model:
# save train set metadata
os.makedirs(model_dir, exist_ok=True) # type: ignore[arg-type]
save_json( # type: ignore[arg-type]
os.path.join(model_dir, TRAIN_SET_METADATA_FILE_NAME), training_set_metadata
)
logger.info("\nDataset Statistics")
logger.info(tabulate(dataset_statistics, headers="firstrow", tablefmt="fancy_grid"))
for callback in self.callbacks:
callback.on_train_init(
base_config=self._user_config,
experiment_directory=output_directory,
experiment_name=experiment_name,
model_name=model_name,
output_directory=output_directory,
resume_directory=model_resume_path,
)
# Build model if not provided
# if it was provided it means it was already loaded
if not self.model:
if self.backend.is_coordinator():
print_boxed("MODEL")
# update model config with metadata properties derived from training set
update_config_with_metadata(self.config_obj, training_set_metadata)
logger.info("Warnings and other logs:")
self.model = LudwigModel.create_model(self.config_obj, random_seed=random_seed)
# update config with properties determined during model instantiation
update_config_with_model(self.config_obj, self.model)
set_saved_weights_in_checkpoint_flag(self.config_obj)
# auto tune learning rate
if hasattr(self.config_obj.trainer, "learning_rate") and self.config_obj.trainer.learning_rate == AUTO:
detected_learning_rate = get_auto_learning_rate(self.config_obj)
self.config_obj.trainer.learning_rate = detected_learning_rate
with self.backend.create_trainer(
model=self.model,
config=self.config_obj.trainer,
resume=model_resume_path is not None,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
callbacks=train_callbacks,
random_seed=random_seed,
) as trainer:
# auto tune batch size
self._tune_batch_size(trainer, training_set, random_seed=random_seed)
if (
self.config_obj.model_type == MODEL_LLM
and trainer.config.type == "none"
and self.config_obj.adapter is not None
and self.config_obj.adapter.pretrained_adapter_weights is not None
):
trainer.model.initialize_adapter() # Load pre-trained adapter weights for inference only
# train model
if self.backend.is_coordinator():
print_boxed("TRAINING")
if not skip_save_model:
self.save_config(model_dir)
for callback in self.callbacks:
callback.on_train_start(
model=self.model,
config=self.config_obj.to_dict(),
config_fp=self.config_fp,
)
try:
train_stats = trainer.train(
training_set,
validation_set=validation_set,
test_set=test_set,
save_path=model_dir,
)
self.model, train_trainset_stats, train_valiset_stats, train_testset_stats = train_stats
# Calibrates output feature probabilities on validation set if calibration is enabled.
# Must be done after training, and before final model parameters are saved.
if self.backend.is_coordinator():
calibrator = Calibrator(
self.model,
self.backend,
batch_size=trainer.eval_batch_size,
)
if calibrator.calibration_enabled():
if validation_set is None:
logger.warning(
"Calibration uses validation set, but no validation split specified. "
"Will use training set for calibration. "
"Recommend providing a validation set when using calibration."
)
calibrator.train_calibration(training_set, TRAINING)
elif len(validation_set) < MIN_DATASET_SPLIT_ROWS:
logger.warning(
f"Validation set size ({len(validation_set)} rows) is too small for calibration. "
"Will use training set for calibration. "
f"Validation set must have at least {MIN_DATASET_SPLIT_ROWS} rows."
)
calibrator.train_calibration(training_set, TRAINING)
else:
calibrator.train_calibration(validation_set, VALIDATION)
if not skip_save_model:
self.model.save(model_dir)
# Evaluation Frequency
if self.config_obj.model_type == MODEL_ECD and self.config_obj.trainer.steps_per_checkpoint:
evaluation_frequency = EvaluationFrequency(
self.config_obj.trainer.steps_per_checkpoint, EvaluationFrequency.STEP
)
elif self.config_obj.model_type == MODEL_ECD and self.config_obj.trainer.checkpoints_per_epoch:
evaluation_frequency = EvaluationFrequency(
1.0 / self.config_obj.trainer.checkpoints_per_epoch, EvaluationFrequency.EPOCH
)
else:
evaluation_frequency = EvaluationFrequency(1, EvaluationFrequency.EPOCH)
# Unpack train()'s return.
# The statistics are all nested dictionaries of TrainerMetrics: feature_name -> metric_name ->
# List[TrainerMetric], with one entry per training checkpoint, according to steps_per_checkpoint.
# We reduce the dictionary of TrainerMetrics to a simple list of floats for interfacing with Ray
# Tune.
train_stats = TrainingStats(
metric_utils.reduce_trainer_metrics_dict(train_trainset_stats),
metric_utils.reduce_trainer_metrics_dict(train_valiset_stats),
metric_utils.reduce_trainer_metrics_dict(train_testset_stats),
evaluation_frequency,
)
# save training statistics
if self.backend.is_coordinator():
if not skip_save_training_statistics:
save_json(training_stats_fn, train_stats)
# results of the model with the highest validation performance
if (
self.backend.is_coordinator()
and validation_set is not None
and not self.config_obj.trainer.skip_all_evaluation
):
print_boxed("TRAINING REPORT")
training_report = get_training_report(
trainer.validation_field,
trainer.validation_metric,
test_set is not None,
train_valiset_stats,
train_testset_stats,
)
logger.info(tabulate(training_report, tablefmt="fancy_grid"))
logger.info(f"\nFinished: {experiment_name}_{model_name}")
logger.info(f"Saved to: {output_directory}")
finally:
for callback in self.callbacks:
callback.on_train_end(output_directory)
self.training_set_metadata = training_set_metadata
if self.is_merge_and_unload_set():
# For an LLM model trained with a LoRA adapter, merge first, then save the full model.
self.model.merge_and_unload(progressbar=self.config_obj.adapter.postprocessor.progressbar)
if self.backend.is_coordinator() and not skip_save_model:
self.model.save_base_model(model_dir)
elif self.backend.is_coordinator() and not skip_save_model:
self.model.save(model_dir)
# Synchronize model weights between workers
self.backend.sync_model(self.model)
print_boxed("FINISHED")
return TrainingResults(train_stats, preprocessed_data, output_url)
def train_online(
self,
dataset: str | dict | pd.DataFrame,
training_set_metadata: str | dict | None = None,
data_format: str = "auto",
random_seed: int = default_random_seed,
) -> None:
"""Performs one epoch of training of the model on `dataset`.
# Inputs
:param dataset: (Union[str, dict, pandas.DataFrame])
source containing the entire dataset to be used in the experiment.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `'auto'`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param random_seed: (int, default: `42`) a random seed that is going to be
used anywhere there is a call to a random number generator: data
splitting, parameter initialization and training set shuffling
# Return
:return: (None) `None`
"""
training_set_metadata = training_set_metadata or self.training_set_metadata
preprocessing_params = get_preprocessing_params(self.config_obj)
with provision_preprocessing_workers(self.backend):
# TODO (Connor): Refactor to use self.config_obj
training_dataset, _, _, training_set_metadata = preprocess_for_training(
self.config_obj.to_dict(),
training_set=dataset,
training_set_metadata=training_set_metadata,
data_format=data_format,
skip_save_processed_input=True,
preprocessing_params=preprocessing_params,
backend=self.backend,
random_seed=random_seed,
callbacks=self.callbacks,
)
if not self.training_set_metadata:
self.training_set_metadata = training_set_metadata
if not self.model:
update_config_with_metadata(self.config_obj, training_set_metadata)
self.model = LudwigModel.create_model(self.config_obj, random_seed=random_seed)
# update config with properties determined during model instantiation
update_config_with_model(self.config_obj, self.model)
set_saved_weights_in_checkpoint_flag(self.config_obj)
if not self._online_trainer:
self._online_trainer = self.backend.create_trainer(
config=self.config_obj.trainer, model=self.model, random_seed=random_seed
)
self._tune_batch_size(self._online_trainer, dataset, random_seed=random_seed)
self.model = self._online_trainer.train_online(training_dataset)
def _tune_batch_size(self, trainer, dataset, random_seed: int = default_random_seed):
"""Sets AUTO batch-size-related parameters based on the trainer, backend type, and number of workers.
Batch-size related parameters that are set:
- trainer.batch_size
- trainer.eval_batch_size
- trainer.gradient_accumulation_steps
- trainer.effective_batch_size
The final batch size selected may be non-deterministic even with a fixed random seed since throughput-based
heuristics may be affected by resources used by other processes running on the machine.
"""
if not self.config_obj.trainer.can_tune_batch_size():
# Some model types don't have batch sizes to be tuned
return
# Render the batch size and gradient accumulation steps prior to batch size tuning. This is needed in the event
# the effective_batch_size and gradient_accumulation_steps are set explicitly, but batch_size is AUTO. In this
# case, we can infer the batch_size directly without tuning.
num_workers = self.backend.num_training_workers
self.config_obj.trainer.update_batch_size_grad_accum(num_workers)
# TODO (ASN): add support for substitute_with_max parameter
# TODO(travis): detect train and eval batch sizes separately (enable / disable gradients)
if self.config_obj.trainer.batch_size == AUTO:
if self.backend.supports_batch_size_tuning():
tuned_batch_size = trainer.tune_batch_size(
self.config_obj.to_dict(), dataset, random_seed=random_seed, tune_for_training=True
)
else:
logger.warning(
f"Backend {self.backend.BACKEND_TYPE} does not support batch size tuning, "
f"using fallback training batch size {FALLBACK_BATCH_SIZE}."
)
tuned_batch_size = FALLBACK_BATCH_SIZE
# TODO(travis): pass these in as args to trainer when we call train,
# to avoid setting state on possibly remote trainer
self.config_obj.trainer.batch_size = tuned_batch_size
# Re-render the gradient_accumulation_steps to account for the explicit batch size.
self.config_obj.trainer.update_batch_size_grad_accum(num_workers)
if self.config_obj.trainer.eval_batch_size in {AUTO, None}:
if self.backend.supports_batch_size_tuning():
tuned_batch_size = trainer.tune_batch_size(
self.config_obj.to_dict(), dataset, random_seed=random_seed, tune_for_training=False
)
else:
logger.warning(
f"Backend {self.backend.BACKEND_TYPE} does not support batch size tuning, "
f"using fallback eval batch size {FALLBACK_BATCH_SIZE}."
)
tuned_batch_size = FALLBACK_BATCH_SIZE
self.config_obj.trainer.eval_batch_size = tuned_batch_size
# Update trainer params separate to config params for backends with stateful trainers
trainer.batch_size = self.config_obj.trainer.batch_size
trainer.eval_batch_size = self.config_obj.trainer.eval_batch_size
trainer.gradient_accumulation_steps = self.config_obj.trainer.gradient_accumulation_steps
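# Illustrative sketch (not part of Ludwig): the comment in `_tune_batch_size` notes that `batch_size` can be
# inferred without tuning when `effective_batch_size` and `gradient_accumulation_steps` are both explicit.
# The helper below is hypothetical and ASSUMES the relation
# effective_batch_size = batch_size * gradient_accumulation_steps * num_workers; the real logic lives in
# `update_batch_size_grad_accum` and may differ.

```python
def infer_batch_size(effective_batch_size: int, gradient_accumulation_steps: int, num_workers: int) -> int:
    """Infer the per-worker batch size from the other batch-size parameters.

    Assumes effective_batch_size = batch_size * gradient_accumulation_steps * num_workers.
    """
    divisor = gradient_accumulation_steps * num_workers
    if divisor <= 0:
        raise ValueError("gradient_accumulation_steps and num_workers must be positive.")
    if effective_batch_size % divisor != 0:
        raise ValueError(
            f"effective_batch_size {effective_batch_size} is not divisible by "
            f"gradient_accumulation_steps * num_workers = {divisor}."
        )
    return effective_batch_size // divisor


# Example: 128 samples per optimizer step, 4 accumulation steps, 2 workers.
print(infer_batch_size(128, 4, 2))
```

# Under the stated assumption, each worker would process 16 samples per forward pass in this example.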
def save_dequantized_base_model(self, save_path: str) -> None:
"""Upscales quantized weights of a model to fp16 and saves the result in a specified folder.
Args:
save_path (str): The path to the folder where the upscaled model weights will be saved.
Raises:
ValueError:
If the model type is not 'llm', if quantization is not enabled, or if the number of bits is not 4.
RuntimeError:
If no GPU is available, as GPU is required for quantized models.
Returns:
None
"""
if self.config_obj.model_type != MODEL_LLM:
raise ValueError(
f"Model type {self.config_obj.model_type} is not supported by this method. Only `llm` model type is "
"supported."
)
if not self.config_obj.quantization:
raise ValueError(
"Quantization is not enabled in your Ludwig model config. "
"To enable quantization, set `quantization` to `{'bits': 4}` or `{'bits': 8}` in your model config."
)
if self.config_obj.quantization.bits != 4:
raise ValueError(
"This method only works with quantized models with 4 bits. "
"Support for 8-bit quantized models will be added in a future release."
)
if not torch.cuda.is_available():
raise RuntimeError("GPU is required for quantized models but no GPU found.")
# Create the LLM model class instance with the loaded LLM if it hasn't been initialized yet.
if not self.model:
self.model = LudwigModel.create_model(self.config_obj)
self.model.save_dequantized_base_model(save_path)
logger.info(
"If you want to upload this model to huggingface.co, run the following Python commands: \n"
"from ludwig.utils.hf_utils import upload_folder_to_hfhub; \n"
f"upload_folder_to_hfhub(repo_id='desired/huggingface/repo/name', folder_path='{save_path}')"
)
def generate(
self,
input_strings: str | list[str],
generation_config: dict | None = None,
streaming: bool | None = False,
) -> str | list[str]:
"""A simple generate() method that directly uses the underlying transformers library to generate text.
Args:
input_strings (Union[str, List[str]]): Input text or list of texts to generate from.
generation_config (Optional[dict]): Configuration for text generation.
streaming (Optional[bool]): If True, enable streaming output.
Returns:
Union[str, List[str]]: Generated text or list of generated texts.
"""
if self.config_obj.model_type != MODEL_LLM:
raise ValueError(
f"Model type {self.config_obj.model_type} is not supported by this method. Only `llm` model type is "
"supported."
)
if not torch.cuda.is_available():
# GPU is generally well-advised for working with LLMs and is required for loading quantized models, see
# https://github.com/ludwig-ai/ludwig/issues/3695.
raise ValueError("GPU is not available.")
# TODO(Justin): Decide if it's worth folding padding_side handling into llm.py's tokenizer initialization.
# For batch inference with models like facebook/opt-350m, if the tokenizer padding side is off, HF prints a
# warning, e.g.:
# "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, "
# "please set `padding_side='left'` when initializing the tokenizer.
padding_side = "left" if not self.model.model.config.is_encoder_decoder else "right"
tokenizer = HFTokenizer(self.config_obj.base_model, padding_side=padding_side)
with self.model.use_generation_config(generation_config):
start_time = time.time()
tokenized_inputs = tokenizer.tokenizer(input_strings, return_tensors="pt", padding=True)
input_ids = tokenized_inputs["input_ids"].to("cuda")
attention_mask = tokenized_inputs["attention_mask"].to("cuda")
if streaming:
streamer = create_text_streamer(tokenizer.tokenizer)
outputs = self._generate_streaming_outputs(input_strings, input_ids, attention_mask, streamer)
else:
outputs = self._generate_non_streaming_outputs(input_strings, input_ids, attention_mask)
decoded_outputs = tokenizer.tokenizer.batch_decode(outputs, skip_special_tokens=True)
logger.info(f"Finished generating in: {(time.time() - start_time):.2f}s.")
return decoded_outputs[0] if len(decoded_outputs) == 1 else decoded_outputs
def _generate_streaming_outputs(
self,
input_strings: str | list[str],
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
streamer: TextStreamer,
) -> torch.Tensor:
"""Generate streaming outputs for the given input.
Args:
input_strings (Union[str, List[str]]): Input text or list of texts to generate from.
input_ids (torch.Tensor): Tensor containing input IDs.
attention_mask (torch.Tensor): Tensor containing attention masks.
streamer (TextStreamer): Text streamer instance for streaming output.
Returns:
torch.Tensor: Concatenated tensor of generated outputs.
"""
outputs = []
input_strings = input_strings if isinstance(input_strings, list) else [input_strings]
for i in range(len(input_ids)):
with torch.no_grad():
logger.info(f"Input: {input_strings[i]}\n")
# NOTE: self.model.model.generation_config is not used here because it is the default
# generation config that the CausalLM was initialized with, rather than the one set within the
# context manager.
generated_output = self.model.model.generate(
input_ids=input_ids[i].unsqueeze(0),
attention_mask=attention_mask[i].unsqueeze(0),
generation_config=self.model.generation,
streamer=streamer,
)
logger.info("----------------------")
outputs.append(generated_output)
return torch.cat(outputs, dim=0)
def _generate_non_streaming_outputs(
self,
_input_strings: str | list[str],
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
) -> torch.Tensor:
"""Generate non-streaming outputs for the given input.
Args:
_input_strings (Union[str, List[str]]): Unused input parameter.
input_ids (torch.Tensor): Tensor containing input IDs.
attention_mask (torch.Tensor): Tensor containing attention masks.
Returns:
torch.Tensor: Tensor of generated outputs.
"""
with torch.no_grad():
# NOTE: self.model.model.generation_config is not used here because it is the default
# generation config that the CausalLM was initialized with, rather than the one set within the
# context manager.
return self.model.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
generation_config=self.model.generation,
)
def predict(
self,
dataset: str | dict | pd.DataFrame | None = None,
data_format: str | None = None,
split: str = FULL,
batch_size: int = 128,
generation_config: dict | None = None,
skip_save_unprocessed_output: bool = True,
skip_save_predictions: bool = True,
output_directory: str = "results",
return_type: str | dict | pd.DataFrame = pd.DataFrame,
callbacks: list[Callback] | None = None,
**kwargs,
) -> tuple[dict | pd.DataFrame, str]:
"""Using a trained model, make predictions from the provided dataset.
# Inputs
:param dataset: (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
:param data_format: (str, default: `None`) format to interpret data sources. Will be inferred automatically
if not specified. Valid formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training), `'html'` (file containing a single
HTML `<table>`), `'json'`, `'jsonl'`, `'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`,
`'spss'`, `'stata'`, `'tsv'`.
:param split: (str, default= `'full'`): if the input dataset contains a split column, this parameter
indicates which split of the data to use. Possible values are `'full'`, `'training'`, `'validation'`,
`'test'`.
:param batch_size: (int, default: 128) size of batch to use when making predictions.
:param generation_config: (Dict, default: `None`) config for the generation of the
predictions. If `None`, the config that was used during model training is
used. This is only used if the model type is LLM. Otherwise, this parameter is
ignored. See
[Large Language Models](https://ludwig.ai/latest/configuration/large_language_model/#generation) under
"Generation" for an example generation config.
:param skip_save_unprocessed_output: (bool, default: `True`) if this parameter is `False`, predictions and
their probabilities are saved in both raw unprocessed numpy files containing tensors and as
postprocessed CSV files (one for each output feature). If this parameter is `True`, only the CSV ones
are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `True`) skips saving test predictions CSV files.
:param output_directory: (str, default: `'results'`) the directory that will contain the training
statistics, TensorBoard logs, the saved model and the training progress files.
:param return_type: (Union[str, dict, pandas.DataFrame], default: pd.DataFrame) indicates the format of the
returned predictions.
:param callbacks: (Optional[List[Callback]], default: None) optional list of callbacks to use during this
predict operation. Any callbacks already registered to the model will be preserved.
# Return
:return `(predictions, output_directory)`: (Tuple[Union[dict, pd.DataFrame], str])
`predictions` predictions from the provided dataset,
`output_directory` filepath string to where data was stored.
"""
self._check_initialization()
# preprocessing
start_time = time.time()
logger.debug("Preprocessing")
dataset, _ = preprocess_for_prediction( # TODO (Connor): Refactor to use self.config_obj
self.config_obj.to_dict(),
dataset=dataset,
training_set_metadata=self.training_set_metadata,
data_format=data_format,
split=split,
include_outputs=False,
backend=self.backend,
callbacks=self.callbacks + (callbacks or []),
)
logger.debug("Predicting")
with self.backend.create_predictor(self.model, batch_size=batch_size) as predictor:
with self.model.use_generation_config(generation_config):
predictions = predictor.batch_predict(
dataset,
)
if self.backend.is_coordinator():
# if we are skipping all saving,
# there is no need to create a directory that will remain empty
should_create_exp_dir = not (skip_save_unprocessed_output and skip_save_predictions)
if should_create_exp_dir:
makedirs(output_directory, exist_ok=True)
logger.debug("Postprocessing")
postproc_predictions = postprocess(
predictions,
self.model.output_features,
self.training_set_metadata,
output_directory=output_directory,
backend=self.backend,
skip_save_unprocessed_output=skip_save_unprocessed_output or not self.backend.is_coordinator(),
)
converted_postproc_predictions = convert_predictions(
postproc_predictions, self.model.output_features, return_type=return_type, backend=self.backend
)
if self.backend.is_coordinator():
if not skip_save_predictions:
save_prediction_outputs(
postproc_predictions, self.model.output_features, output_directory, self.backend
)
logger.info(f"Saved to: {output_directory}")
logger.info(f"Finished predicting in: {(time.time() - start_time):.2f}s.")
return converted_postproc_predictions, output_directory
def evaluate(
self,
dataset: str | dict | pd.DataFrame | None = None,
data_format: str | None = None,
split: str = FULL,
batch_size: int | None = None,
skip_save_unprocessed_output: bool = True,
skip_save_predictions: bool = True,
skip_save_eval_stats: bool = True,
collect_predictions: bool = False,
collect_overall_stats: bool = False,
output_directory: str = "results",
return_type: str | dict | pd.DataFrame = pd.DataFrame,
**kwargs,
) -> tuple[dict, dict | pd.DataFrame, str]:
"""This function is used to predict the output variables given the input variables using the trained model
and compute test statistics like performance measures, confusion matrices and the like.
# Inputs
:param dataset: (Union[str, dict, pandas.DataFrame]) source containing
the entire dataset to be evaluated.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param split: (str, default=`'full'`): if the input dataset contains
a split column, this parameter indicates which split of the data
to use. Possible values are `'full'`, `'training'`, `'validation'`, `'test'`.
:param batch_size: (int, default: None) size of batch to use when making
predictions. Defaults to the trainer's `eval_batch_size`, falling back to `batch_size` if that is unset.
:param skip_save_unprocessed_output: (bool, default: `True`) if this
parameter is `False`, predictions and their probabilities are saved
in both raw unprocessed numpy files containing tensors and as
postprocessed CSV files (one for each output feature).
If this parameter is `True`, only the CSV ones are saved and the
numpy ones are skipped.
:param skip_save_predictions: (bool, default: `True`) skips saving
test predictions CSV files.
:param skip_save_eval_stats: (bool, default: `True`) skips saving
test statistics JSON file.
:param collect_predictions: (bool, default: `False`) if `True`
collects post-processed predictions during eval.
:param collect_overall_stats: (bool, default: False) if `True`
collects overall stats during eval.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param return_type: (Union[str, dict, pd.DataFrame], default: pandas.DataFrame) indicates
the format of the returned predictions.
# Return
:return: (`evaluation_statistics`, `predictions`, `output_directory`)
`evaluation_statistics` dictionary containing evaluation performance
statistics,
`predictions` contains predicted values,
`output_directory` is location where results are stored.
"""
self._check_initialization()
for callback in self.callbacks:
callback.on_evaluation_start()
# preprocessing
logger.debug("Preprocessing")
dataset, training_set_metadata = preprocess_for_prediction( # TODO (Connor): Refactor to use self.config_obj
self.config_obj.to_dict(),
dataset=dataset,
training_set_metadata=self.training_set_metadata,
data_format=data_format,
split=split,
include_outputs=True,
backend=self.backend,
callbacks=self.callbacks,
)
# Fallback to use eval_batch_size or batch_size if not provided
if batch_size is None:
# Requires dictionary getter since some trainer configs may not have a batch_size param
batch_size = self.config_obj.trainer.to_dict().get(
EVAL_BATCH_SIZE, None
) or self.config_obj.trainer.to_dict().get(BATCH_SIZE, None)
logger.debug("Predicting")
with self.backend.create_predictor(self.model, batch_size=batch_size) as predictor:
eval_stats, predictions = predictor.batch_evaluation(
dataset,
collect_predictions=collect_predictions or collect_overall_stats,
)
# calculate the overall metrics
if collect_overall_stats:
dataset = dataset.to_df()
overall_stats = calculate_overall_stats(
self.model.output_features, predictions, dataset, training_set_metadata
)
eval_stats = {
of_name: (
{**eval_stats[of_name], **overall_stats[of_name]}
# account for presence of 'combined' key
if of_name in overall_stats
else {**eval_stats[of_name]}
)
for of_name in eval_stats
}
if self.backend.is_coordinator():
# if we are skipping all saving,
# there is no need to create a directory that will remain empty
should_create_exp_dir = not (
skip_save_unprocessed_output and skip_save_predictions and skip_save_eval_stats
)
if should_create_exp_dir:
makedirs(output_directory, exist_ok=True)
if collect_predictions:
logger.debug("Postprocessing")
postproc_predictions = postprocess(
predictions,
self.model.output_features,
self.training_set_metadata,
output_directory=output_directory,
backend=self.backend,
skip_save_unprocessed_output=skip_save_unprocessed_output or not self.backend.is_coordinator(),
)
else:
postproc_predictions = predictions # = {}
if self.backend.is_coordinator():
should_save_predictions = (
collect_predictions and postproc_predictions is not None and not skip_save_predictions
)
if should_save_predictions:
save_prediction_outputs(
postproc_predictions, self.model.output_features, output_directory, self.backend
)
print_evaluation_stats(eval_stats)
if not skip_save_eval_stats:
save_evaluation_stats(eval_stats, output_directory)
if should_save_predictions or not skip_save_eval_stats:
logger.info(f"Saved to: {output_directory}")
if collect_predictions:
postproc_predictions = convert_predictions(
postproc_predictions, self.model.output_features, return_type=return_type, backend=self.backend
)
for callback in self.callbacks:
callback.on_evaluation_end()
return eval_stats, postproc_predictions, output_directory
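# Illustrative sketch (not part of Ludwig): how `evaluate` folds the overall stats into the per-feature
# evaluation stats when `collect_overall_stats` is set. The function name `merge_stats` and the sample
# metric values are hypothetical; only the merge pattern mirrors the dict comprehension above.

```python
def merge_stats(eval_stats: dict, overall_stats: dict) -> dict:
    """Merge per-feature overall stats into eval stats.

    Keys that appear only in `eval_stats` (for example 'combined') are copied unchanged.
    On a metric-name clash, the value from `overall_stats` wins.
    """
    return {
        name: {**stats, **overall_stats.get(name, {})}
        for name, stats in eval_stats.items()
    }


eval_stats = {"label": {"accuracy": 0.9}, "combined": {"loss": 0.3}}
overall_stats = {"label": {"confusion_matrix": [[5, 1], [0, 4]]}}
merged = merge_stats(eval_stats, overall_stats)
print(sorted(merged["label"]))
```

# The 'combined' entry has no counterpart in `overall_stats`, so it passes through as a shallow copy.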
def forecast(
self,
dataset: DataFrame,
data_format: str | None = None,
horizon: int = 1,
output_directory: str | None = None,
output_format: str = "parquet",
) -> DataFrame:
# TODO(travis): WIP
dataset, _, _, _ = load_dataset_uris(dataset, None, None, None, self.backend)
if isinstance(dataset, CacheableDataset):
dataset = dataset.unwrap()
dataset = load_dataset(dataset, data_format=data_format, df_lib=self.backend.df_engine.df_lib)
window_sizes = [
feature.preprocessing.window_size
for feature in self.config_obj.input_features
if feature.type == TIMESERIES
]
if not window_sizes:
raise ValueError("Forecasting requires at least one input feature of type `timeseries`.")
# TODO(travis): there's a lot of redundancy in this approach, since we are preprocessing the same DataFrame
# multiple times with only a small number of features (the horizon) being appended each time.
# A much better approach would be to only preprocess a single row, but incorporating the row-level embedding
# over the window_size of rows preceding it, then performing the model forward pass on only that row of
# data.
max_lookback_window_size = max(window_sizes)
total_forecasted = 0
while total_forecasted < horizon:
# We only need the last `window_size` worth of rows to forecast the next value
dataset = dataset.tail(max_lookback_window_size)
# Run through preprocessing and prediction to obtain row-wise next values
# TODO(travis): can optimize the preprocessing part here, since we only need to preprocess / predict
# the last row, not the last `window_size` rows.
preds, _ = self.predict(dataset, skip_save_predictions=True, skip_save_unprocessed_output=True)
next_series = {}
for feature in self.config_obj.output_features:
if feature.type == TIMESERIES:
key = f"{feature.name}_predictions"
next_series[feature.column] = pd.Series(preds[key].iloc[-1])
next_preds = pd.DataFrame(next_series)
dataset = pd.concat([dataset, next_preds], axis=0).reset_index(drop=True)
total_forecasted += len(next_preds)
horizon_df = dataset.tail(total_forecasted).head(horizon)
return_cols = [feature.column for feature in self.config_obj.output_features if feature.type == TIMESERIES]
results_df = horizon_df[return_cols]
if output_directory is not None:
if self.backend.is_coordinator():
# TODO(travis): generalize this to support any pandas output format
if output_format == "parquet":
output_path = os.path.join(output_directory, "forecast.parquet")
results_df.to_parquet(output_path)
elif output_format == "csv":
output_path = os.path.join(output_directory, "forecast.csv")
results_df.to_csv(output_path)
else:
raise ValueError(f"`output_format` {output_format} not supported. Must be one of [parquet, csv]")
logger.info(f"Saved to: {output_path}")
return results_df
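# Illustrative sketch (not part of Ludwig): the loop in `forecast` is an autoregressive rolling-window
# forecast. It keeps the last `window_size` rows, predicts the next value, appends it, and repeats until
# `horizon` values exist. The names `rolling_forecast` and `mean_predictor` are hypothetical, and the mean
# predictor merely stands in for the trained model's `predict` call.

```python
from typing import Callable, List, Sequence


def rolling_forecast(
    history: Sequence[float],
    window_size: int,
    horizon: int,
    predict_next: Callable[[List[float]], float],
) -> List[float]:
    """Forecast `horizon` future values, feeding each prediction back into the window."""
    if window_size <= 0:
        raise ValueError("window_size must be positive.")
    window = list(history)[-window_size:]
    forecast: List[float] = []
    while len(forecast) < horizon:
        next_value = predict_next(window)
        forecast.append(next_value)
        # Slide the window forward so the newest prediction becomes an input.
        window = (window + [next_value])[-window_size:]
    return forecast


def mean_predictor(window: List[float]) -> float:
    """Stand-in for a trained model: predict the mean of the window."""
    return sum(window) / len(window)


print(rolling_forecast([1.0, 2.0, 3.0], window_size=2, horizon=2, predict_next=mean_predictor))
```

# Because each prediction is fed back as input, errors can compound as the horizon grows.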
def experiment(
self,
dataset: str | dict | pd.DataFrame | None = None,
training_set: str | dict | pd.DataFrame | None = None,
validation_set: str | dict | pd.DataFrame | None = None,
test_set: str | dict | pd.DataFrame | None = None,
training_set_metadata: str | dict | None = None,
data_format: str | None = None,
experiment_name: str = "experiment",
model_name: str = "run",
model_resume_path: str | None = None,
eval_split: str = TEST,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = False,
skip_save_unprocessed_output: bool = False,
skip_save_predictions: bool = False,
skip_save_eval_stats: bool = False,
skip_collect_predictions: bool = False,
skip_collect_overall_stats: bool = False,
output_directory: str = "results",
random_seed: int = default_random_seed,
**kwargs,
) -> tuple[dict | None, TrainingStats, PreprocessedDataset, str]:
"""Trains a model on a dataset's training and validation splits and uses it to predict on the test split.
It saves the trained model and the statistics of training and testing.
# Inputs
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the experiment.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param experiment_name: (str, default: `'experiment'`) name for
the experiment.
:param model_name: (str, default: `'run'`) name of the model that is
being used.
:param model_resume_path: (str, default: `None`) resumes training of
the model from the path specified. The config is restored.
In addition to config, training statistics and loss for
epoch and the state of the optimizer are restored such that
training can be effectively continued from a previously interrupted
training process.
:param eval_split: (str, default: `test`) split on which
to perform evaluation. Valid values are `training`, `validation`
and `test`.
:param skip_save_training_description: (bool, default: `False`) disables
saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`) disables
saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch for enabling resuming of training, but if
the model is really big that can be time consuming and will use
twice as much space, use this parameter to skip it, but training
cannot be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving
TensorBoard logs. By default Ludwig saves logs for the TensorBoard,
but if it is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `False`) if an input
dataset is provided, it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param skip_save_unprocessed_output: (bool, default: `False`) by default
predictions and their probabilities are saved in both raw
unprocessed numpy files containing tensors and as postprocessed
CSV files (one for each output feature). If this parameter is True,
only the CSV ones are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files.
:param skip_save_eval_stats: (bool, default: `False`) skips saving test
statistics JSON file.
:param skip_collect_predictions: (bool, default: `False`) skips
collecting post-processed predictions during eval.
:param skip_collect_overall_stats: (bool, default: `False`) skips
collecting overall stats during eval.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param random_seed: (int, default: 42) random seed used for weights
initialization, splits and any other random function.
# Return
:return: (Tuple[dict, dict, tuple, str])
`(evaluation_statistics, training_statistics, preprocessed_data, output_directory)`
`evaluation_statistics` dictionary with evaluation performance
statistics on the test_set,
`training_statistics` is a nested dictionary of dataset -> feature_name -> metric_name -> List of metrics.
Each metric corresponds to each training checkpoint.
`preprocessed_data` tuple containing preprocessed
`(training_set, validation_set, test_set)`, `output_directory`
filepath string to where results are stored.
"""
if self._user_config.get(HYPEROPT):
print_boxed("WARNING")
logger.warning(HYPEROPT_WARNING)
train_stats, preprocessed_data, output_directory = self.train(
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
model_resume_path=model_resume_path,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_unprocessed_output=skip_save_unprocessed_output,
output_directory=output_directory,
random_seed=random_seed,
)
training_set, validation_set, test_set, training_set_metadata = preprocessed_data
eval_set = validation_set
if eval_split == TRAINING:
eval_set = training_set
elif eval_split == VALIDATION:
eval_set = validation_set
elif eval_split == TEST:
eval_set = test_set
else:
logger.warning(f"Eval split {eval_split} not supported. " f"Using validation set instead")
if eval_set is not None:
trainer_dict = self.config_obj.trainer.to_dict()
batch_size = trainer_dict.get(EVAL_BATCH_SIZE, trainer_dict.get(BATCH_SIZE, None))
# predict
try:
eval_stats, _, _ = self.evaluate(
eval_set,
data_format=data_format,
batch_size=batch_size,
output_directory=output_directory,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
collect_predictions=not skip_collect_predictions,
collect_overall_stats=not skip_collect_overall_stats,
return_type="dict",
)
except NotImplementedError:
logger.warning(
"Skipping evaluation as the necessary methods are not "
"supported. Full exception below:\n"
f"{traceback.format_exc()}"
)
eval_stats = None
else:
logger.warning(f"The evaluation set for split {eval_split} was not provided. Skipping evaluation")
eval_stats = None
return eval_stats, train_stats, preprocessed_data, output_directory
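# Illustrative sketch (not part of Ludwig): the split selection in `experiment` maps `eval_split` to one of
# the preprocessed datasets and falls back to the validation set for unrecognized values. The helper name
# `select_eval_set` is hypothetical; the split names follow the values listed in the docstring above.

```python
import logging

logger = logging.getLogger(__name__)


def select_eval_set(eval_split: str, training_set, validation_set, test_set):
    """Return the dataset for `eval_split`, defaulting to the validation set."""
    splits = {"training": training_set, "validation": validation_set, "test": test_set}
    if eval_split not in splits:
        logger.warning(f"Eval split {eval_split} not supported. Using validation set instead")
        return validation_set
    # May be None if that split was not provided; the caller then skips evaluation.
    return splits[eval_split]


print(select_eval_set("test", "train_ds", "val_ds", "test_ds"))
```

# A dictionary lookup keeps the mapping in one place and makes the fallback explicit.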
def collect_weights(self, tensor_names: list[str] | None = None, **kwargs) -> list:
"""Load a pre-trained model and collect the tensors with a specific name.
# Inputs
:param tensor_names: (list, default: `None`) List of tensor names to collect
weights
# Return
:return: (list) List of tensors
"""
self._check_initialization()
collected_tensors = self.model.collect_weights(tensor_names)
return collected_tensors
def collect_activations(
self,
layer_names: list[str],
dataset: str | dict[str, list] | pd.DataFrame,
data_format: str | None = None,
split: str = FULL,
batch_size: int = 128,
**kwargs,
) -> list:
"""Loads a pre-trained model and input data to collect the values of the activations contained in the
tensors.
# Inputs
:param layer_names: (list) list of strings for layer names in the model
to collect activations.
:param dataset: (Union[str, Dict[str, list], pandas.DataFrame]) source
containing the data to make predictions.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param split: (str, default= `'full'`): if the input dataset contains
a split column, this parameter indicates which split of the data
to use. Possible values are `'full'`, `'training'`, `'validation'`, `'test'`.
:param batch_size: (int, default: 128) size of batch to use when making
predictions.
# Return
:return: (list) list of collected tensors.
"""
self._check_initialization()
# preprocessing
logger.debug("Preprocessing")
dataset, training_set_metadata = preprocess_for_prediction( # TODO (Connor): Refactor to use self.config_obj
self.config_obj.to_dict(),
dataset=dataset,
training_set_metadata=self.training_set_metadata,
data_format=data_format,
split=split,
include_outputs=False,
)
logger.debug("Predicting")
with self.backend.create_predictor(self.model, batch_size=batch_size) as predictor:
activations = predictor.batch_collect_activations(
layer_names,
dataset,
)
return activations
def preprocess(
self,
dataset: str | dict | pd.DataFrame | None = None,
training_set: str | dict | pd.DataFrame | None = None,
validation_set: str | dict | pd.DataFrame | None = None,
test_set: str | dict | pd.DataFrame | None = None,
training_set_metadata: str | dict | None = None,
data_format: str | None = None,
skip_save_processed_input: bool = True,
random_seed: int = default_random_seed,
**kwargs,
) -> PreprocessedDataset:
"""This function is used to preprocess data.
# Args:
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the experiment.
If it has a split column, it will be used for splitting
(0 for train, 1 for validation, 2 for test),
otherwise the dataset will be randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`,
`'feather'`, `'fwf'`,
`'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`),
`'json'`, `'jsonl'`, `'parquet'`,
`'pickle'` (pickled Pandas DataFrame),
`'sas'`, `'spss'`, `'stata'`, `'tsv'`.
:param skip_save_processed_input: (bool, default: `True`) if an input
dataset is provided, it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param random_seed: (int, default: `42`) a random seed that will be
used anywhere there is a call to a random number generator: data
splitting, parameter initialization and training set shuffling
# Returns:
:return: (PreprocessedDataset) data structure containing
`(proc_training_set, proc_validation_set, proc_test_set, training_set_metadata)`.
# Raises:
RuntimeError: An error occurred while preprocessing the data. Examples include training dataset
being empty after preprocessing, lazy loading not being supported with RayBackend, etc.
"""
print_boxed("PREPROCESSING")
for callback in self.callbacks:
callback.on_preprocess_start(self.config_obj.to_dict())
preprocessing_params = get_preprocessing_params(self.config_obj)
proc_training_set = proc_validation_set = proc_test_set = None
try:
with provision_preprocessing_workers(self.backend):
# TODO (Connor): Refactor to use self.config_obj
preprocessed_data = preprocess_for_training(
self.config_obj.to_dict(),
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=self.backend,
random_seed=random_seed,
callbacks=self.callbacks,
)
proc_training_set, proc_validation_set, proc_test_set, training_set_metadata = preprocessed_data
return PreprocessedDataset(proc_training_set, proc_validation_set, proc_test_set, training_set_metadata)
except Exception as e:
raise RuntimeError(f"Caught exception during model preprocessing: {str(e)}") from e
finally:
for callback in self.callbacks:
callback.on_preprocess_end(proc_training_set, proc_validation_set, proc_test_set, training_set_metadata)
@staticmethod
def load(
model_dir: str,
logging_level: int = logging.ERROR,
backend: Backend | str | None = None,
gpus: str | int | list[int] | None = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] | None = None,
from_checkpoint: bool = False,
) -> "LudwigModel": # return is an instance of ludwig.api.LudwigModel class
"""This function allows for loading pretrained models.
# Inputs
:param model_dir: (str) path to the directory containing the model.
If the model was trained by the `train` or `experiment` command,
the model is in `results_dir/experiment_dir/model`.
:param logging_level: (int, default: 40) log level that will be sent to
stderr.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param gpus: (Union[str, int, List[int]], default: `None`) GPUs
to use (same syntax as CUDA_VISIBLE_DEVICES).
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow Torch
to use multithreading parallelism to improve performance at the
cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param from_checkpoint: (bool, default: `False`) if `True`, the model
will be loaded from the latest checkpoint (training_checkpoints/)
instead of the final model weights.
# Return
:return: (LudwigModel) a LudwigModel object
# Example usage
```python
ludwig_model = LudwigModel.load(model_dir)
```
"""
# Initialize PyTorch before calling `broadcast()` to prevent initializing
# Torch with default parameters
backend_param = backend
backend = initialize_backend(backend)
backend.initialize_pytorch(
gpus=gpus, gpu_memory_limit=gpu_memory_limit, allow_parallel_threads=allow_parallel_threads
)
config = backend.broadcast_return(lambda: load_json(os.path.join(model_dir, MODEL_HYPERPARAMETERS_FILE_NAME)))
# Upgrades deprecated fields and adds new required fields in case the config loaded from disk is old.
config_obj = ModelConfig.from_dict(config)
# Ensure that the original backend is used if it was specified in the config and user requests it
if backend_param is None and "backend" in config:
# Reset backend from config
backend = initialize_backend(config.get("backend"))
# initialize model
ludwig_model = LudwigModel(
config_obj.to_dict(),
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
# generate model from config
set_saved_weights_in_checkpoint_flag(config_obj)
ludwig_model.model = LudwigModel.create_model(config_obj)
# load model weights
ludwig_model.load_weights(model_dir, from_checkpoint)
# If merge_and_unload was NOT performed before saving (i.e., adapter weights exist),
# we need to merge them now for inference.
if ludwig_model.is_merge_and_unload_set():
weights_save_path = os.path.join(model_dir, MODEL_WEIGHTS_FILE_NAME)
adapter_config_path = os.path.join(weights_save_path, "adapter_config.json")
if os.path.exists(adapter_config_path):
ludwig_model.model.merge_and_unload(progressbar=config_obj.adapter.postprocessor.progressbar)
# load train set metadata
ludwig_model.training_set_metadata = backend.broadcast_return(
lambda: load_metadata(os.path.join(model_dir, TRAIN_SET_METADATA_FILE_NAME))
)
return ludwig_model
def load_weights(
self,
model_dir: str,
from_checkpoint: bool = False,
) -> None:
"""Loads weights from a pre-trained model.
# Inputs
:param model_dir: (str) filepath string to location of a pre-trained
model
:param from_checkpoint: (bool, default: `False`) if `True`, the model
will be loaded from the latest checkpoint (training_checkpoints/)
instead of the final model weights.
# Return
:return: `None`
# Example usage
```python
ludwig_model.load_weights(model_dir)
```
"""
if self.backend.is_coordinator():
if from_checkpoint:
with self.backend.create_trainer(
model=self.model,
config=self.config_obj.trainer,
) as trainer:
checkpoint = trainer.create_checkpoint_handle()
training_checkpoints_path = os.path.join(model_dir, TRAINING_CHECKPOINTS_DIR_PATH)
trainer.resume_weights_and_optimizer(training_checkpoints_path, checkpoint)
else:
self.model.load(model_dir)
self.backend.sync_model(self.model)
def save(self, save_path: str) -> None:
"""Saves the model to disk.
# Inputs
:param save_path: (str) path to the directory where the model is
going to be saved. Both a JSON file containing the model
architecture hyperparameters and checkpoints files containing
model weights will be saved.
# Return
:return: (None) `None`
# Example usage
```python
ludwig_model.save(save_path)
```
"""
self._check_initialization()
# save config
self.save_config(save_path)
# save model weights
self.model.save(save_path)
# save training set metadata
training_set_metadata_path = os.path.join(save_path, TRAIN_SET_METADATA_FILE_NAME)
save_json(training_set_metadata_path, self.training_set_metadata)
@staticmethod
def upload_to_hf_hub(
repo_id: str,
model_path: str,
repo_type: str = "model",
private: bool = False,
commit_message: str = "Upload trained [Ludwig](https://ludwig.ai/latest/) model weights",
commit_description: str | None = None,
) -> bool:
"""Uploads trained model artifacts to the HuggingFace Hub.
# Inputs
:param repo_id: (`str`)
A namespace (user or an organization) and a repo name separated
by a `/`.
:param model_path: (`str`)
The path of the saved model. This is either (a) the folder where
the 'model_weights' folder and the 'model_hyperparameters.json' file
are stored, or (b) the parent of that folder.
:param private: (`bool`, *optional*, defaults to `False`)
Whether the model repo should be private.
:param repo_type: (`str`, *optional*)
Set to `"dataset"` or `"space"` if uploading to a dataset or
space, `None` or `"model"` if uploading to a model. Default is
`"model"`.
:param commit_message: (`str`, *optional*)
The summary / title / first line of the generated commit. Defaults to:
`"Upload trained [Ludwig](https://ludwig.ai/latest/) model weights"`
:param commit_description: (`str`, *optional*)
The description of the generated commit
# Returns
:return: (bool) True for success, False for failure.
"""
if os.path.exists(os.path.join(model_path, MODEL_FILE_NAME, MODEL_WEIGHTS_FILE_NAME)) and os.path.exists(
os.path.join(model_path, MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME)
):
experiment_path = model_path
elif os.path.exists(os.path.join(model_path, MODEL_WEIGHTS_FILE_NAME)) and os.path.exists(
os.path.join(model_path, MODEL_HYPERPARAMETERS_FILE_NAME)
):
experiment_path = os.path.dirname(model_path)
else:
raise ValueError(
f"Can't find 'model_weights' and '{MODEL_HYPERPARAMETERS_FILE_NAME}' either at "
f"'{model_path}' or at '{model_path}/model'"
)
model_service = get_upload_registry()["hf_hub"]
hub: HuggingFaceHub = model_service()
hub.login()
upload_status: bool = hub.upload(
repo_id=repo_id,
model_path=experiment_path,
repo_type=repo_type,
private=private,
commit_message=commit_message,
commit_description=commit_description,
)
return upload_status
def save_config(self, save_path: str) -> None:
"""Save config to specified location.
# Inputs
:param save_path: (str) filepath string to save config as a
JSON file.
# Return
:return: `None`
"""
os.makedirs(save_path, exist_ok=True)
model_hyperparameters_path = os.path.join(save_path, MODEL_HYPERPARAMETERS_FILE_NAME)
save_json(model_hyperparameters_path, self.config_obj.to_dict())
def to_torchscript(
self,
model_only: bool = False,
device: TorchDevice | None = None,
):
"""Converts the trained model to Torchscript.
# Inputs
:param model_only (bool, optional): If True, only the ECD model will be converted to Torchscript. Else,
preprocessing and postprocessing steps will also be converted to Torchscript.
:param device (TorchDevice, optional): If None, the model will be converted to Torchscript on the same device to
ensure maximum model parity.
# Returns
:return: A torch.jit.ScriptModule that can be used to predict on a dictionary of inputs.
"""
if device is None:
device = DEVICE
self._check_initialization()
if model_only:
return self.model.to_torchscript(device)
else:
inference_module = InferenceModule.from_ludwig_model(
self.model, self.config_obj.to_dict(), self.training_set_metadata, device=device
)
return torch.jit.script(inference_module)
def save_torchscript(
self,
save_path: str,
model_only: bool = False,
device: TorchDevice | None = None,
):
"""Saves the Torchscript model to disk.
# Inputs
:param save_path (str): The path to the directory where the model will be saved.
:param model_only (bool, optional): If True, only the ECD model will be converted to Torchscript. Else, the
preprocessing and postprocessing steps will also be converted to Torchscript.
:param device (TorchDevice, optional): If None, the model will be converted to Torchscript on the same device to
ensure maximum model parity.
# Return
:return: `None`
"""
if device is None:
device = DEVICE
save_ludwig_model_for_inference(
save_path,
self.model,
self.config_obj.to_dict(),
self.training_set_metadata,
model_only=model_only,
device=device,
)
def _check_initialization(self):
if self.model is None or self._user_config is None or self.training_set_metadata is None:
raise ValueError("Model has not been trained or loaded")
def free_gpu_memory(self):
"""Manually moves the model to CPU to force GPU memory to be freed.
For more context: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/35
"""
if torch.cuda.is_available():
self.model.model.to(torch.device("cpu"))
torch.cuda.empty_cache()
@staticmethod
def create_model(config_obj: ModelConfig | dict, random_seed: int = default_random_seed) -> BaseModel:
"""Instantiates BaseModel object.
# Inputs
:param config_obj: (Union[Config, dict]) Ludwig config object
:param random_seed: (int, default: ludwig default random seed) Random seed used for weights initialization,
splits and any other random function.
# Return
:return: (ludwig.models.BaseModel) Instance of the Ludwig model object.
"""
if isinstance(config_obj, dict):
config_obj = ModelConfig.from_dict(config_obj)
model_type = get_from_registry(config_obj.model_type, model_type_registry)
return model_type(config_obj, random_seed=random_seed)
@staticmethod
def set_logging_level(logging_level: int) -> None:
"""Sets level for log messages.
# Inputs
:param logging_level: (int) Set/Update the logging level. Use logging
constants like `logging.DEBUG` , `logging.INFO` and `logging.ERROR`.
# Return
:return: `None`
"""
logging.getLogger("ludwig").setLevel(logging_level)
if logging_level in {logging.WARNING, logging.ERROR, logging.CRITICAL}:
set_disable_progressbar(True)
else:
set_disable_progressbar(False)
@property
def config(self) -> ModelConfigDict:
"""Returns the fully-rendered config of this model including default values."""
return self.config_obj.to_dict()
@config.setter
def config(self, user_config: ModelConfigDict):
"""Updates the config of this model.
WARNING: this can have unexpected results on an already trained model.
"""
self._user_config = user_config
self.config_obj = ModelConfig.from_dict(self._user_config)
def is_merge_and_unload_set(self) -> bool:
"""Check whether the encapsulated model is of type LLM and is configured to merge_and_unload QLoRA weights.
# Return
:return (bool): whether merge_and_unload should be done.
"""
# TODO: In the future, it may be possible to move up the model type check into the BaseModel class.
return self.config_obj.model_type == MODEL_LLM and self.model.is_merge_and_unload_set()
@PublicAPI
def kfold_cross_validate(
num_folds: int,
config: dict | str,
dataset: str = None,
data_format: str = None,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = False,
skip_save_predictions: bool = False,
skip_save_eval_stats: bool = False,
skip_collect_predictions: bool = False,
skip_collect_overall_stats: bool = False,
output_directory: str = "results",
random_seed: int = default_random_seed,
gpus: str | int | list[int] | None = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
backend: Backend | str | None = None,
logging_level: int = logging.INFO,
**kwargs,
) -> tuple[dict, dict]:
"""Performs k-fold cross validation and returns result data structures.
# Inputs
:param num_folds: (int) number of folds to create for the cross-validation
:param config: (Union[dict, str]) model specification
required to build a model. Parameter may be a dictionary or string
specifying the file path to a yaml configuration file. Refer to the
[User Guide](http://ludwig.ai/user_guide/#model-config)
for details.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used for k_fold processing.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`,
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`. Currently `hdf5` format is not supported for
k_fold cross validation.
:param skip_save_training_description: (bool, default: `False`) disables
saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`) disables
saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch to enable resuming of training, but if
the model is really big that can be time consuming and will use
twice as much space. Use this parameter to skip it, but training
cannot be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving TensorBoard
logs. By default Ludwig saves logs for the TensorBoard, but if it
is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `False`) if an input
dataset is provided, it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files.
:param skip_save_eval_stats: (bool, default: `False`) skips saving test
statistics JSON file.
:param skip_collect_predictions: (bool, default: `False`) skips collecting
post-processed predictions during eval.
:param skip_collect_overall_stats: (bool, default: `False`) skips collecting
overall stats during eval.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param random_seed: (int, default: `42`) Random seed
used for weights initialization,
splits and any other random function.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow Torch to
use multithreading parallelism
to improve performance at the cost of determinism.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param logging_level: (int, default: INFO) log level to send to stderr.
# Return
:return: (tuple[dict, dict]) a tuple of dictionaries
`(kfold_cv_statistics, kfold_split_indices)`. `kfold_cv_statistics`
contains metrics from the cv run. `kfold_split_indices` contains the
indices used to split the data into training fold and test fold.
"""
# if config is a path, convert to dictionary
if isinstance(config, str): # assume path
config = load_yaml(config)
backend = initialize_backend(backend or config.get("backend"))
# check for num_folds
if num_folds is None:
raise ValueError("num_folds parameter must be specified")
logger.info(f"starting {num_folds:d}-fold cross validation")
# create output_directory if not available
if not os.path.isdir(output_directory):
os.mkdir(output_directory)
# prepare data for k-fold processing
# use Ludwig's utility to facilitate creating a dataframe
# that is used as the basis for creating folds
dataset, _, _, _ = load_dataset_uris(dataset, None, None, None, backend)
# determine data format of provided dataset
if not data_format or data_format == "auto":
data_format = figure_data_format(dataset)
data_df = load_dataset(dataset, data_format=data_format, df_lib=backend.df_engine.df_lib)
kfold_cv_stats = {}
kfold_split_indices = {}
for train_indices, test_indices, fold_num in generate_kfold_splits(data_df, num_folds, random_seed):
with tempfile.TemporaryDirectory() as temp_dir_name:
curr_train_df = data_df.iloc[train_indices]
curr_test_df = data_df.iloc[test_indices]
kfold_split_indices["fold_" + str(fold_num)] = {
"training_indices": train_indices,
"test_indices": test_indices,
}
# train and validate model on this fold
logger.info(f"training on fold {fold_num:d}")
model = LudwigModel(
config=config,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
)
eval_stats, train_stats, preprocessed_data, output_directory = model.experiment(
training_set=curr_train_df,
test_set=curr_test_df,
experiment_name="cross_validation",
model_name="fold_" + str(fold_num),
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
skip_collect_predictions=skip_collect_predictions,
skip_collect_overall_stats=skip_collect_overall_stats,
output_directory=os.path.join(temp_dir_name, "results"),
random_seed=random_seed,
)
# augment the training statistics with scoring metric from
# the hold out fold
if dataclasses.is_dataclass(train_stats):
train_stats_dict = dataclasses.asdict(train_stats)
elif hasattr(train_stats, "to_dict"):
train_stats_dict = train_stats.to_dict()
else:
train_stats_dict = vars(train_stats)
train_stats_dict["fold_eval_stats"] = eval_stats
# collect training statistics for this fold
kfold_cv_stats["fold_" + str(fold_num)] = train_stats_dict
# consolidate raw fold metrics across all folds
raw_kfold_stats = {}
for fold_name in kfold_cv_stats:
curr_fold_eval_stats = kfold_cv_stats[fold_name]["fold_eval_stats"]
for of_name in curr_fold_eval_stats:
if of_name not in raw_kfold_stats:
raw_kfold_stats[of_name] = {}
fold_eval_stats_of = curr_fold_eval_stats[of_name]
for metric in fold_eval_stats_of:
if metric not in {
"predictions",
"probabilities",
"confusion_matrix",
"overall_stats",
"per_class_stats",
"roc_curve",
"precision_recall_curve",
}:
if metric not in raw_kfold_stats[of_name]:
raw_kfold_stats[of_name][metric] = []
raw_kfold_stats[of_name][metric].append(fold_eval_stats_of[metric])
# calculate overall kfold statistics
overall_kfold_stats = {}
for of_name in raw_kfold_stats:
overall_kfold_stats[of_name] = {}
for metric in raw_kfold_stats[of_name]:
mean = np.mean(raw_kfold_stats[of_name][metric])
std = np.std(raw_kfold_stats[of_name][metric])
overall_kfold_stats[of_name][metric + "_mean"] = mean
overall_kfold_stats[of_name][metric + "_std"] = std
kfold_cv_stats["overall"] = overall_kfold_stats
logger.info(f"completed {num_folds:d}-fold cross validation")
return kfold_cv_stats, kfold_split_indices
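# The consolidation step at the end of kfold_cross_validate can be sketched in isolation.
# This is a minimal, standalone illustration, not the Ludwig implementation: the names
# `summarize_folds` and `NON_SCALAR_KEYS` are hypothetical, and it uses the standard
# library's `statistics` module instead of NumPy. Only the excluded keys and the
# "_mean" / "_std" suffixes are taken from the function above.

```python
from statistics import fmean, pstdev

# Keys that hold non-scalar results and are skipped when averaging (from the code above).
NON_SCALAR_KEYS = {
    "predictions",
    "probabilities",
    "confusion_matrix",
    "overall_stats",
    "per_class_stats",
    "roc_curve",
    "precision_recall_curve",
}


def summarize_folds(fold_eval_stats: dict) -> dict:
    """Aggregate per-fold scalar metrics into a mean and std per output feature."""
    raw = {}
    for stats in fold_eval_stats.values():
        for of_name, metrics in stats.items():
            for metric, value in metrics.items():
                if metric in NON_SCALAR_KEYS:
                    continue
                raw.setdefault(of_name, {}).setdefault(metric, []).append(value)

    overall = {}
    for of_name, metrics in raw.items():
        overall[of_name] = {}
        for metric, values in metrics.items():
            overall[of_name][metric + "_mean"] = fmean(values)
            # pstdev is the population standard deviation, like np.std with its default ddof=0.
            overall[of_name][metric + "_std"] = pstdev(values)
    return overall


# Made-up fold results for a hypothetical output feature called "label".
folds = {
    "fold_1": {"label": {"accuracy": 0.8, "confusion_matrix": [[1, 0], [0, 1]]}},
    "fold_2": {"label": {"accuracy": 0.9, "confusion_matrix": [[1, 0], [0, 1]]}},
}
summary = summarize_folds(folds)
```

# In this sketch, `summary["label"]` holds only `accuracy_mean` and `accuracy_std`;
# the confusion matrix is dropped because it is not a scalar metric.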
def _get_compute_description(backend) -> dict:
"""Returns the compute description for the backend."""
compute_description = {"num_nodes": backend.num_nodes}
if torch.cuda.is_available():
# Assumption: All nodes are of the same instance type.
# TODO: fix for Ray where workers may be of different skus
compute_description.update(
{
"gpus_per_node": torch.cuda.device_count(),
"arch_list": torch.cuda.get_arch_list(),
"gencode_flags": torch.cuda.get_gencode_flags(),
"devices": {},
}
)
for i in range(torch.cuda.device_count()):
compute_description["devices"][i] = {
"gpu_type": torch.cuda.get_device_name(i),
"device_capability": torch.cuda.get_device_capability(i),
"device_properties": str(torch.cuda.get_device_properties(i)),
}
return compute_description
@PublicAPI
def get_experiment_description(
config,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
backend=None,
random_seed=None,
):
description = OrderedDict()
description["ludwig_version"] = LUDWIG_VERSION
description["command"] = " ".join(sys.argv)
commit_hash = get_commit_hash()
if commit_hash is not None:
description["commit_hash"] = commit_hash[:12]
if random_seed is not None:
description["random_seed"] = random_seed
if isinstance(dataset, str):
description["dataset"] = dataset
if isinstance(training_set, str):
description["training_set"] = training_set
if isinstance(validation_set, str):
description["validation_set"] = validation_set
if isinstance(test_set, str):
description["test_set"] = test_set
if training_set_metadata is not None:
description["training_set_metadata"] = training_set_metadata
# determine data format if not provided or auto
if not data_format or data_format == "auto":
data_format = figure_data_format(dataset, training_set, validation_set, test_set)
if data_format:
description["data_format"] = str(data_format)
description["config"] = config
description["torch_version"] = torch.__version__
description["compute"] = _get_compute_description(backend)
return description
================================================
FILE: ludwig/api_annotations.py
================================================
def PublicAPI(*args, **kwargs):
"""Annotation for documenting public APIs. Public APIs are classes and methods exposed to end users of Ludwig.
If stability="stable", the APIs will remain backwards compatible across minor Ludwig releases
(e.g., Ludwig 0.6 -> Ludwig 0.7).
If stability="experimental", the APIs can be used by advanced users who are tolerant to and expect
breaking changes. This will likely be seen in the case of incremental new feature development.
Args:
stability: One of {"stable", "experimental"}
Examples:
>>> from api_annotations import PublicAPI
>>> @PublicAPI
... def func1(x):
... return x
>>> @PublicAPI(stability="experimental")
... def func2(y):
... return y
"""
if len(args) == 1 and len(kwargs) == 0 and callable(args[0]):
return PublicAPI(stability="stable")(args[0])
if "stability" in kwargs:
stability = kwargs["stability"]
assert stability in ["stable", "experimental"], stability
elif kwargs:
raise ValueError(f"Unknown kwargs: {kwargs.keys()}")
else:
stability = "stable"
def wrap(obj):
if stability == "experimental":
message = f"PublicAPI ({stability}): This API is {stability} and may change before becoming stable."
else:
message = "PublicAPI: This API is stable across Ludwig releases."
_append_doc(obj, message=message)
_mark_annotated(obj)
return obj
return wrap
def DeveloperAPI(*args, **kwargs):
"""Annotation for documenting developer APIs. Developer APIs are lower-level methods explicitly exposed to
advanced Ludwig users and library developers. Their interfaces may change across minor Ludwig releases
(e.g., Ludwig 0.6.1 and Ludwig 0.6.2).
Examples:
>>> from api_annotations import DeveloperAPI
>>> @DeveloperAPI
... def func(x):
... return x
"""
if len(args) == 1 and len(kwargs) == 0 and callable(args[0]):
return DeveloperAPI()(args[0])
def wrap(obj):
_append_doc(obj, message="DeveloperAPI: This API may change across minor Ludwig releases.")
_mark_annotated(obj)
return obj
return wrap
def Deprecated(*args, **kwargs):
"""Annotation for documenting a deprecated API. Deprecated APIs may be removed in future releases of Ludwig
(e.g., Ludwig 0.7 to Ludwig 0.8).
Args:
message: A message to help users understand the reason for the deprecation, and provide a migration path.
Examples:
>>> from api_annotations import Deprecated
>>> @Deprecated
... def func(x):
... return x
>>> @Deprecated(message="g() is deprecated because the API is error prone. Please call h() instead.")
... def g(y):
... return y
"""
if len(args) == 1 and len(kwargs) == 0 and callable(args[0]):
return Deprecated()(args[0])
message = "**DEPRECATED:** This API is deprecated and may be removed in a future Ludwig release."
if "message" in kwargs:
message += " " + kwargs["message"]
del kwargs["message"]
if kwargs:
raise ValueError(f"Unknown kwargs: {kwargs.keys()}")
def inner(obj):
_append_doc(obj, message=message, directive="warning")
_mark_annotated(obj)
return obj
return inner
def _append_doc(obj, message: str, directive: str | None = None) -> None:
"""
Args:
message: An additional message to append to the end of docstring for a class
or method that uses one of the API annotations
directive: A shorter message that provides contexts for the message and indents it.
For example, this could be something like 'warning' or 'info'.
"""
if not obj.__doc__:
obj.__doc__ = ""
obj.__doc__ = obj.__doc__.rstrip()
indent = _get_indent(obj.__doc__)
obj.__doc__ += "\n\n"
if directive is not None:
obj.__doc__ += f"{' ' * indent}.. {directive}::\n"
obj.__doc__ += f"{' ' * (indent + 4)}{message}"
else:
obj.__doc__ += f"{' ' * indent}{message}"
obj.__doc__ += f"\n{' ' * indent}"
def _mark_annotated(obj) -> None:
# Set magic token for check_api_annotations linter.
if hasattr(obj, "__name__"):
obj._annotated = obj.__name__
def _is_annotated(obj) -> bool:
# Check the magic token exists and applies to this class (not a subclass).
return hasattr(obj, "_annotated") and obj._annotated == obj.__name__
def _get_indent(docstring: str) -> int:
"""
Example:
>>> def f():
... '''Docstring summary.'''
>>> f.__doc__
'Docstring summary.'
>>> _get_indent(f.__doc__)
0
>>> def g(foo):
... '''Docstring summary.
...
... Args:
... foo: Does bar.
... '''
>>> g.__doc__
'Docstring summary.\\n\\n Args:\\n foo: Does bar.\\n '
>>> _get_indent(g.__doc__)
4
>>> class A:
... def h():
... '''Docstring summary.
...
... Returns:
... None.
... '''
>>> A.h.__doc__
'Docstring summary.\\n\\n Returns:\\n None.\\n '
>>> _get_indent(A.h.__doc__)
8
"""
if not docstring:
return 0
non_empty_lines = list(filter(bool, docstring.splitlines()))
if len(non_empty_lines) == 1:
# Docstring contains summary only.
return 0
# The docstring summary isn't indented, so check the indentation of the second non-empty line.
return len(non_empty_lines[1]) - len(non_empty_lines[1].lstrip())
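# The annotations in this file share one pattern: a decorator that works both bare
# (`@PublicAPI`) and with keyword arguments (`@PublicAPI(stability="experimental")`).
# The sketch below reproduces that pattern in a self-contained form. The decorator
# `tagged` and its `label` argument are hypothetical names used for illustration only,
# and the docstring handling is simplified (no indentation logic).

```python
def tagged(*args, **kwargs):
    """Hypothetical decorator usable as @tagged or @tagged(label="...")."""
    if len(args) == 1 and not kwargs and callable(args[0]):
        # Bare use: Python passes the decorated object directly as the only argument.
        return tagged()(args[0])

    label = kwargs.pop("label", "stable")
    if kwargs:
        raise ValueError(f"Unknown kwargs: {list(kwargs)}")

    def wrap(obj):
        # Append a note to the docstring and return the object unchanged otherwise.
        doc = (obj.__doc__ or "").rstrip()
        obj.__doc__ = f"{doc}\n\nTagged: {label}"
        return obj

    return wrap


@tagged
def f(x):
    """Return x."""
    return x


@tagged(label="experimental")
def g(y):
    """Return y."""
    return y
```

# Both forms return the original callable, so `f(3)` still returns 3; only `__doc__` changes.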
================================================
FILE: ludwig/automl/__init__.py
================================================
from ludwig.automl.automl import auto_train # noqa
from ludwig.automl.automl import cli_init_config # noqa
from ludwig.automl.automl import create_auto_config # noqa
from ludwig.automl.automl import train_with_config # noqa
================================================
FILE: ludwig/automl/auto_tune_config.py
================================================
import copy
import logging
import math
from collections import OrderedDict
import psutil
try:
import GPUtil
except ImportError:
raise ImportError("GPUtil is not installed. In order to use auto_train please run pip install ludwig[ray]")
from ludwig.api import LudwigModel
from ludwig.backend import initialize_backend
from ludwig.constants import (
AUTO,
AUTOML_DEFAULT_TEXT_ENCODER,
AUTOML_LARGE_TEXT_DATASET,
AUTOML_MAX_ROWS_PER_CHECKPOINT,
AUTOML_SMALLER_TEXT_ENCODER,
AUTOML_SMALLER_TEXT_LENGTH,
AUTOML_TEXT_ENCODER_MAX_TOKEN_LEN,
HYPEROPT,
MINIMUM_BATCH_SIZE,
PREPROCESSING,
SPACE,
TEXT,
TRAINER,
)
from ludwig.data.preprocessing import preprocess_for_training
from ludwig.features.feature_registries import update_config_with_metadata
from ludwig.schema.model_config import ModelConfig
from ludwig.utils.automl.utils import get_model_type
from ludwig.utils.torch_utils import initialize_pytorch
logger = logging.getLogger(__name__)
# Maps each modifiable search-space parameter to the minimum permissible value for its range.
# Parameters are listed per model type in the order in which they are reduced.
RANKED_MODIFIABLE_PARAM_LIST = {
"tabnet": OrderedDict(
{
"trainer.batch_size": 32,
"combiner.size": 8,
"combiner.output_size": 8,
}
),
"concat": OrderedDict(
{
"trainer.batch_size": 32,
"combiner.output_size": 64,
"combiner.num_fc_layers": 1,
}
),
"tabtransformer": OrderedDict(
{
"trainer.batch_size": 32,
"combiner.num_heads": 4,
"combiner.output_size": 8,
"combiner.num_layers": 4,
"combiner.num_fc_layers": 1,
}
),
"text": OrderedDict( # for single input feature text models e.g. bert and its variants
{
"trainer.batch_size": 16,
}
),
}
BYTES_PER_MiB = 1048576
BYTES_PER_WEIGHT = 4 # assumes 32-bit precision = 4 bytes
BYTES_OPTIMIZER_PER_WEIGHT = 8 # for optimizer m and v vectors
def get_trainingset_metadata(config, dataset, backend):
_, _, _, training_set_metadata = preprocess_for_training(
config, dataset=dataset, preprocessing_params=config[PREPROCESSING], backend=backend
)
return training_set_metadata
# Note: when run in a Ray cluster, this function runs remotely, with GPU resources requested if available
def _get_machine_memory():
if GPUtil.getGPUs():
machine_mem = GPUtil.getGPUs()[0].memoryTotal * BYTES_PER_MiB
else:
machine_mem = psutil.virtual_memory().total
return machine_mem
def _get_text_feature_max_length(config, training_set_metadata) -> int:
"""Returns max sequence length over text features, subject to preprocessing limit."""
max_length = 0
for feature in config["input_features"]:
if feature["type"] == TEXT:
feature_max_len = training_set_metadata[feature["name"]]["max_sequence_length"]
if feature_max_len > max_length:
max_length = feature_max_len
if (
("preprocessing" in config)
and (TEXT in config["preprocessing"])
and ("max_sequence_length" in config["preprocessing"][TEXT])
):
limit = config["preprocessing"][TEXT]["max_sequence_length"]
else:
limit = 256 # Preprocessing default max_sequence_length = 256
if max_length > limit + 2: # For start and stop symbols.
max_length = limit + 2
return max_length
def _get_text_model_memory_usage(config, training_set_metadata, memory_usage) -> int:
max_feature_token_length = _get_text_feature_max_length(config, training_set_metadata)
memory_usage = (memory_usage / AUTOML_TEXT_ENCODER_MAX_TOKEN_LEN) * max_feature_token_length
return memory_usage
def compute_memory_usage(config_obj, training_set_metadata, model_category) -> int:
update_config_with_metadata(config_obj, training_set_metadata)
lm = LudwigModel.create_model(config_obj)
model_size = lm.get_model_size() # number of parameters in model
batch_size = config_obj.trainer.batch_size
if batch_size == AUTO:
# Smallest valid batch size that will allow training to complete
batch_size = MINIMUM_BATCH_SIZE
memory_usage = model_size * (BYTES_PER_WEIGHT + BYTES_OPTIMIZER_PER_WEIGHT) * batch_size
if model_category == TEXT:
return _get_text_model_memory_usage(config_obj.to_dict(), training_set_metadata, memory_usage)
else:
return memory_usage
def sub_new_params(config: dict, new_param_vals: dict):
new_config = copy.deepcopy(config)
for param, val in new_param_vals.items():
config_section = param.split(".")[0]
param_name = param.split(".")[1]
new_config[config_section][param_name] = val
return new_config
def get_new_params(current_param_values, hyperparam_search_space, params_to_modify):
for param, _ in params_to_modify.items():
if param in hyperparam_search_space:
if hyperparam_search_space[param][SPACE] == "choice":
current_param_values[param] = hyperparam_search_space[param]["categories"][-1]
else:
current_param_values[param] = hyperparam_search_space[param]["upper"]
return current_param_values
def _update_text_encoder(input_features: list, old_text_encoder: str, new_text_encoder: str) -> None:
for feature in input_features:
if feature["type"] == TEXT and feature["encoder"] == old_text_encoder:
feature["encoder"] = new_text_encoder
def _get_text_feature_min_usable_length(input_features: list, training_set_metadata) -> int:
"""Returns min of AUTOML_SMALLER_TEXT_LENGTH and lowest 99th percentile sequence length over text features."""
min_usable_length = AUTOML_SMALLER_TEXT_LENGTH
for feature in input_features:
if feature["type"] == TEXT:
feature_99ptile_len = training_set_metadata[feature["name"]]["max_sequence_length_99ptile"]
if feature_99ptile_len < min_usable_length:
min_usable_length = feature_99ptile_len
return round(min_usable_length)
def reduce_text_feature_max_length(config, training_set_metadata) -> bool:
"""Reduce max sequence length, when viable, to control its quadratic impact."""
input_features = config["input_features"]
min_usable_length = _get_text_feature_min_usable_length(input_features, training_set_metadata)
seq_len_limit = {"max_sequence_length": min_usable_length}
if "preprocessing" not in config:
config["preprocessing"] = {TEXT: seq_len_limit}
elif (
(TEXT not in config["preprocessing"])
or ("max_sequence_length" not in config["preprocessing"][TEXT])
or (min_usable_length < float(config["preprocessing"][TEXT]["max_sequence_length"]))
):
config["preprocessing"][TEXT] = seq_len_limit
else:
return False
return True
# For hyperparam_search_space comprised solely of choice spaces, compute maximum number of
# combinations and return that value if it is less than num_samples; else return num_samples.
def _update_num_samples(num_samples, hyperparam_search_space):
max_num_samples = 1
for param in hyperparam_search_space.keys():
if hyperparam_search_space[param][SPACE] == "choice":
max_num_samples *= len(hyperparam_search_space[param]["categories"])
else:
return num_samples
if max_num_samples < num_samples:
return max_num_samples
return num_samples
# Note: when run in a Ray cluster, this function runs remotely, with GPU resources requested if available
def memory_tune_config(config, dataset, model_category, row_count, backend):
backend = initialize_backend(backend)
fits_in_memory = False
tried_reduce_seq_len = False
config_obj = ModelConfig.from_dict(config)
raw_config = config_obj.to_dict()
training_set_metadata = get_trainingset_metadata(raw_config, dataset, backend)
modified_hyperparam_search_space = copy.deepcopy(raw_config[HYPEROPT]["parameters"])
current_param_values = {}
param_list = []
model_type = get_model_type(raw_config)
if model_type in RANKED_MODIFIABLE_PARAM_LIST:
params_to_modify = RANKED_MODIFIABLE_PARAM_LIST[model_type]
if len(params_to_modify.keys()) > 0:
param_list = list(params_to_modify.keys())
max_memory = _get_machine_memory()
initialize_pytorch()
while param_list:
# compute memory utilization
current_param_values = get_new_params(current_param_values, modified_hyperparam_search_space, params_to_modify)
temp_config = sub_new_params(raw_config, current_param_values)
config_obj = ModelConfig.from_dict(temp_config)
mem_use = compute_memory_usage(config_obj, training_set_metadata, model_category)
if mem_use > max_memory and model_category == TEXT and not tried_reduce_seq_len:
tried_reduce_seq_len = True
if reduce_text_feature_max_length(config, training_set_metadata):
reduce_text_feature_max_length(temp_config, training_set_metadata)
config_obj = ModelConfig.from_dict(temp_config)
mem_use = compute_memory_usage(config_obj, training_set_metadata, model_category)
logger.info(f"Checking model estimated mem use {mem_use} against memory size {max_memory}")
if mem_use <= max_memory:
fits_in_memory = True
break
# check if we have exhausted tuning of current param (e.g. we can no longer reduce the param value)
param, min_value = param_list[0], params_to_modify[param_list[0]]
if param in modified_hyperparam_search_space.keys():
param_space = modified_hyperparam_search_space[param]["space"]
if param_space == "choice":
if (
len(modified_hyperparam_search_space[param]["categories"]) >= 2
and modified_hyperparam_search_space[param]["categories"][-2] >= min_value
):
modified_hyperparam_search_space[param]["categories"] = modified_hyperparam_search_space[param][
"categories"
][:-1]
else:
param_list.pop(0) # exhausted reduction of this parameter
else:
# lower the upper bound by 10% of the current (upper - lower) range
upper_bound, lower_bound = (
modified_hyperparam_search_space[param]["upper"],
modified_hyperparam_search_space[param]["lower"],
)
reduction_val = (upper_bound - lower_bound) * 0.1
new_upper_bound = upper_bound - reduction_val
if (new_upper_bound) > lower_bound and new_upper_bound > min_value:
modified_hyperparam_search_space[param]["upper"] = new_upper_bound
else:
param_list.pop(0) # exhausted reduction of this parameter
else:
param_list.pop(0) # param not in hyperopt search space
if model_category == TEXT and row_count > AUTOML_LARGE_TEXT_DATASET:
if "checkpoints_per_epoch" not in config[TRAINER] and "steps_per_checkpoint" not in config[TRAINER]:
checkpoints_per_epoch = max(2, math.floor(row_count / AUTOML_MAX_ROWS_PER_CHECKPOINT))
config[TRAINER][
"checkpoints_per_epoch"
] = checkpoints_per_epoch # decrease latency to get model accuracy signal
if "evaluate_training_set" not in config[TRAINER]:
config[TRAINER]["evaluate_training_set"] = False # reduce overhead for increased evaluation frequency
if not fits_in_memory:
# Switch to smaller pre-trained model encoder for large datasets.
_update_text_encoder(config["input_features"], AUTOML_DEFAULT_TEXT_ENCODER, AUTOML_SMALLER_TEXT_ENCODER)
modified_config = copy.deepcopy(config)
modified_config[HYPEROPT]["parameters"] = modified_hyperparam_search_space
modified_config[HYPEROPT]["executor"]["num_samples"] = _update_num_samples(
modified_config[HYPEROPT]["executor"]["num_samples"], modified_hyperparam_search_space
)
return modified_config, fits_in_memory
================================================
FILE: ludwig/automl/automl.py
================================================
"""automl.py.
Driver script which:
(1) Builds a base config by performing type inference and populating config
w/default combiner parameters, training parameters, and hyperopt search space
(2) Tunes config based on resource constraints
(3) Runs hyperparameter optimization experiment
"""
import argparse
import copy
import logging
import os
import warnings
from typing import Any
import numpy as np
import pandas as pd
import yaml
from ludwig.api import LudwigModel
from ludwig.api_annotations import PublicAPI
from ludwig.automl.base_config import (
create_default_config,
DatasetInfo,
get_dataset_info,
get_features_config,
get_reference_configs,
)
from ludwig.backend import Backend, initialize_backend
from ludwig.constants import (
AUTO,
AUTOML_DEFAULT_IMAGE_ENCODER,
AUTOML_DEFAULT_TABULAR_MODEL,
AUTOML_DEFAULT_TEXT_ENCODER,
BINARY,
CATEGORY,
ENCODER,
HYPEROPT,
IMAGE,
INPUT_FEATURES,
NAME,
NUMBER,
OUTPUT_FEATURES,
TABULAR,
TEXT,
TRAINER,
TYPE,
)
from ludwig.contrib import add_contrib_callback_args
from ludwig.data.cache.types import CacheableDataset
from ludwig.datasets import load_dataset_uris
from ludwig.globals import LUDWIG_VERSION, MODEL_FILE_NAME
from ludwig.hyperopt.run import hyperopt
from ludwig.schema.model_config import ModelConfig
from ludwig.types import ModelConfigDict
from ludwig.utils.automl.ray_utils import _ray_init
from ludwig.utils.automl.utils import _add_transfer_config, get_model_type, set_output_feature_metric
from ludwig.utils.data_utils import load_dataset, use_credentials
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import open_file
from ludwig.utils.heuristics import get_auto_learning_rate
from ludwig.utils.misc_utils import merge_dict
from ludwig.utils.print_utils import print_ludwig
try:
import dask.dataframe as dd
from ray.tune import ExperimentAnalysis
except ImportError as e:
raise RuntimeError("ray is not installed. In order to use auto_train please run pip install ludwig[ray]") from e
logger = logging.getLogger(__name__)
OUTPUT_DIR = "."
TABULAR_TYPES = {CATEGORY, NUMBER, BINARY}
class AutoTrainResults:
def __init__(self, experiment_analysis: ExperimentAnalysis, creds: dict[str, Any] = None):
self._experiment_analysis = experiment_analysis
self._creds = creds
@property
def experiment_analysis(self):
return self._experiment_analysis
@property
def best_trial_id(self) -> str:
return self._experiment_analysis.best_trial.trial_id
@property
def best_model(self) -> LudwigModel | None:
checkpoint = self._experiment_analysis.best_checkpoint
if checkpoint is None:
logger.warning("No best model found")
return None
# Use credentials context for remote checkpoints that may need custom auth
with use_credentials(self._creds):
with checkpoint.as_directory() as ckpt_path:
model_dir = os.path.join(ckpt_path, MODEL_FILE_NAME)
if not os.path.isdir(model_dir):
logger.warning(
f"Best checkpoint does not contain model files at {model_dir}. "
"The trial may not have completed a full training epoch."
)
return None
# Ray Tune checkpoints contain training_checkpoints/ (from
# mid-training saves) but not model_weights (only saved after
# training completes). Load from the training checkpoint.
return LudwigModel.load(model_dir, from_checkpoint=True)
@PublicAPI
def auto_train(
dataset: str | pd.DataFrame | dd.DataFrame,
target: str,
time_limit_s: int | float,
output_directory: str = OUTPUT_DIR,
tune_for_memory: bool = False,
user_config: dict = None,
random_seed: int = default_random_seed,
use_reference_config: bool = False,
**kwargs,
) -> AutoTrainResults:
"""Main auto train API. Builds configs for each model type (e.g. concat, tabnet, transformer), selects a model
based on dataset attributes, and then runs a hyperparameter optimization experiment.
All batch size and learning rate tuning is done at training time.
# Inputs
:param dataset: (str, pd.DataFrame, dd.DataFrame) data source to train over.
:param target: (str) name of target feature
:param time_limit_s: (int, float) total time allocated to auto_train. acts
as the stopping parameter
:param output_directory: (str) directory into which to write results, defaults to
current working directory.
:param tune_for_memory: (bool) refine hyperopt search space for available
host / GPU memory
:param user_config: (dict) override automatic selection of specified config items
:param random_seed: (int, default: `42`) a random seed that will be used anywhere
there is a call to a random number generator, including
hyperparameter search sampling, as well as data splitting,
parameter initialization and training set shuffling
:param use_reference_config: (bool) refine hyperopt search space by setting first
search point from reference model config, if any
:param kwargs: additional keyword args passed down to `ludwig.hyperopt.run.hyperopt`.
# Returns
:return: (AutoTrainResults) results containing hyperopt experiments and best model
"""
config = create_auto_config(
dataset,
target,
time_limit_s,
tune_for_memory,
user_config,
random_seed,
use_reference_config=use_reference_config,
)
return train_with_config(dataset, config, output_directory=output_directory, random_seed=random_seed, **kwargs)
@PublicAPI
def create_auto_config(
dataset: str | pd.DataFrame | dd.DataFrame | DatasetInfo,
target: str | list[str],
time_limit_s: int | float,
tune_for_memory: bool = False,
user_config: dict = None,
random_seed: int = default_random_seed,
imbalance_threshold: float = 0.9,
use_reference_config: bool = False,
backend: Backend | str = None,
) -> ModelConfigDict:
"""Returns an auto-generated Ludwig config with the intent of training the best model on the given dataset /
target in the given time limit.
# Inputs
:param dataset: (str, pd.DataFrame, dd.DataFrame, DatasetInfo) data source to train over.
:param target: (str, List[str]) name of target feature
:param time_limit_s: (int, float) total time allocated to auto_train. acts
as the stopping parameter
:param tune_for_memory: (bool) DEPRECATED refine hyperopt search space for available
host / GPU memory
:param user_config: (dict) override automatic selection of specified config items
:param random_seed: (int, default: `42`) a random seed that will be used anywhere
there is a call to a random number generator, including
hyperparameter search sampling, as well as data splitting,
parameter initialization and training set shuffling
:param imbalance_threshold: (float) maximum imbalance ratio (minority / majority) to perform stratified sampling
:param use_reference_config: (bool) refine hyperopt search space by setting first
search point from reference model config, if any
# Return
:return: (dict) selected model configuration
"""
backend = initialize_backend(backend)
if not isinstance(dataset, DatasetInfo):
# preload ludwig datasets
dataset, _, _, _ = load_dataset_uris(dataset, None, None, None, backend)
if isinstance(dataset, CacheableDataset):
dataset = dataset.unwrap()
dataset = load_dataset(dataset, df_lib=backend.df_engine.df_lib)
dataset_info = get_dataset_info(dataset) if not isinstance(dataset, DatasetInfo) else dataset
features_config = create_features_config(dataset_info, target)
return create_automl_config_for_features(
features_config,
dataset_info,
target,
time_limit_s=time_limit_s,
user_config=user_config,
random_seed=random_seed,
imbalance_threshold=imbalance_threshold,
use_reference_config=use_reference_config,
backend=backend,
)
@PublicAPI
def create_automl_config_for_features(
features_config: ModelConfigDict,
dataset_info: DatasetInfo,
target: str | list[str],
time_limit_s: int | float,
tune_for_memory: bool = False,
user_config: dict = None,
random_seed: int = default_random_seed,
imbalance_threshold: float = 0.9,
use_reference_config: bool = False,
backend: Backend | str = None,
) -> ModelConfigDict:
default_configs = create_default_config(
features_config, dataset_info, target, time_limit_s, random_seed, imbalance_threshold, backend
)
model_config, _, _ = _model_select(dataset_info, default_configs, user_config, use_reference_config)
if tune_for_memory:
warnings.warn("`tune_for_memory=True` is deprecated, `batch_size=auto` will be used instead")
return model_config
@PublicAPI
def create_features_config(
dataset_info: DatasetInfo,
target_name: str | list[str] = None,
) -> ModelConfigDict:
return get_features_config(dataset_info.fields, dataset_info.row_count, target_name)
@PublicAPI
def train_with_config(
dataset: str | pd.DataFrame | dd.DataFrame,
config: dict,
output_directory: str = OUTPUT_DIR,
random_seed: int = default_random_seed,
**kwargs,
) -> AutoTrainResults:
"""Performs hyperparameter optimization with respect to the given config and selects the best model.
# Inputs
:param dataset: (str, pd.DataFrame, dd.DataFrame) data source to train over.
:param config: (dict) Ludwig configuration to use for training, e.g. one returned
by `create_auto_config`.
:param output_directory: (str) directory into which to write results, defaults to
current working directory.
:param random_seed: (int, default: `42`) a random seed that will be used anywhere
there is a call to a random number generator, including
hyperparameter search sampling, as well as data splitting,
parameter initialization and training set shuffling
:param kwargs: additional keyword args passed down to `ludwig.hyperopt.run.hyperopt`.
# Returns
:return: (AutoTrainResults) results containing hyperopt experiments and best model
"""
_ray_init()
model_type = get_model_type(config)
hyperopt_results = _train(
config, dataset, output_directory=output_directory, model_name=model_type, random_seed=random_seed, **kwargs
)
# catch edge case where metric_score is nan
# TODO (ASN): Decide how we want to proceed if at least one trial has
# completed
for trial in hyperopt_results.ordered_trials:
if isinstance(trial.metric_score, str) or np.isnan(trial.metric_score):
warnings.warn(
"There was an error running the experiment. "
"A trial failed to start. "
"Consider increasing the time budget for experiment. "
)
# Extract credentials needed to pull artifacts, if provided
creds = None
backend: Backend = initialize_backend(kwargs.get("backend"))
if backend is not None:
creds = backend.storage.artifacts.credentials
experiment_analysis = hyperopt_results.experiment_analysis
return AutoTrainResults(experiment_analysis, creds)
def _model_select(
dataset_info: DatasetInfo,
default_configs,
user_config,
use_reference_config: bool,
):
"""Performs model selection based on dataset or user specified model.
Note: Current implementation returns tabnet by default for tabular datasets.
"""
fields = dataset_info.fields
base_config = copy.deepcopy(default_configs["base_config"])
model_category = None
input_features = default_configs["base_config"]["input_features"]
# tabular dataset heuristics
if len(fields) > 3 and all(f[TYPE] in TABULAR_TYPES for f in input_features):
model_category = TABULAR
base_config = merge_dict(base_config, default_configs["combiner"][AUTOML_DEFAULT_TABULAR_MODEL])
# override combiner heuristic if explicitly provided by user
if user_config is not None:
if "combiner" in user_config.keys():
model_type = user_config["combiner"]["type"]
base_config = merge_dict(base_config, default_configs["combiner"][model_type])
else:
# text heuristics
for i, input_feature in enumerate(input_features):
base_config_input_feature = base_config["input_features"][i]
# default text encoder is bert
if input_feature[TYPE] == TEXT:
model_category = TEXT
if ENCODER in input_feature:
base_config_input_feature[ENCODER][TYPE] = AUTOML_DEFAULT_TEXT_ENCODER
else:
base_config_input_feature[ENCODER] = {TYPE: AUTOML_DEFAULT_TEXT_ENCODER}
# TODO(shreya): Should this hyperopt config param be set here?
base_config[HYPEROPT]["executor"]["num_samples"] = 5 # set for small hyperparameter search space
base_config = merge_dict(base_config, default_configs[TEXT][AUTOML_DEFAULT_TEXT_ENCODER])
# TODO (ASN): add image heuristics
if input_feature[TYPE] == IMAGE:
model_category = IMAGE
if ENCODER in input_feature:
base_config_input_feature[ENCODER][TYPE] = AUTOML_DEFAULT_IMAGE_ENCODER
else:
base_config_input_feature[ENCODER] = {TYPE: AUTOML_DEFAULT_IMAGE_ENCODER}
# Merge combiner config
base_config = merge_dict(base_config, default_configs["combiner"]["concat"])
# Adjust learning rate based on other config settings
if base_config[TRAINER]["learning_rate"] == AUTO:
# Add a fake output feature to ensure we can load the ModelConfig, as we expect there to be at least
# one output feature in all cases
# TODO(travis): less hacky way to do this, we should probably allow ModelConfig to be created without output
# features
load_config = copy.deepcopy(base_config)
if not load_config.get(OUTPUT_FEATURES):
load_config[OUTPUT_FEATURES] = [{"name": "fake", "type": "binary"}]
base_config[TRAINER]["learning_rate"] = get_auto_learning_rate(ModelConfig.from_dict(load_config))
# override and constrain automl config based on user specified values
if user_config is not None:
base_config = merge_dict(base_config, user_config)
# remove all parameters from hyperparameter search that user has
# provided explicit values for
hyperopt_params = copy.deepcopy(base_config["hyperopt"]["parameters"])
# iterate with a distinct loop variable so the dict above is not shadowed
for hyperopt_param in hyperopt_params.keys():
config_section, param = hyperopt_param.split(".")[0], hyperopt_param.split(".")[1]
if config_section in user_config.keys():
if param in user_config[config_section]:
del base_config["hyperopt"]["parameters"][hyperopt_param]
# if single output feature, set relevant metric and goal if not already set
base_config = set_output_feature_metric(base_config)
# add as initial trial in the automl search the hyperparameter settings from
# the best model for a similar dataset and matching model type, if any.
if use_reference_config:
ref_configs = get_reference_configs()
base_config = _add_transfer_config(base_config, ref_configs)
return base_config, model_category, dataset_info.row_count
def _train(
config: dict,
dataset: str | pd.DataFrame | dd.DataFrame,
output_directory: str,
model_name: str,
random_seed: int,
**kwargs,
):
hyperopt_results = hyperopt(
config,
dataset=dataset,
output_directory=output_directory,
model_name=model_name,
random_seed=random_seed,
skip_save_log=True, # avoid per-step log overhead by default
**kwargs,
)
return hyperopt_results
def init_config(
dataset: str,
target: str | list[str],
time_limit_s: int | float,
tune_for_memory: bool = False,
suggested: bool = False,
hyperopt: bool = False,
output: str = None,
random_seed: int = default_random_seed,
use_reference_config: bool = False,
**kwargs,
):
config = create_auto_config(
dataset=dataset,
target=target,
time_limit_s=time_limit_s,
random_seed=random_seed,
use_reference_config=use_reference_config,
tune_for_memory=tune_for_memory,
)
if HYPEROPT in config and not hyperopt:
del config[HYPEROPT]
if not suggested:
# Only use inputs and outputs
minimal_config = {
INPUT_FEATURES: [{"name": f[NAME], "type": f[TYPE]} for f in config[INPUT_FEATURES]],
OUTPUT_FEATURES: [{"name": f[NAME], "type": f[TYPE]} for f in config[OUTPUT_FEATURES]],
}
if hyperopt:
minimal_config[HYPEROPT] = config[HYPEROPT]
config = minimal_config
if output is None:
print(yaml.safe_dump(config, None, sort_keys=False))
else:
with open_file(output, "w") as f:
yaml.safe_dump(config, f, sort_keys=False)
def cli_init_config(sys_argv):
parser = argparse.ArgumentParser(
description="This script initializes a valid config from a dataset.",
prog="ludwig init_config",
usage="%(prog)s [options]",
)
parser.add_argument(
"-d",
"--dataset",
type=str,
help="input data file path",
)
parser.add_argument(
"-t",
"--target",
type=str,
help="target(s) to predict as output features of the model",
action="append",
required=False,
)
parser.add_argument(
"--time_limit_s",
type=int,
help="time limit to train the model in seconds when using hyperopt",
required=False,
)
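# NOTE: argparse calls bool() on the raw string, so any non-empty value (including "False") parses as True.
# The same caveat applies to the other `type=bool` flags below.
parser.add_argument(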
"--suggested",
type=bool,
help="use suggested config from automl, otherwise only use inferred types and return a minimal config",
default=False,
required=False,
)
parser.add_argument(
"--hyperopt",
type=bool,
help="include automl hyperopt config",
default=False,
required=False,
)
parser.add_argument(
"--random_seed",
type=int,
help="seed for random number generators used in hyperopt to improve repeatability",
required=False,
)
parser.add_argument(
"--use_reference_config",
type=bool,
help="refine hyperopt search space by setting first search point from stored reference model config",
default=False,
required=False,
)
parser.add_argument(
"-o",
"--output",
type=str,
help="output initialized YAML config path",
required=False,
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("init_config", *sys_argv)
print_ludwig("Init Config", LUDWIG_VERSION)
init_config(**vars(args))
================================================
FILE: ludwig/automl/base_config.py
================================================
"""Uses heuristics to build ludwig configuration file:
(1) infer types based on dataset
(2) populate with
- default combiner parameters,
- preprocessing parameters,
- combiner specific default training parameters,
- combiner specific hyperopt space
- feature parameters
(3) add machine resources
(base implementation -- # CPU, # GPU)
"""
import logging
import os
from dataclasses import dataclass
from typing import Any
import dask.dataframe as dd
import numpy as np
import pandas as pd
import yaml
from dataclasses_json import dataclass_json, LetterCase
from tqdm import tqdm
from ludwig.api_annotations import DeveloperAPI
from ludwig.backend import Backend
from ludwig.constants import (
COLUMN,
COMBINER,
ENCODER,
EXECUTOR,
HYPEROPT,
INPUT_FEATURES,
PREPROCESSING,
SCHEDULER,
SEARCH_ALG,
SPLIT,
TEXT,
TYPE,
)
from ludwig.types import ModelConfigDict
from ludwig.utils.automl.data_source import DataSource, wrap_data_source
from ludwig.utils.automl.field_info import FieldConfig, FieldInfo, FieldMetadata
from ludwig.utils.automl.type_inference import infer_type, should_exclude
from ludwig.utils.data_utils import load_yaml
from ludwig.utils.misc_utils import merge_dict
from ludwig.utils.system_utils import Resources
logger = logging.getLogger(__name__)
PATH_HERE = os.path.abspath(os.path.dirname(__file__))
CONFIG_DIR = os.path.join(PATH_HERE, "defaults")
BASE_AUTOML_CONFIG = os.path.join(CONFIG_DIR, "base_automl_config.yaml")
REFERENCE_CONFIGS = os.path.join(CONFIG_DIR, "reference_configs.yaml")
combiner_defaults = {
"concat": os.path.join(CONFIG_DIR, "combiner/concat_config.yaml"),
"tabnet": os.path.join(CONFIG_DIR, "combiner/tabnet_config.yaml"),
"transformer": os.path.join(CONFIG_DIR, "combiner/transformer_config.yaml"),
}
encoder_defaults = {"text": {"bert": os.path.join(CONFIG_DIR, "text/bert_config.yaml")}}
# Cap for number of distinct values to return.
MAX_DISTINCT_VALUES_TO_RETURN = 10
@DeveloperAPI
@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass
class DatasetInfo:
fields: list[FieldInfo]
row_count: int
size_bytes: int = -1
def allocate_experiment_resources(resources: Resources) -> dict:
"""Allocates ray trial resources based on available resources.
# Inputs
:param resources: (Resources) specifies all available GPUs, CPUs and associated metadata of the machines
(i.e. memory)
# Return
:return: (dict) gpu and cpu resources per trial
"""
# TODO (ASN):
# (1) expand logic to support multiple GPUs per trial (multi-gpu training)
# (2) add support for kubernetes namespace (if applicable)
# (3) add support for smarter allocation based on size of GPU memory
experiment_resources = {"cpu_resources_per_trial": 1}
gpu_count, cpu_count = resources.gpus, resources.cpus
if gpu_count > 0:
experiment_resources.update({"gpu_resources_per_trial": 1})
if cpu_count > 1:
cpus_per_trial = max(int(cpu_count / gpu_count), 1)
experiment_resources["cpu_resources_per_trial"] = cpus_per_trial
return experiment_resources
def get_resource_aware_hyperopt_config(
experiment_resources: dict[str, Any], time_limit_s: int | float, random_seed: int
) -> dict[str, Any]:
"""Returns a Ludwig config with the hyperopt section populated with appropriate parameters.
Hyperopt parameters are intended to be appropriate for the given resources and time limit.
"""
executor = experiment_resources
executor.update({"time_budget_s": time_limit_s})
if time_limit_s is not None:
executor.update({SCHEDULER: {"max_t": time_limit_s}})
return {
HYPEROPT: {
SEARCH_ALG: {"random_state_seed": random_seed},
EXECUTOR: executor,
},
}
def _get_stratify_split_config(field_meta: FieldMetadata) -> dict:
return {
PREPROCESSING: {
SPLIT: {
TYPE: "stratify",
COLUMN: field_meta.name,
}
}
}
def get_default_automl_hyperopt() -> dict[str, Any]:
"""Returns general, default settings for hyperopt.
For example:
- We set a random_state_seed for sample sequence repeatability
- We use an increased reduction_factor to get more pruning/exploration.
TODO: If settings seem reasonable, consider building this into the hyperopt schema, directly.
"""
return yaml.safe_load("""
search_alg:
type: variant_generator
executor:
type: ray
num_samples: 10
time_budget_s: 3600
scheduler:
type: async_hyperband
time_attr: time_total_s
max_t: 3600
grace_period: 72
reduction_factor: 5
""")
def create_default_config(
features_config: ModelConfigDict,
dataset_info: DatasetInfo,
target_name: str | list[str],
time_limit_s: int | float,
random_seed: int,
imbalance_threshold: float = 0.9,
backend: Backend = None,
) -> dict:
"""Returns auto_train configs for three available combiner models. Coordinates the following tasks:
- extracts fields and generates list of FieldInfo objects
- gets field metadata (e.g. avg. words, total non-null entries)
- builds input_features and output_features section of config
- for imbalanced datasets, a preprocessing section is added to perform stratified sampling if the imbalance ratio
is smaller than imbalance_threshold
- for each combiner, adds default training, hyperopt
- infers resource constraints and adds gpu and cpu resource allocation per
trial
# Inputs
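:param features_config: (ModelConfigDict) input_features and output_features sections to merge into the base config
:param dataset_info: (DatasetInfo) dataset info object holding the FieldInfo list and row count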
:param target_name: (str, List[str]) name of target feature
:param time_limit_s: (int, float) total time allocated to auto_train. acts
as the stopping parameter
:param random_seed: (int) a random seed that will be used anywhere
there is a call to a random number generator, including
hyperparameter search sampling, as well as data splitting,
parameter initialization and training set shuffling
:param imbalance_threshold: (float) maximum imbalance ratio (minority / majority) to perform stratified sampling
:param backend: (Backend) backend to use for training. Must not be None, since it is queried for
available resources.
# Return
:return: (dict) dictionary containing the base config, the default encoder configs for each feature type
present in the dataset, and the default configs for all available combiner types
"""
base_automl_config = load_yaml(BASE_AUTOML_CONFIG)
base_automl_config.update(features_config)
targets = convert_targets(target_name)
features_metadata = get_field_metadata(dataset_info.fields, dataset_info.row_count, targets)
# Handle expensive features for CPU
resources = backend.get_available_resources()
for ifeature in base_automl_config[INPUT_FEATURES]:
if resources.gpus == 0:
if ifeature[TYPE] == TEXT:
# When no GPUs are available, default to the embed encoder, which is fast enough for CPU
ifeature[ENCODER] = {"type": "embed"}
# create set of all feature types appearing in the dataset
feature_types = [[feat[TYPE] for feat in features] for features in features_config.values()]
feature_types = set(sum(feature_types, []))
model_configs = {}
# update hyperopt config
experiment_resources = allocate_experiment_resources(resources)
base_automl_config = merge_dict(
base_automl_config, get_resource_aware_hyperopt_config(experiment_resources, time_limit_s, random_seed)
)
# add preprocessing section if single output feature is imbalanced
outputs_metadata = [f for f in features_metadata if f.mode == "output"]
if len(outputs_metadata) == 1:
of_meta = outputs_metadata[0]
is_categorical = of_meta.config.type in ["category", "binary"]
is_imbalanced = of_meta.imbalance_ratio < imbalance_threshold
if is_categorical and is_imbalanced:
base_automl_config.update(_get_stratify_split_config(of_meta))
model_configs["base_config"] = base_automl_config
# read in all encoder configs
for feat_type, default_configs in encoder_defaults.items():
if feat_type in feature_types:
if feat_type not in model_configs.keys():
model_configs[feat_type] = {}
for encoder_name, encoder_config_path in default_configs.items():
model_configs[feat_type][encoder_name] = load_yaml(encoder_config_path)
# read in all combiner configs
model_configs[COMBINER] = {}
for combiner_type, default_config in combiner_defaults.items():
combiner_config = load_yaml(default_config)
model_configs[COMBINER][combiner_type] = combiner_config
return model_configs
# Read in the score and configuration of a reference model trained by Ludwig for each dataset in a list.
def get_reference_configs() -> dict:
reference_configs = load_yaml(REFERENCE_CONFIGS)
return reference_configs
def get_dataset_info(df: pd.DataFrame | dd.DataFrame) -> DatasetInfo:
"""Constructs FieldInfo objects for each feature in dataset. These objects are used for downstream type
inference.
# Inputs
:param df: (Union[pd.DataFrame, dd.DataFrame]) Pandas or Dask dataframe.
# Return
:return: (DatasetInfo) Structure containing list of FieldInfo objects.
"""
source = wrap_data_source(df)
return get_dataset_info_from_source(source)
def is_field_boolean(source: DataSource, field: str) -> bool:
"""Returns a boolean indicating whether the object field should have a bool dtype.
A column with object dtype is treated as a bool column when it has at most 3 distinct values and every one of
them is either a bool or NaN.
"""
unique_values = source.df[field].unique()
if len(unique_values) <= 3:
for entry in unique_values:
try:
if np.isnan(entry):
continue
except TypeError:
# For some field types such as object arrays, np.isnan throws a TypeError
# In this case, do nothing and proceed to checking if the entry is a bool object
pass
if isinstance(entry, bool):
continue
return False
return True
return False
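# A minimal, standard-library sketch of the same check applied to a plain list of distinct values.
# It assumes the values are ordinary Python objects (as in an object-dtype column) and uses
# math.isnan in place of np.isnan; looks_like_nullable_bool is a hypothetical name.

```python
import math


def looks_like_nullable_bool(unique_values: list) -> bool:
    """Return True if there are at most 3 distinct values and each is a bool or NaN."""
    if len(unique_values) > 3:
        return False
    for entry in unique_values:
        if isinstance(entry, float) and math.isnan(entry):
            continue  # missing value, ignored
        if isinstance(entry, bool):
            continue
        return False
    return True


nullable_bools = looks_like_nullable_bool([True, False, float("nan")])
strings = looks_like_nullable_bool(["yes", "no"])
with_none = looks_like_nullable_bool([True, False, None])
print(nullable_bools, strings, with_none)
```

# Note that None is not NaN, so in this sketch a column whose missing values are None
# rather than NaN is not classified as boolean.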
@DeveloperAPI
def get_dataset_info_from_source(source: DataSource) -> DatasetInfo:
"""Constructs FieldInfo objects for each feature in dataset. These objects are used for downstream type
inference.
# Inputs
:param source: (DataSource) A wrapper around a data source, which may represent a pandas or Dask dataframe.
# Return
:return: (DatasetInfo) Structure containing list of FieldInfo objects.
"""
row_count = len(source)
fields = []
for field in tqdm(source.columns, desc="Analyzing fields", total=len(source.columns)):
logger.info(f"Analyzing field: {field}")
dtype = source.get_dtype(field)
num_distinct_values, distinct_values, distinct_values_balance = source.get_distinct_values(
field, MAX_DISTINCT_VALUES_TO_RETURN
)
nonnull_values = source.get_nonnull_values(field)
image_values = source.get_image_values(field)
audio_values = source.get_audio_values(field)
if dtype == "object":
# Check if it is a nullboolean field. We do this since if you read a csv with
# pandas that has a column of booleans and some missing values, the column is
# interpreted as object dtype instead of bool
if is_field_boolean(source, field):
dtype = "bool"
avg_words = None
if source.is_string_type(dtype):
try:
avg_words = source.get_avg_num_tokens(field)
except AttributeError:
# Series is not actually a string type despite being an object, e.g., Decimal, Datetime, etc.
avg_words = None
fields.append(
FieldInfo(
name=field,
dtype=dtype,
distinct_values=distinct_values,
num_distinct_values=num_distinct_values,
distinct_values_balance=distinct_values_balance,
nonnull_values=nonnull_values,
image_values=image_values,
audio_values=audio_values,
avg_words=avg_words,
)
)
return DatasetInfo(fields=fields, row_count=row_count, size_bytes=source.size_bytes())
def get_features_config(
fields: list[FieldInfo],
row_count: int,
target_name: str | list[str] = None,
) -> dict:
"""Constructs FieldInfo objects for each feature in dataset. These objects are used for downstream type
inference.
# Inputs
:param fields: (List[FieldInfo]) FieldInfo objects for all fields in dataset
:param row_count: (int) total number of entries in original dataset :param target_name (str, List[str]) name of
target feature # Return
:return: (dict) section of auto_train config for input_features and output_features
"""
targets = convert_targets(target_name)
metadata = get_field_metadata(fields, row_count, targets)
return get_config_from_metadata(metadata, targets)
def convert_targets(target_name: str | list[str] = None) -> set[str]:
targets = target_name
if isinstance(targets, str):
targets = [targets]
if targets is None:
targets = []
return set(targets)
def get_config_from_metadata(metadata: list[FieldMetadata], targets: set[str] = None) -> dict:
"""Builds input/output feature sections of auto-train config using field metadata.
# Inputs
:param metadata: (List[FieldMetadata]) field descriptions
:param targets: (Set[str]) names of target features
# Return
:return: (dict) section of auto_train config for input_features and output_features
"""
config = {
"input_features": [],
"output_features": [],
}
for field_meta in metadata:
if targets and field_meta.name in targets:
config["output_features"].append(field_meta.config.to_dict())
elif not field_meta.excluded and field_meta.mode == "input":
config["input_features"].append(field_meta.config.to_dict())
return config
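# The routing rule above can be sketched with a small stand-in for FieldMetadata.
# Meta and split_features are hypothetical names used only for illustration, and the sketch
# collects field names instead of calling config.to_dict().

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Meta:
    """Hypothetical stand-in for FieldMetadata."""

    name: str
    mode: str
    excluded: bool = False


def split_features(metadata: list[Meta], targets: set[str]) -> dict[str, list[str]]:
    config: dict[str, list[str]] = {"input_features": [], "output_features": []}
    for m in metadata:
        if m.name in targets:
            # Targets always become output features.
            config["output_features"].append(m.name)
        elif not m.excluded and m.mode == "input":
            # Excluded fields and non-input modes (e.g. "split") are dropped.
            config["input_features"].append(m.name)
    return config


metadata = [
    Meta("age", "input"),
    Meta("row_id", "input", excluded=True),
    Meta("split", "split"),
    Meta("income", "output"),
]
result = split_features(metadata, {"income"})
print(result)
```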
@DeveloperAPI
def get_field_metadata(fields: list[FieldInfo], row_count: int, targets: set[str] = None) -> list[FieldMetadata]:
"""Computes metadata for each field in dataset.
# Inputs
:param fields: (List[FieldInfo]) FieldInfo objects for all fields in dataset
:param row_count: (int) total number of entries in original dataset
:param targets: (Set[str]) names of target features
# Return
:return: (List[FieldMetadata]) list of objects containing metadata for each field
"""
metadata = []
column_count = len(fields)
for idx, field in enumerate(fields):
missing_value_percent = 1 - float(field.nonnull_values) / row_count
dtype = infer_type(field, missing_value_percent, row_count)
metadata.append(
FieldMetadata(
name=field.name,
config=FieldConfig(
name=field.name,
column=field.name,
type=dtype,
),
excluded=should_exclude(idx, field, dtype, column_count, row_count, targets),
mode=infer_mode(field, targets),
missing_values=missing_value_percent,
imbalance_ratio=field.distinct_values_balance,
)
)
return metadata
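# Despite its name, missing_value_percent above is a fraction in [0, 1], not a percentage.
# A small sketch of the same arithmetic follows; missing_fraction is a hypothetical helper,
# and the guard against an empty dataset is an addition that the original code does not have.

```python
def missing_fraction(nonnull_values: int, row_count: int) -> float:
    """Fraction of rows in which the field is missing."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    return 1 - float(nonnull_values) / row_count


# 75 non-null entries out of 100 rows means a quarter of the values are missing.
fraction = missing_fraction(75, 100)
print(fraction)
```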
def infer_mode(field: FieldInfo, targets: set[str] = None) -> str:
if targets and field.name in targets:
return "output"
if field.name.lower() == "split":
return "split"
return "input"
================================================
FILE: ludwig/automl/defaults/base_automl_config.yaml
================================================
trainer:
batch_size: auto #256
learning_rate: auto #.001
# validation_metric: accuracy
hyperopt:
search_alg:
# Gives results like default + supports random_state_seed for sample sequence repeatability
type: variant_generator
executor:
type: ray
num_samples: 10
time_budget_s: 7200
scheduler:
type: async_hyperband
time_attr: time_total_s
max_t: 7200
grace_period: 72
# Increased over default to get more pruning/exploration
reduction_factor: 5
================================================
FILE: ludwig/automl/defaults/combiner/concat_config.yaml
================================================
combiner:
type: concat
hyperopt:
# goal: maximize
parameters:
combiner.num_fc_layers:
space: randint
lower: 1
upper: 4
combiner.output_size:
space: choice
categories: [128, 256]
combiner.dropout:
space: uniform
lower: 0.0
upper: 0.1
# This needs to be loguniform due to invalid schemas created by merging with a choice parameter space. See the
# comment in ludwig/automl/defaults/text/bert_config.yaml for more information.
trainer.learning_rate:
space: loguniform
lower: 0.00002
upper: 0.001
trainer.batch_size:
space: choice
categories: [64, 128, 256, 512, 1024]
================================================
FILE: ludwig/automl/defaults/combiner/tabnet_config.yaml
================================================
combiner:
type: tabnet
trainer:
batch_size: auto
learning_rate_scaling: sqrt
learning_rate_scheduler:
decay: exponential
decay_steps: 20000
decay_rate: 0.8
optimizer:
type: adam
hyperopt:
parameters:
trainer.learning_rate:
space: loguniform
lower: 0.00002
upper: 0.001
trainer.learning_rate_scheduler.decay_rate:
space: choice
categories: [0.8, 0.9, 0.95]
trainer.learning_rate_scheduler.decay_steps:
space: choice
categories: [500, 2000, 8000, 10000, 20000]
combiner.size:
space: choice
categories: [8, 16, 24, 32, 64]
combiner.output_size:
space: choice
categories: [8, 16, 24, 32, 64, 128]
combiner.num_steps:
space: choice
categories: [3, 4, 5, 6, 7, 8, 9, 10]
combiner.relaxation_factor:
space: choice
categories: [1.0, 1.2, 1.5, 2.0]
combiner.sparsity:
space: choice
categories: [0.0, 0.000001, 0.0001, 0.001, 0.01, 0.1]
combiner.bn_virtual_bs:
space: choice
categories: [256, 512, 1024, 2048, 4096]
combiner.bn_momentum:
space: choice
categories: [0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
================================================
FILE: ludwig/automl/defaults/combiner/transformer_config.yaml
================================================
combiner:
type: transformer
trainer:
batch_size: auto #256
learning_rate: auto #0.0001
# validation_metric: accuracy
hyperopt:
# goal: maximize
parameters:
trainer.learning_rate:
space: loguniform
lower: 0.00002
upper: 0.001
trainer.batch_size:
space: choice
categories: [64, 128, 256]
combiner.num_heads:
space: choice
categories: [4]
combiner.dropout:
space: uniform
lower: 0.1
upper: 0.3
combiner.num_layers:
space: randint
lower: 3
upper: 4
combiner.num_fc_layers:
space: choice
categories: [1, 2]
combiner.fc_dropout:
space: uniform
lower: 0.1
upper: 0.5
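# The loguniform and randint spaces used in these configs can be illustrated with the
# standard-library sketch below. It is not how the hyperopt backend samples; it assumes
# that loguniform draws uniformly in log space and that randint follows the
# lower-inclusive, upper-exclusive convention documented for Ray Tune.

```python
import math
import random


def sample_loguniform(rng: random.Random, lower: float, upper: float) -> float:
    # Uniform in log space, so each order of magnitude is equally likely.
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_randint(rng: random.Random, lower: int, upper: int) -> int:
    # Assumed convention: lower is inclusive, upper is exclusive.
    return rng.randrange(lower, upper)


rng = random.Random(42)
learning_rates = [sample_loguniform(rng, 0.00002, 0.001) for _ in range(1000)]
num_layers = {sample_randint(rng, 3, 4) for _ in range(100)}
print(min(learning_rates), max(learning_rates), num_layers)
```

# Under the exclusive-upper assumption, a randint space with lower 3 and upper 4 can only
# ever yield 3, so that parameter is effectively fixed.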
================================================
FILE: ludwig/automl/defaults/reference_configs.yaml
================================================
# Record the score and configuration of a reference model trained by Ludwig for specified datasets.
# This information is useful for Ludwig AutoML hyperparameter transfer learning or for manual experimentation.
datasets:
- name: adult_census_income
goal: maximize
metric: accuracy
validation_metric_score: 0.8682432174682617
training_rows: 29305
test_rows: 16281
validation_rows: 3256
config:
output_features:
- name: income
type: category
input_features:
- name: age
type: number
- name: workclass
type: category
- name: fnlwgt
type: number
- name: education
type: category
- name: education-num
type: number
- name: marital-status
type: category
- name: occupation
type: category
- name: relationship
type: category
- name: race
type: category
- name: sex
type: category
- name: capital-gain
type: number
- name: capital-loss
type: number
- name: hours-per-week
type: number
- name: native-country
type: category
combiner:
type: tabnet
size: 8 # N_a
output_size: 128 # N_d
sparsity: 0.0 # lambda_sparse
bn_momentum: 0.4 # m_B
num_steps: 3 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 4096 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.01
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 500
decay_rate: 0.95
validation_metric: accuracy
- name: allstate_claims_severity
goal: minimize
metric: root_mean_squared_error
validation_metric_score: 1915.5531005859375
training_rows: 131726
test_rows: 37750
validation_rows: 18842
config:
output_features:
- name: loss
type: number
input_features:
- column: cat1
name: cat1
type: category
- column: cat2
name: cat2
type: category
- column: cat3
name: cat3
type: category
- column: cat4
name: cat4
type: category
- column: cat5
name: cat5
type: category
- column: cat6
name: cat6
type: category
- column: cat7
name: cat7
type: category
- column: cat8
name: cat8
type: category
- column: cat9
name: cat9
type: category
- column: cat10
name: cat10
type: category
- column: cat11
name: cat11
type: category
- column: cat12
name: cat12
type: category
- column: cat13
name: cat13
type: category
- column: cat14
name: cat14
type: category
- column: cat15
name: cat15
type: category
- column: cat16
name: cat16
type: category
- column: cat17
name: cat17
type: category
- column: cat18
name: cat18
type: category
- column: cat19
name: cat19
type: category
- column: cat20
name: cat20
type: category
- column: cat21
name: cat21
type: category
- column: cat22
name: cat22
type: category
- column: cat23
name: cat23
type: category
- column: cat24
name: cat24
type: category
- column: cat25
name: cat25
type: category
- column: cat26
name: cat26
type: category
- column: cat27
name: cat27
type: category
- column: cat28
name: cat28
type: category
- column: cat29
name: cat29
type: category
- column: cat30
name: cat30
type: category
- column: cat31
name: cat31
type: category
- column: cat32
name: cat32
type: category
- column: cat33
name: cat33
type: category
- column: cat34
name: cat34
type: category
- column: cat35
name: cat35
type: category
- column: cat36
name: cat36
type: category
- column: cat37
name: cat37
type: category
- column: cat38
name: cat38
type: category
- column: cat39
name: cat39
type: category
- column: cat40
name: cat40
type: category
- column: cat41
name: cat41
type: category
- column: cat42
name: cat42
type: category
- column: cat43
name: cat43
type: category
- column: cat44
name: cat44
type: category
- column: cat45
name: cat45
type: category
- column: cat46
name: cat46
type: category
- column: cat47
name: cat47
type: category
- column: cat48
name: cat48
type: category
- column: cat49
name: cat49
type: category
- column: cat50
name: cat50
type: category
- column: cat51
name: cat51
type: category
- column: cat52
name: cat52
type: category
- column: cat53
name: cat53
type: category
- column: cat54
name: cat54
type: category
- column: cat55
name: cat55
type: category
- column: cat56
name: cat56
type: category
- column: cat57
name: cat57
type: category
- column: cat58
name: cat58
type: category
- column: cat59
name: cat59
type: category
- column: cat60
name: cat60
type: category
- column: cat61
name: cat61
type: category
- column: cat62
name: cat62
type: category
- column: cat63
name: cat63
type: category
- column: cat64
name: cat64
type: category
- column: cat65
name: cat65
type: category
- column: cat66
name: cat66
type: category
- column: cat67
name: cat67
type: category
- column: cat68
name: cat68
type: category
- column: cat69
name: cat69
type: category
- column: cat70
name: cat70
type: category
- column: cat71
name: cat71
type: category
- column: cat72
name: cat72
type: category
- column: cat73
name: cat73
type: category
- column: cat74
name: cat74
type: category
- column: cat75
name: cat75
type: category
- column: cat76
name: cat76
type: category
- column: cat77
name: cat77
type: category
- column: cat78
name: cat78
type: category
- column: cat79
name: cat79
type: category
- column: cat80
name: cat80
type: category
- column: cat81
name: cat81
type: category
- column: cat82
name: cat82
type: category
- column: cat83
name: cat83
type: category
- column: cat84
name: cat84
type: category
- column: cat85
name: cat85
type: category
- column: cat86
name: cat86
type: category
- column: cat87
name: cat87
type: category
- column: cat88
name: cat88
type: category
- column: cat89
name: cat89
type: category
- column: cat90
name: cat90
type: category
- column: cat91
name: cat91
type: category
- column: cat92
name: cat92
type: category
- column: cat93
name: cat93
type: category
- column: cat94
name: cat94
type: category
- column: cat95
name: cat95
type: category
- column: cat96
name: cat96
type: category
- column: cat97
name: cat97
type: category
- column: cat98
name: cat98
type: category
- column: cat99
name: cat99
type: category
- column: cat100
name: cat100
type: category
- column: cat101
name: cat101
type: category
- column: cat102
name: cat102
type: category
- column: cat103
name: cat103
type: category
- column: cat104
name: cat104
type: category
- column: cat105
name: cat105
type: category
- column: cat106
name: cat106
type: category
- column: cat107
name: cat107
type: category
- column: cat108
name: cat108
type: category
- column: cat109
name: cat109
type: category
- column: cat110
name: cat110
type: category
- column: cat111
name: cat111
type: category
- column: cat112
name: cat112
type: category
- column: cat113
name: cat113
type: category
- column: cat114
name: cat114
type: category
- column: cat115
name: cat115
type: category
- column: cat116
name: cat116
type: category
- column: cont1
name: cont1
type: number
- column: cont2
name: cont2
type: number
- column: cont3
name: cont3
type: number
- column: cont4
name: cont4
type: number
- column: cont5
name: cont5
type: number
- column: cont6
name: cont6
type: number
- column: cont7
name: cont7
type: number
- column: cont8
name: cont8
type: number
- column: cont9
name: cont9
type: number
- column: cont10
name: cont10
type: number
- column: cont11
name: cont11
type: number
- column: cont12
name: cont12
type: number
- column: cont13
name: cont13
type: number
- column: cont14
name: cont14
type: number
combiner:
type: tabnet
size: 128 # N_a
output_size: 8 # N_d
sparsity: 0.0 # lambda_sparse
bn_momentum: 0.02 # m_B
num_steps: 10 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 4096 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.01
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 10000
decay_rate: 0.9
validation_metric: root_mean_squared_error
- name: bnp_claims_management
goal: maximize
metric: accuracy
validation_metric_score: 0.7761691808700562
training_rows: 80101
test_rows: 22823
validation_rows: 11397
config:
output_features:
- name: target
type: binary
input_features:
- name: v1
type: number
- name: v2
type: number
- name: v3
type: category
- name: v4
type: number
- name: v5
type: number
- name: v6
type: number
- name: v7
type: number
- name: v8
type: number
- name: v9
type: number
- name: v10
type: number
- name: v11
type: number
- name: v12
type: number
- name: v13
type: number
- name: v14
type: number
- name: v15
type: number
- name: v16
type: number
- name: v17
type: number
- name: v18
type: number
- name: v19
type: number
- name: v20
type: number
- name: v21
type: number
- name: v22
type: category
- name: v23
type: number
- name: v24
type: category
- name: v25
type: number
- name: v26
type: number
- name: v27
type: number
- name: v28
type: number
- name: v29
type: number
- name: v30
type: category
- name: v31
type: category
- name: v32
type: number
- name: v33
type: number
- name: v34
type: number
- name: v35
type: number
- name: v36
type: number
- name: v37
type: number
- name: v38
type: number
- name: v39
type: number
- name: v40
type: number
- name: v41
type: number
- name: v42
type: number
- name: v43
type: number
- name: v44
type: number
- name: v45
type: number
- name: v46
type: number
- name: v47
type: category
- name: v48
type: number
- name: v49
type: number
- name: v50
type: number
- name: v51
type: number
- name: v52
type: category
- name: v53
type: number
- name: v54
type: number
- name: v55
type: number
- name: v56
type: category
- name: v57
type: number
- name: v58
type: number
- name: v59
type: number
- name: v60
type: number
- name: v61
type: number
- name: v62
type: category
- name: v63
type: number
- name: v64
type: number
- name: v65
type: number
- name: v66
type: category
- name: v67
type: number
- name: v68
type: number
- name: v69
type: number
- name: v70
type: number
- name: v71
type: category
- name: v72
type: category
- name: v73
type: number
- name: v74
type: category
- name: v75
type: category
- name: v76
type: number
- name: v77
type: number
- name: v78
type: number
- name: v79
type: category
- name: v80
type: number
- name: v81
type: number
- name: v82
type: number
- name: v83
type: number
- name: v84
type: number
- name: v85
type: number
- name: v86
type: number
- name: v87
type: number
- name: v88
type: number
- name: v89
type: number
- name: v90
type: number
- name: v91
type: category
- name: v92
type: number
- name: v93
type: number
- name: v94
type: number
- name: v95
type: number
- name: v96
type: number
- name: v97
type: number
- name: v98
type: number
- name: v99
type: number
- name: v100
type: number
- name: v101
type: number
- name: v102
type: number
- name: v103
type: number
- name: v104
type: number
- name: v105
type: number
- name: v106
type: number
- name: v107
type: category
- name: v108
type: number
- name: v109
type: number
- name: v110
type: category
- name: v111
type: number
- name: v112
type: category
- name: v113
type: category
- name: v114
type: number
- name: v115
type: number
- name: v116
type: number
- name: v117
type: number
- name: v118
type: number
- name: v119
type: number
- name: v120
type: number
- name: v121
type: number
- name: v122
type: number
- name: v123
type: number
- name: v124
type: number
- name: v125
type: category
- name: v126
type: number
- name: v127
type: number
- name: v128
type: number
- name: v129
type: number
- name: v130
type: number
- name: v131
type: number
combiner:
type: tabnet
size: 32 # N_a
output_size: 8 # N_d
sparsity: 0.0 # lambda_sparse
bn_momentum: 0.02 # m_B
num_steps: 3 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 256 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.01
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 2000
decay_rate: 0.4
validation_metric: accuracy
- name: ieee_fraud
goal: maximize
metric: accuracy
validation_metric_score: 0.9836957454681396
training_rows: 413498
test_rows: 118039
validation_rows: 59003
config:
output_features:
- name: isFraud
type: binary
input_features:
- name: TransactionDT
type: number
- name: TransactionAmt
type: number
- name: ProductCD
type: category
- name: card1
type: number
- name: card2
type: number
- name: card3
type: number
- name: card4
type: category
- name: card5
type: number
- name: card6
type: category
- name: addr1
type: number
- name: addr2
type: number
- name: dist1
type: number
- name: dist2
type: number
- name: P_emaildomain
type: category
- name: R_emaildomain
type: category
- name: C1
type: number
- name: C2
type: number
- name: C3
type: number
- name: C4
type: number
- name: C5
type: number
- name: C6
type: number
- name: C7
type: number
- name: C8
type: number
- name: C9
type: number
- name: C10
type: number
- name: C11
type: number
- name: C12
type: number
- name: C13
type: number
- name: C14
type: number
- name: D1
type: number
- name: D2
type: number
- name: D3
type: number
- name: D4
type: number
- name: D5
type: number
- name: D6
type: number
- name: D7
type: number
- name: D8
type: number
- name: D9
type: number
- name: D10
type: number
- name: D11
type: number
- name: D12
type: number
- name: D13
type: number
- name: D14
type: number
- name: D15
type: number
- name: M1
type: category
- name: M2
type: category
- name: M3
type: category
- name: M4
type: category
- name: M5
type: category
- name: M6
type: category
- name: M7
type: category
- name: M8
type: category
- name: M9
type: category
- name: V1
type: number
- name: V2
type: number
- name: V3
type: number
- name: V4
type: number
- name: V5
type: number
- name: V6
type: number
- name: V7
type: number
- name: V8
type: number
- name: V9
type: number
- name: V10
type: number
- name: V11
type: number
- name: V12
type: number
- name: V13
type: number
- name: V14
type: number
- name: V15
type: number
- name: V16
type: number
- name: V17
type: number
- name: V18
type: number
- name: V19
type: number
- name: V20
type: number
- name: V21
type: number
- name: V22
type: number
- name: V23
type: number
- name: V24
type: number
- name: V25
type: number
- name: V26
type: number
- name: V27
type: number
- name: V28
type: number
- name: V29
type: number
- name: V30
type: number
- name: V31
type: number
- name: V32
type: number
- name: V33
type: number
- name: V34
type: number
- name: V35
type: number
- name: V36
type: number
- name: V37
type: number
- name: V38
type: number
- name: V39
type: number
- name: V40
type: number
- name: V41
type: number
- name: V42
type: number
- name: V43
type: number
- name: V44
type: number
- name: V45
type: number
- name: V46
type: number
- name: V47
type: number
- name: V48
type: number
- name: V49
type: number
- name: V50
type: number
- name: V51
type: number
- name: V52
type: number
- name: V53
type: number
- name: V54
type: number
- name: V55
type: number
- name: V56
type: number
- name: V57
type: number
- name: V58
type: number
- name: V59
type: number
- name: V60
type: number
- name: V61
type: number
- name: V62
type: number
- name: V63
type: number
- name: V64
type: number
- name: V65
type: number
- name: V66
type: number
- name: V67
type: number
- name: V68
type: number
- name: V69
type: number
- name: V70
type: number
- name: V71
type: number
- name: V72
type: number
- name: V73
type: number
- name: V74
type: number
- name: V75
type: number
- name: V76
type: number
- name: V77
type: number
- name: V78
type: number
- name: V79
type: number
- name: V80
type: number
- name: V81
type: number
- name: V82
type: number
- name: V83
type: number
- name: V84
type: number
- name: V85
type: number
- name: V86
type: number
- name: V87
type: number
- name: V88
type: number
- name: V89
type: number
- name: V90
type: number
- name: V91
type: number
- name: V92
type: number
- name: V93
type: number
- name: V94
type: number
- name: V95
type: number
- name: V96
type: number
- name: V97
type: number
- name: V98
type: number
- name: V99
type: number
- name: V100
type: number
- name: V101
type: number
- name: V102
type: number
- name: V103
type: number
- name: V104
type: number
- name: V105
type: number
- name: V106
type: number
- name: V107
type: number
- name: V108
type: number
- name: V109
type: number
- name: V110
type: number
- name: V111
type: number
- name: V112
type: number
- name: V113
type: number
- name: V114
type: number
- name: V115
type: number
- name: V116
type: number
- name: V117
type: number
- name: V118
type: number
- name: V119
type: number
- name: V120
type: number
- name: V121
type: number
- name: V122
type: number
- name: V123
type: number
- name: V124
type: number
- name: V125
type: number
- name: V126
type: number
- name: V127
type: number
- name: V128
type: number
- name: V129
type: number
- name: V130
type: number
- name: V131
type: number
- name: V132
type: number
- name: V133
type: number
- name: V134
type: number
- name: V135
type: number
- name: V136
type: number
- name: V137
type: number
- name: V138
type: number
- name: V139
type: number
- name: V140
type: number
- name: V141
type: number
- name: V142
type: number
- name: V143
type: number
- name: V144
type: number
- name: V145
type: number
- name: V146
type: number
- name: V147
type: number
- name: V148
type: number
- name: V149
type: number
- name: V150
type: number
- name: V151
type: number
- name: V152
type: number
- name: V153
type: number
- name: V154
type: number
- name: V155
type: number
- name: V156
type: number
- name: V157
type: number
- name: V158
type: number
- name: V159
type: number
- name: V160
type: number
- name: V161
type: number
- name: V162
type: number
- name: V163
type: number
- name: V164
type: number
- name: V165
type: number
- name: V166
type: number
- name: V167
type: number
- name: V168
type: number
- name: V169
type: number
- name: V170
type: number
- name: V171
type: number
- name: V172
type: number
- name: V173
type: number
- name: V174
type: number
- name: V175
type: number
- name: V176
type: number
- name: V177
type: number
- name: V178
type: number
- name: V179
type: number
- name: V180
type: number
- name: V181
type: number
- name: V182
type: number
- name: V183
type: number
- name: V184
type: number
- name: V185
type: number
- name: V186
type: number
- name: V187
type: number
- name: V188
type: number
- name: V189
type: number
- name: V190
type: number
- name: V191
type: number
- name: V192
type: number
- name: V193
type: number
- name: V194
type: number
- name: V195
type: number
- name: V196
type: number
- name: V197
type: number
- name: V198
type: number
- name: V199
type: number
- name: V200
type: number
- name: V201
type: number
- name: V202
type: number
- name: V203
type: number
- name: V204
type: number
- name: V205
type: number
- name: V206
type: number
- name: V207
type: number
- name: V208
type: number
- name: V209
type: number
- name: V210
type: number
- name: V211
type: number
- name: V212
type: number
- name: V213
type: number
- name: V214
type: number
- name: V215
type: number
- name: V216
type: number
- name: V217
type: number
- name: V218
type: number
- name: V219
type: number
- name: V220
type: number
- name: V221
type: number
- name: V222
type: number
- name: V223
type: number
- name: V224
type: number
- name: V225
type: number
- name: V226
type: number
- name: V227
type: number
- name: V228
type: number
- name: V229
type: number
- name: V230
type: number
- name: V231
type: number
- name: V232
type: number
- name: V233
type: number
- name: V234
type: number
- name: V235
type: number
- name: V236
type: number
- name: V237
type: number
- name: V238
type: number
- name: V239
type: number
- name: V240
type: number
- name: V241
type: number
- name: V242
type: number
- name: V243
type: number
- name: V244
type: number
- name: V245
type: number
- name: V246
type: number
- name: V247
type: number
- name: V248
type: number
- name: V249
type: number
- name: V250
type: number
- name: V251
type: number
- name: V252
type: number
- name: V253
type: number
- name: V254
type: number
- name: V255
type: number
- name: V256
type: number
- name: V257
type: number
- name: V258
type: number
- name: V259
type: number
- name: V260
type: number
- name: V261
type: number
- name: V262
type: number
- name: V263
type: number
- name: V264
type: number
- name: V265
type: number
- name: V266
type: number
- name: V267
type: number
- name: V268
type: number
- name: V269
type: number
- name: V270
type: number
- name: V271
type: number
- name: V272
type: number
- name: V273
type: number
- name: V274
type: number
- name: V275
type: number
- name: V276
type: number
- name: V277
type: number
- name: V278
type: number
- name: V279
type: number
- name: V280
type: number
- name: V281
type: number
- name: V282
type: number
- name: V283
type: number
- name: V284
type: number
- name: V285
type: number
- name: V286
type: number
- name: V287
type: number
- name: V288
type: number
- name: V289
type: number
- name: V290
type: number
- name: V291
type: number
- name: V292
type: number
- name: V293
type: number
- name: V294
type: number
- name: V295
type: number
- name: V296
type: number
- name: V297
type: number
- name: V298
type: number
- name: V299
type: number
- name: V300
type: number
- name: V301
type: number
- name: V302
type: number
- name: V303
type: number
- name: V304
type: number
- name: V305
type: number
- name: V306
type: number
- name: V307
type: number
- name: V308
type: number
- name: V309
type: number
- name: V310
type: number
- name: V311
type: number
- name: V312
type: number
- name: V313
type: number
- name: V314
type: number
- name: V315
type: number
- name: V316
type: number
- name: V317
type: number
- name: V318
type: number
- name: V319
type: number
- name: V320
type: number
- name: V321
type: number
- name: V322
type: number
- name: V323
type: number
- name: V324
type: number
- name: V325
type: number
- name: V326
type: number
- name: V327
type: number
- name: V328
type: number
- name: V329
type: number
- name: V330
type: number
- name: V331
type: number
- name: V332
type: number
- name: V333
type: number
- name: V334
type: number
- name: V335
type: number
- name: V336
type: number
- name: V337
type: number
- name: V338
type: number
- name: V339
type: number
- name: id_01
type: number
- name: id_02
type: number
- name: id_03
type: number
- name: id_04
type: number
- name: id_05
type: number
- name: id_06
type: number
- name: id_07
type: number
- name: id_08
type: number
- name: id_09
type: number
- name: id_10
type: number
- name: id_11
type: number
- name: id_12
type: category
- name: id_13
type: number
- name: id_14
type: number
- name: id_15
type: category
- name: id_16
type: category
- name: id_17
type: number
- name: id_18
type: number
- name: id_19
type: number
- name: id_20
type: number
- name: id_21
type: number
- name: id_22
type: number
- name: id_23
type: category
- name: id_24
type: number
- name: id_25
type: number
- name: id_26
type: number
- name: id_27
type: category
- name: id_28
type: category
- name: id_29
type: category
- name: id_30
type: category
- name: id_31
type: text
- name: id_32
type: number
- name: id_33
type: category
- name: id_34
type: category
- name: id_35
type: category
- name: id_36
type: category
- name: id_37
type: category
- name: id_38
type: category
- name: DeviceType
type: category
- name: DeviceInfo
type: category
combiner:
type: tabnet
size: 128 # N_a
output_size: 24 # N_d
sparsity: 0.000001 # lambda_sparse
bn_momentum: 0.02 # m_B
num_steps: 10 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 2048 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.01
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 10000
decay_rate: 0.95
validation_metric: accuracy
- name: mercedes_benz_greener
goal: minimize
metric: root_mean_squared_error
validation_metric_score: 7.685836315155029
training_rows: 2969
test_rows: 840
validation_rows: 400
config:
output_features:
- name: y
type: number
input_features:
- name: X0
type: category
- name: X1
type: category
- name: X2
type: category
- name: X3
type: category
- name: X4
type: category
- name: X5
type: category
- name: X6
type: category
- name: X8
type: category
- name: X10
type: binary
- name: X11
type: binary
- name: X12
type: binary
- name: X13
type: binary
- name: X14
type: binary
- name: X15
type: binary
- name: X16
type: binary
- name: X17
type: binary
- name: X18
type: binary
- name: X19
type: binary
- name: X20
type: binary
- name: X21
type: binary
- name: X22
type: binary
- name: X23
type: binary
- name: X24
type: binary
- name: X26
type: binary
- name: X27
type: binary
- name: X28
type: binary
- name: X29
type: binary
- name: X30
type: binary
- name: X31
type: binary
- name: X32
type: binary
- name: X33
type: binary
- name: X34
type: binary
- name: X35
type: binary
- name: X36
type: binary
- name: X37
type: binary
- name: X38
type: binary
- name: X39
type: binary
- name: X40
type: binary
- name: X41
type: binary
- name: X42
type: binary
- name: X43
type: binary
- name: X44
type: binary
- name: X45
type: binary
- name: X46
type: binary
- name: X47
type: binary
- name: X48
type: binary
- name: X49
type: binary
- name: X50
type: binary
- name: X51
type: binary
- name: X52
type: binary
- name: X53
type: binary
- name: X54
type: binary
- name: X55
type: binary
- name: X56
type: binary
- name: X57
type: binary
- name: X58
type: binary
- name: X59
type: binary
- name: X60
type: binary
- name: X61
type: binary
- name: X62
type: binary
- name: X63
type: binary
- name: X64
type: binary
- name: X65
type: binary
- name: X66
type: binary
- name: X67
type: binary
- name: X68
type: binary
- name: X69
type: binary
- name: X70
type: binary
- name: X71
type: binary
- name: X73
type: binary
- name: X74
type: binary
- name: X75
type: binary
- name: X76
type: binary
- name: X77
type: binary
- name: X78
type: binary
- name: X79
type: binary
- name: X80
type: binary
- name: X81
type: binary
- name: X82
type: binary
- name: X83
type: binary
- name: X84
type: binary
- name: X85
type: binary
- name: X86
type: binary
- name: X87
type: binary
- name: X88
type: binary
- name: X89
type: binary
- name: X90
type: binary
- name: X91
type: binary
- name: X92
type: binary
- name: X93
type: binary
- name: X94
type: binary
- name: X95
type: binary
- name: X96
type: binary
- name: X97
type: binary
- name: X98
type: binary
- name: X99
type: binary
- name: X100
type: binary
- name: X101
type: binary
- name: X102
type: binary
- name: X103
type: binary
- name: X104
type: binary
- name: X105
type: binary
- name: X106
type: binary
- name: X107
type: binary
- name: X108
type: binary
- name: X109
type: binary
- name: X110
type: binary
- name: X111
type: binary
- name: X112
type: binary
- name: X113
type: binary
- name: X114
type: binary
- name: X115
type: binary
- name: X116
type: binary
- name: X117
type: binary
- name: X118
type: binary
- name: X119
type: binary
- name: X120
type: binary
- name: X122
type: binary
- name: X123
type: binary
- name: X124
type: binary
- name: X125
type: binary
- name: X126
type: binary
- name: X127
type: binary
- name: X128
type: binary
- name: X129
type: binary
- name: X130
type: binary
- name: X131
type: binary
- name: X132
type: binary
- name: X133
type: binary
- name: X134
type: binary
- name: X135
type: binary
- name: X136
type: binary
- name: X137
type: binary
- name: X138
type: binary
- name: X139
type: binary
- name: X140
type: binary
- name: X141
type: binary
- name: X142
type: binary
- name: X143
type: binary
- name: X144
type: binary
- name: X145
type: binary
- name: X146
type: binary
- name: X147
type: binary
- name: X148
type: binary
- name: X150
type: binary
- name: X151
type: binary
- name: X152
type: binary
- name: X153
type: binary
- name: X154
type: binary
- name: X155
type: binary
- name: X156
type: binary
- name: X157
type: binary
- name: X158
type: binary
- name: X159
type: binary
- name: X160
type: binary
- name: X161
type: binary
- name: X162
type: binary
- name: X163
type: binary
- name: X164
type: binary
- name: X165
type: binary
- name: X166
type: binary
- name: X167
type: binary
- name: X168
type: binary
- name: X169
type: binary
- name: X170
type: binary
- name: X171
type: binary
- name: X172
type: binary
- name: X173
type: binary
- name: X174
type: binary
- name: X175
type: binary
- name: X176
type: binary
- name: X177
type: binary
- name: X178
type: binary
- name: X179
type: binary
- name: X180
type: binary
- name: X181
type: binary
- name: X182
type: binary
- name: X183
type: binary
- name: X184
type: binary
- name: X185
type: binary
- name: X186
type: binary
- name: X187
type: binary
- name: X189
type: binary
- name: X190
type: binary
- name: X191
type: binary
- name: X192
type: binary
- name: X194
type: binary
- name: X195
type: binary
- name: X196
type: binary
- name: X197
type: binary
- name: X198
type: binary
- name: X199
type: binary
- name: X200
type: binary
- name: X201
type: binary
- name: X202
type: binary
- name: X203
type: binary
- name: X204
type: binary
- name: X205
type: binary
- name: X206
type: binary
- name: X207
type: binary
- name: X208
type: binary
- name: X209
type: binary
- name: X210
type: binary
- name: X211
type: binary
- name: X212
type: binary
- name: X213
type: binary
- name: X214
type: binary
- name: X215
type: binary
- name: X216
type: binary
- name: X217
type: binary
- name: X218
type: binary
- name: X219
type: binary
- name: X220
type: binary
- name: X221
type: binary
- name: X222
type: binary
- name: X223
type: binary
- name: X224
type: binary
- name: X225
type: binary
- name: X226
type: binary
- name: X227
type: binary
- name: X228
type: binary
- name: X229
type: binary
- name: X230
type: binary
- name: X231
type: binary
- name: X232
type: binary
- name: X233
type: binary
- name: X234
type: binary
- name: X235
type: binary
- name: X236
type: binary
- name: X237
type: binary
- name: X238
type: binary
- name: X239
type: binary
- name: X240
type: binary
- name: X241
type: binary
- name: X242
type: binary
- name: X243
type: binary
- name: X244
type: binary
- name: X245
type: binary
- name: X246
type: binary
- name: X247
type: binary
- name: X248
type: binary
- name: X249
type: binary
- name: X250
type: binary
- name: X251
type: binary
- name: X252
type: binary
- name: X253
type: binary
- name: X254
type: binary
- name: X255
type: binary
- name: X256
type: binary
- name: X257
type: binary
- name: X258
type: binary
- name: X259
type: binary
- name: X260
type: binary
- name: X261
type: binary
- name: X262
type: binary
- name: X263
type: binary
- name: X264
type: binary
- name: X265
type: binary
- name: X266
type: binary
- name: X267
type: binary
- name: X268
type: binary
- name: X269
type: binary
- name: X270
type: binary
- name: X271
type: binary
- name: X272
type: binary
- name: X273
type: binary
- name: X274
type: binary
- name: X275
type: binary
- name: X276
type: binary
- name: X277
type: binary
- name: X278
type: binary
- name: X279
type: binary
- name: X280
type: binary
- name: X281
type: binary
- name: X282
type: binary
- name: X283
type: binary
- name: X284
type: binary
- name: X285
type: binary
- name: X286
type: binary
- name: X287
type: binary
- name: X288
type: binary
- name: X289
type: binary
- name: X290
type: binary
- name: X291
type: binary
- name: X292
type: binary
- name: X293
type: binary
- name: X294
type: binary
- name: X295
type: binary
- name: X296
type: binary
- name: X297
type: binary
- name: X298
type: binary
- name: X299
type: binary
- name: X300
type: binary
- name: X301
type: binary
- name: X302
type: binary
- name: X304
type: binary
- name: X305
type: binary
- name: X306
type: binary
- name: X307
type: binary
- name: X308
type: binary
- name: X309
type: binary
- name: X310
type: binary
- name: X311
type: binary
- name: X312
type: binary
- name: X313
type: binary
- name: X314
type: binary
- name: X315
type: binary
- name: X316
type: binary
- name: X317
type: binary
- name: X318
type: binary
- name: X319
type: binary
- name: X320
type: binary
- name: X321
type: binary
- name: X322
type: binary
- name: X323
type: binary
- name: X324
type: binary
- name: X325
type: binary
- name: X326
type: binary
- name: X327
type: binary
- name: X328
type: binary
- name: X329
type: binary
- name: X330
type: binary
- name: X331
type: binary
- name: X332
type: binary
- name: X333
type: binary
- name: X334
type: binary
- name: X335
type: binary
- name: X336
type: binary
- name: X337
type: binary
- name: X338
type: binary
- name: X339
type: binary
- name: X340
type: binary
- name: X341
type: binary
- name: X342
type: binary
- name: X343
type: binary
- name: X344
type: binary
- name: X345
type: binary
- name: X346
type: binary
- name: X347
type: binary
- name: X348
type: binary
- name: X349
type: binary
- name: X350
type: binary
- name: X351
type: binary
- name: X352
type: binary
- name: X353
type: binary
- name: X354
type: binary
- name: X355
type: binary
- name: X356
type: binary
- name: X357
type: binary
- name: X358
type: binary
- name: X359
type: binary
- name: X360
type: binary
- name: X361
type: binary
- name: X362
type: binary
- name: X363
type: binary
- name: X364
type: binary
- name: X365
type: binary
- name: X366
type: binary
- name: X367
type: binary
- name: X368
type: binary
- name: X369
type: binary
- name: X370
type: binary
- name: X371
type: binary
- name: X372
type: binary
- name: X373
type: binary
- name: X374
type: binary
- name: X375
type: binary
- name: X376
type: binary
- name: X377
type: binary
- name: X378
type: binary
- name: X379
type: binary
- name: X380
type: binary
- name: X382
type: binary
- name: X383
type: binary
- name: X384
type: binary
- name: X385
type: binary
combiner:
type: tabnet
size: 128 # N_a
output_size: 8 # N_d
sparsity: 0.1 # lambda_sparse
bn_momentum: 0.1 # m_B
num_steps: 9 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 256 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.005
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 500
decay_rate: 0.95
validation_metric: root_mean_squared_error
- name: otto_group_product
goal: maximize
metric: accuracy
validation_metric_score: 0.7956883907318115
training_rows: 43459
test_rows: 12296
validation_rows: 6123
config:
output_features:
- name: target
type: category
input_features:
- name: feat_1
type: number
- name: feat_2
type: number
- name: feat_3
type: number
- name: feat_4
type: number
- name: feat_5
type: number
- name: feat_6
type: number
- name: feat_7
type: number
- name: feat_8
type: number
- name: feat_9
type: number
- name: feat_10
type: number
- name: feat_11
type: number
- name: feat_12
type: number
- name: feat_13
type: number
- name: feat_14
type: number
- name: feat_15
type: number
- name: feat_16
type: number
- name: feat_17
type: number
- name: feat_18
type: number
- name: feat_19
type: number
- name: feat_20
type: number
- name: feat_21
type: category
- name: feat_22
type: number
- name: feat_23
type: number
- name: feat_24
type: number
- name: feat_25
type: number
- name: feat_26
type: number
- name: feat_27
type: number
- name: feat_28
type: number
- name: feat_29
type: number
- name: feat_30
type: number
- name: feat_31
type: number
- name: feat_32
type: number
- name: feat_33
type: number
- name: feat_34
type: number
- name: feat_35
type: number
- name: feat_36
type: number
- name: feat_37
type: number
- name: feat_38
type: number
- name: feat_39
type: number
- name: feat_40
type: number
- name: feat_41
type: number
- name: feat_42
type: number
- name: feat_43
type: number
- name: feat_44
type: number
- name: feat_45
type: number
- name: feat_46
type: number
- name: feat_47
type: number
- name: feat_48
type: number
- name: feat_49
type: number
- name: feat_50
type: number
- name: feat_51
type: number
- name: feat_52
type: number
- name: feat_53
type: number
- name: feat_54
type: number
- name: feat_55
type: number
- name: feat_56
type: number
- name: feat_57
type: number
- name: feat_58
type: number
- name: feat_59
type: number
- name: feat_60
type: number
- name: feat_61
type: number
- name: feat_62
type: number
- name: feat_63
type: number
- name: feat_64
type: number
- name: feat_65
type: number
- name: feat_66
type: number
- name: feat_67
type: number
- name: feat_68
type: number
- name: feat_69
type: number
- name: feat_70
type: number
- name: feat_71
type: number
- name: feat_72
type: number
- name: feat_73
type: number
- name: feat_74
type: number
- name: feat_75
type: number
- name: feat_76
type: number
- name: feat_77
type: number
- name: feat_78
type: number
- name: feat_79
type: number
- name: feat_80
type: number
- name: feat_81
type: number
- name: feat_82
type: number
- name: feat_83
type: number
- name: feat_84
type: number
- name: feat_85
type: number
- name: feat_86
type: number
- name: feat_87
type: number
- name: feat_88
type: number
- name: feat_89
type: number
- name: feat_90
type: number
- name: feat_91
type: number
- name: feat_92
type: number
- name: feat_93
type: number
combiner:
type: tabnet
size: 128 # N_a
output_size: 128 # N_d
sparsity: 0.0 # lambda_sparse
bn_momentum: 0.2 # m_B
num_steps: 3 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 512 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.005
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 20000
decay_rate: 0.4
validation_metric: accuracy
- name: poker_hand
goal: maximize
metric: accuracy
validation_metric_score: 0.9804078340530396
training_rows: 22509
test_rows: 0
validation_rows: 2501
config:
output_features:
- name: hand
type: category
input_features:
- name: S1
type: number
- name: C1
type: number
- name: S2
type: number
- name: C2
type: number
- name: S3
type: number
- name: C3
type: number
- name: S4
type: number
- name: C4
type: number
- name: S5
type: number
- name: C5
type: number
combiner:
type: tabnet
size: 16 # N_a
output_size: 128 # N_d
sparsity: 0.0 # lambda_sparse
bn_momentum: 0.02 # m_B
num_steps: 6 # N_steps
relaxation_factor: 1.0 # gamma
bn_virtual_bs: 512 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.01
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 8000
decay_rate: 0.8
validation_metric: accuracy
- name: porto_seguro_safe_driver
goal: maximize
metric: accuracy
validation_metric_score: 0.9630663394927979
training_rows: 416779
test_rows: 118948
validation_rows: 59485
config:
output_features:
- name: target
type: binary
input_features:
- name: ps_ind_01
type: category
- name: ps_ind_02_cat
type: number
- name: ps_ind_03
type: category
- name: ps_ind_04_cat
type: category
- name: ps_ind_05_cat
type: category
- name: ps_ind_06_bin
type: binary
- name: ps_ind_07_bin
type: binary
- name: ps_ind_08_bin
type: binary
- name: ps_ind_09_bin
type: binary
- name: ps_ind_10_bin
type: binary
- name: ps_ind_11_bin
type: binary
- name: ps_ind_12_bin
type: binary
- name: ps_ind_13_bin
type: binary
- name: ps_ind_14
type: category
- name: ps_ind_15
type: category
- name: ps_ind_16_bin
type: binary
- name: ps_ind_17_bin
type: binary
- name: ps_ind_18_bin
type: binary
- name: ps_reg_01
type: number
- name: ps_reg_02
type: number
- name: ps_reg_03
type: number
- name: ps_car_01_cat
type: category
- name: ps_car_02_cat
type: category
- name: ps_car_03_cat
type: category
- name: ps_car_04_cat
type: category
- name: ps_car_05_cat
type: category
- name: ps_car_06_cat
type: category
- name: ps_car_07_cat
type: category
- name: ps_car_08_cat
type: binary
- name: ps_car_09_cat
type: category
- name: ps_car_10_cat
type: category
- name: ps_car_11_cat
type: number
- name: ps_car_11
type: category
- name: ps_car_12
type: number
- name: ps_car_13
type: number
- name: ps_car_14
type: number
- name: ps_car_15
type: number
- name: ps_calc_01
type: number
- name: ps_calc_02
type: number
- name: ps_calc_03
type: number
- name: ps_calc_04
type: category
- name: ps_calc_05
type: category
- name: ps_calc_06
type: category
- name: ps_calc_07
type: category
- name: ps_calc_08
type: category
- name: ps_calc_09
type: category
- name: ps_calc_10
type: number
- name: ps_calc_11
type: number
- name: ps_calc_12
type: category
- name: ps_calc_13
type: category
- name: ps_calc_14
type: number
- name: ps_calc_15_bin
type: binary
- name: ps_calc_16_bin
type: binary
- name: ps_calc_17_bin
type: binary
- name: ps_calc_18_bin
type: binary
- name: ps_calc_19_bin
type: binary
- name: ps_calc_20_bin
type: binary
combiner:
type: tabnet
size: 32 # N_a
output_size: 32 # N_d
sparsity: 0.0001 # lambda_sparse
bn_momentum: 0.4 # m_B
num_steps: 5 # N_steps
relaxation_factor: 1.2 # gamma
bn_virtual_bs: 1024 # B_v
trainer:
batch_size: 1024 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.005
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 10000
decay_rate: 0.9
validation_metric: accuracy
- name: santander_customer_satisfaction
goal: maximize
metric: accuracy
validation_metric_score: 0.9611535668373108
training_rows: 53298
test_rows: 15128
validation_rows: 7594
config:
output_features:
- name: TARGET
type: binary
input_features:
- name: var3
type: number
- name: var15
type: number
- name: imp_ent_var16_ult1
type: number
- name: imp_op_var39_comer_ult1
type: number
- name: imp_op_var39_comer_ult3
type: number
- name: imp_op_var40_comer_ult1
type: number
- name: imp_op_var40_comer_ult3
type: number
- name: imp_op_var40_efect_ult1
type: number
- name: imp_op_var40_efect_ult3
type: number
- name: imp_op_var40_ult1
type: number
- name: imp_op_var41_comer_ult1
type: number
- name: imp_op_var41_comer_ult3
type: number
- name: imp_op_var41_efect_ult1
type: number
- name: imp_op_var41_efect_ult3
type: number
- name: imp_op_var41_ult1
type: number
- name: imp_op_var39_efect_ult1
type: number
- name: imp_op_var39_efect_ult3
type: number
- name: imp_op_var39_ult1
type: number
- name: imp_sal_var16_ult1
type: number
- name: ind_var1_0
type: binary
- name: ind_var1
type: binary
- name: ind_var2_0
type: binary
- name: ind_var2
type: binary
- name: ind_var5_0
type: binary
- name: ind_var5
type: binary
- name: ind_var6_0
type: binary
- name: ind_var6
type: binary
- name: ind_var8_0
type: binary
- name: ind_var8
type: binary
- name: ind_var12_0
type: binary
- name: ind_var12
type: binary
- name: ind_var13_0
type: binary
- name: ind_var13_corto_0
type: binary
- name: ind_var13_corto
type: binary
- name: ind_var13_largo_0
type: binary
- name: ind_var13_largo
type: binary
- name: ind_var13_medio_0
type: binary
- name: ind_var13_medio
type: binary
- name: ind_var13
type: binary
- name: ind_var14_0
type: binary
- name: ind_var14
type: binary
- name: ind_var17_0
type: binary
- name: ind_var17
type: binary
- name: ind_var18_0
type: binary
- name: ind_var18
type: binary
- name: ind_var19
type: binary
- name: ind_var20_0
type: binary
- name: ind_var20
type: binary
- name: ind_var24_0
type: binary
- name: ind_var24
type: binary
- name: ind_var25_cte
type: binary
- name: ind_var26_0
type: binary
- name: ind_var26_cte
type: binary
- name: ind_var26
type: binary
- name: ind_var25_0
type: binary
- name: ind_var25
type: binary
- name: ind_var27_0
type: binary
- name: ind_var28_0
type: binary
- name: ind_var28
type: binary
- name: ind_var27
type: binary
- name: ind_var29_0
type: binary
- name: ind_var29
type: binary
- name: ind_var30_0
type: binary
- name: ind_var30
type: binary
- name: ind_var31_0
type: binary
- name: ind_var31
type: binary
- name: ind_var32_cte
type: binary
- name: ind_var32_0
type: binary
- name: ind_var32
type: binary
- name: ind_var33_0
type: binary
- name: ind_var33
type: binary
- name: ind_var34_0
type: binary
- name: ind_var34
type: binary
- name: ind_var37_cte
type: binary
- name: ind_var37_0
type: binary
- name: ind_var37
type: binary
- name: ind_var39_0
type: binary
- name: ind_var40_0
type: binary
- name: ind_var40
type: binary
- name: ind_var41_0
type: binary
- name: ind_var41
type: binary
- name: ind_var39
type: binary
- name: ind_var44_0
type: binary
- name: ind_var44
type: binary
- name: ind_var46_0
type: binary
- name: ind_var46
type: binary
- name: num_var1_0
type: number
- name: num_var1
type: number
- name: num_var4
type: category
- name: num_var5_0
type: number
- name: num_var5
type: number
- name: num_var6_0
type: number
- name: num_var6
type: number
- name: num_var8_0
type: number
- name: num_var8
type: number
- name: num_var12_0
type: number
- name: num_var12
type: number
- name: num_var13_0
type: number
- name: num_var13_corto_0
type: number
- name: num_var13_corto
type: number
- name: num_var13_largo_0
type: number
- name: num_var13_largo
type: number
- name: num_var13_medio_0
type: number
- name: num_var13_medio
type: number
- name: num_var13
type: number
- name: num_var14_0
type: number
- name: num_var14
type: number
- name: num_var17_0
type: number
- name: num_var17
type: number
- name: num_var18_0
type: number
- name: num_var18
type: number
- name: num_var20_0
type: number
- name: num_var20
type: number
- name: num_var24_0
type: number
- name: num_var24
type: number
- name: num_var26_0
type: number
- name: num_var26
type: number
- name: num_var25_0
type: number
- name: num_var25
type: number
- name: num_op_var40_hace2
type: number
- name: num_op_var40_hace3
type: number
- name: num_op_var40_ult1
type: number
- name: num_op_var40_ult3
type: number
- name: num_op_var41_hace2
type: number
- name: num_op_var41_hace3
type: number
- name: num_op_var41_ult1
type: number
- name: num_op_var41_ult3
type: number
- name: num_op_var39_hace2
type: number
- name: num_op_var39_hace3
type: number
- name: num_op_var39_ult1
type: number
- name: num_op_var39_ult3
type: number
- name: num_var27_0
type: binary
- name: num_var28_0
type: binary
- name: num_var28
type: binary
- name: num_var27
type: binary
- name: num_var29_0
type: number
- name: num_var29
type: number
- name: num_var30_0
type: number
- name: num_var30
type: number
- name: num_var31_0
type: number
- name: num_var31
type: number
- name: num_var32_0
type: number
- name: num_var32
type: number
- name: num_var33_0
type: number
- name: num_var33
type: number
- name: num_var34_0
type: number
- name: num_var34
type: number
- name: num_var35
type: number
- name: num_var37_med_ult2
type: number
- name: num_var37_0
type: number
- name: num_var37
type: number
- name: num_var39_0
type: number
- name: num_var40_0
type: number
- name: num_var40
type: number
- name: num_var41_0
type: number
- name: num_var41
type: binary
- name: num_var39
type: number
- name: num_var42_0
type: number
- name: num_var42
type: number
- name: num_var44_0
type: number
- name: num_var44
type: number
- name: num_var46_0
type: binary
- name: num_var46
type: binary
- name: saldo_var1
type: number
- name: saldo_var5
type: number
- name: saldo_var6
type: number
- name: saldo_var8
type: number
- name: saldo_var12
type: number
- name: saldo_var13_corto
type: number
- name: saldo_var13_largo
type: number
- name: saldo_var13_medio
type: number
- name: saldo_var13
type: number
- name: saldo_var14
type: number
- name: saldo_var17
type: number
- name: saldo_var18
type: number
- name: saldo_var20
type: number
- name: saldo_var24
type: number
- name: saldo_var26
type: number
- name: saldo_var25
type: number
- name: saldo_var28
type: binary
- name: saldo_var27
type: binary
- name: saldo_var29
type: number
- name: saldo_var30
type: number
- name: saldo_var31
type: number
- name: saldo_var32
type: number
- name: saldo_var33
type: number
- name: saldo_var34
type: number
- name: saldo_var37
type: number
- name: saldo_var40
type: number
- name: saldo_var41
type: binary
- name: saldo_var42
type: number
- name: saldo_var44
type: number
- name: saldo_var46
type: binary
- name: var36
type: number
- name: delta_imp_amort_var18_1y3
type: number
- name: delta_imp_amort_var34_1y3
type: number
- name: delta_imp_aport_var13_1y3
type: number
- name: delta_imp_aport_var17_1y3
type: number
- name: delta_imp_aport_var33_1y3
type: number
- name: delta_imp_compra_var44_1y3
type: number
- name: delta_imp_reemb_var13_1y3
type: number
- name: delta_imp_reemb_var17_1y3
type: number
- name: delta_imp_reemb_var33_1y3
type: number
- name: delta_imp_trasp_var17_in_1y3
type: number
- name: delta_imp_trasp_var17_out_1y3
type: number
- name: delta_imp_trasp_var33_in_1y3
type: number
- name: delta_imp_trasp_var33_out_1y3
type: number
- name: delta_imp_venta_var44_1y3
type: number
- name: delta_num_aport_var13_1y3
type: number
- name: delta_num_aport_var17_1y3
type: number
- name: delta_num_aport_var33_1y3
type: number
- name: delta_num_compra_var44_1y3
type: number
- name: delta_num_reemb_var13_1y3
type: number
- name: delta_num_reemb_var17_1y3
type: number
- name: delta_num_reemb_var33_1y3
type: number
- name: delta_num_trasp_var17_in_1y3
type: number
- name: delta_num_trasp_var17_out_1y3
type: number
- name: delta_num_trasp_var33_in_1y3
type: number
- name: delta_num_trasp_var33_out_1y3
type: number
- name: delta_num_venta_var44_1y3
type: number
- name: imp_amort_var18_hace3
type: binary
- name: imp_amort_var18_ult1
type: number
- name: imp_amort_var34_hace3
type: binary
- name: imp_amort_var34_ult1
type: number
- name: imp_aport_var13_hace3
type: number
- name: imp_aport_var13_ult1
type: number
- name: imp_aport_var17_hace3
type: number
- name: imp_aport_var17_ult1
type: number
- name: imp_aport_var33_hace3
type: number
- name: imp_aport_var33_ult1
type: number
- name: imp_var7_emit_ult1
type: number
- name: imp_var7_recib_ult1
type: number
- name: imp_compra_var44_hace3
type: number
- name: imp_compra_var44_ult1
type: number
- name: imp_reemb_var13_hace3
type: binary
- name: imp_reemb_var13_ult1
type: number
- name: imp_reemb_var17_hace3
type: number
- name: imp_reemb_var17_ult1
type: number
- name: imp_reemb_var33_hace3
type: binary
- name: imp_reemb_var33_ult1
type: number
- name: imp_var43_emit_ult1
type: number
- name: imp_trans_var37_ult1
type: number
- name: imp_trasp_var17_in_hace3
type: number
- name: imp_trasp_var17_in_ult1
type: number
- name: imp_trasp_var17_out_hace3
type: binary
- name: imp_trasp_var17_out_ult1
type: number
- name: imp_trasp_var33_in_hace3
type: number
- name: imp_trasp_var33_in_ult1
type: number
- name: imp_trasp_var33_out_hace3
type: binary
- name: imp_trasp_var33_out_ult1
type: number
- name: imp_venta_var44_hace3
type: number
- name: imp_venta_var44_ult1
type: number
- name: ind_var7_emit_ult1
type: binary
- name: ind_var7_recib_ult1
type: binary
- name: ind_var10_ult1
type: binary
- name: ind_var10cte_ult1
type: binary
- name: ind_var9_cte_ult1
type: binary
- name: ind_var9_ult1
type: binary
- name: ind_var43_emit_ult1
type: binary
- name: ind_var43_recib_ult1
type: binary
- name: var21
type: number
- name: num_var2_0_ult1
type: binary
- name: num_var2_ult1
type: binary
- name: num_aport_var13_hace3
type: number
- name: num_aport_var13_ult1
type: number
- name: num_aport_var17_hace3
type: number
- name: num_aport_var17_ult1
type: number
- name: num_aport_var33_hace3
type: number
- name: num_aport_var33_ult1
type: number
- name: num_var7_emit_ult1
type: number
- name: num_var7_recib_ult1
type: number
- name: num_compra_var44_hace3
type: number
- name: num_compra_var44_ult1
type: number
- name: num_ent_var16_ult1
type: number
- name: num_var22_hace2
type: number
- name: num_var22_hace3
type: number
- name: num_var22_ult1
type: number
- name: num_var22_ult3
type: number
- name: num_med_var22_ult3
type: number
- name: num_med_var45_ult3
type: number
- name: num_meses_var5_ult3
type: category
- name: num_meses_var8_ult3
type: category
- name: num_meses_var12_ult3
type: category
- name: num_meses_var13_corto_ult3
type: category
- name: num_meses_var13_largo_ult3
type: category
- name: num_meses_var13_medio_ult3
type: number
- name: num_meses_var17_ult3
type: category
- name: num_meses_var29_ult3
type: category
- name: num_meses_var33_ult3
type: category
- name: num_meses_var39_vig_ult3
type: category
- name: num_meses_var44_ult3
type: category
- name: num_op_var39_comer_ult1
type: number
- name: num_op_var39_comer_ult3
type: number
- name: num_op_var40_comer_ult1
type: number
- name: num_op_var40_comer_ult3
type: number
- name: num_op_var40_efect_ult1
type: number
- name: num_op_var40_efect_ult3
type: number
- name: num_op_var41_comer_ult1
type: number
- name: num_op_var41_comer_ult3
type: number
- name: num_op_var41_efect_ult1
type: number
- name: num_op_var41_efect_ult3
type: number
- name: num_op_var39_efect_ult1
type: number
- name: num_op_var39_efect_ult3
type: number
- name: num_reemb_var13_hace3
type: binary
- name: num_reemb_var13_ult1
type: number
- name: num_reemb_var17_hace3
type: number
- name: num_reemb_var17_ult1
type: number
- name: num_reemb_var33_hace3
type: binary
- name: num_reemb_var33_ult1
type: number
- name: num_sal_var16_ult1
type: number
- name: num_var43_emit_ult1
type: number
- name: num_var43_recib_ult1
type: number
- name: num_trasp_var11_ult1
type: number
- name: num_trasp_var17_in_hace3
type: number
- name: num_trasp_var17_in_ult1
type: number
- name: num_trasp_var17_out_hace3
type: binary
- name: num_trasp_var17_out_ult1
type: number
- name: num_trasp_var33_in_hace3
type: number
- name: num_trasp_var33_in_ult1
type: number
- name: num_trasp_var33_out_hace3
type: binary
- name: num_trasp_var33_out_ult1
type: number
- name: num_venta_var44_hace3
type: number
- name: num_venta_var44_ult1
type: number
- name: num_var45_hace2
type: number
- name: num_var45_hace3
type: number
- name: num_var45_ult1
type: number
- name: num_var45_ult3
type: number
- name: saldo_var2_ult1
type: binary
- name: saldo_medio_var5_hace2
type: number
- name: saldo_medio_var5_hace3
type: number
- name: saldo_medio_var5_ult1
type: number
- name: saldo_medio_var5_ult3
type: number
- name: saldo_medio_var8_hace2
type: number
- name: saldo_medio_var8_hace3
type: number
- name: saldo_medio_var8_ult1
type: number
- name: saldo_medio_var8_ult3
type: number
- name: saldo_medio_var12_hace2
type: number
- name: saldo_medio_var12_hace3
type: number
- name: saldo_medio_var12_ult1
type: number
- name: saldo_medio_var12_ult3
type: number
- name: saldo_medio_var13_corto_hace2
type: number
- name: saldo_medio_var13_corto_hace3
type: number
- name: saldo_medio_var13_corto_ult1
type: number
- name: saldo_medio_var13_corto_ult3
type: number
- name: saldo_medio_var13_largo_hace2
type: number
- name: saldo_medio_var13_largo_hace3
type: number
- name: saldo_medio_var13_largo_ult1
type: number
- name: saldo_medio_var13_largo_ult3
type: number
- name: saldo_medio_var13_medio_hace2
type: number
- name: saldo_medio_var13_medio_hace3
type: binary
- name: saldo_medio_var13_medio_ult1
type: number
- name: saldo_medio_var13_medio_ult3
type: number
- name: saldo_medio_var17_hace2
type: number
- name: saldo_medio_var17_hace3
type: number
- name: saldo_medio_var17_ult1
type: number
- name: saldo_medio_var17_ult3
type: number
- name: saldo_medio_var29_hace2
type: number
- name: saldo_medio_var29_hace3
type: number
- name: saldo_medio_var29_ult1
type: number
- name: saldo_medio_var29_ult3
type: number
- name: saldo_medio_var33_hace2
type: number
- name: saldo_medio_var33_hace3
type: number
- name: saldo_medio_var33_ult1
type: number
- name: saldo_medio_var33_ult3
type: number
- name: saldo_medio_var44_hace2
type: number
- name: saldo_medio_var44_hace3
type: number
- name: saldo_medio_var44_ult1
type: number
- name: saldo_medio_var44_ult3
type: number
- name: var38
type: number
combiner:
type: tabnet
size: 24 # N_a
output_size: 128 # N_d
sparsity: 0.001 # lambda_sparse
bn_momentum: 0.2 # m_B
num_steps: 7 # N_steps
relaxation_factor: 1.2 # gamma
bn_virtual_bs: 256 # B_v
trainer:
batch_size: 4096 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.005
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 10000
decay_rate: 0.8
validation_metric: accuracy
- name: santander_customer_transaction
goal: maximize
metric: accuracy
validation_metric_score: 0.9150915145874023
training_rows: 139904
test_rows: 40098
validation_rows: 19998
config:
output_features:
- name: target
type: binary
input_features:
- name: var_0
type: number
- name: var_1
type: number
- name: var_2
type: number
- name: var_3
type: number
- name: var_4
type: number
- name: var_5
type: number
- name: var_6
type: number
- name: var_7
type: number
- name: var_8
type: number
- name: var_9
type: number
- name: var_10
type: number
- name: var_11
type: number
- name: var_12
type: number
- name: var_13
type: number
- name: var_14
type: number
- name: var_15
type: number
- name: var_16
type: number
- name: var_17
type: number
- name: var_18
type: number
- name: var_19
type: number
- name: var_20
type: number
- name: var_21
type: number
- name: var_22
type: number
- name: var_23
type: number
- name: var_24
type: number
- name: var_25
type: number
- name: var_26
type: number
- name: var_27
type: number
- name: var_28
type: number
- name: var_29
type: number
- name: var_30
type: number
- name: var_31
type: number
- name: var_32
type: number
- name: var_33
type: number
- name: var_34
type: number
- name: var_35
type: number
- name: var_36
type: number
- name: var_37
type: number
- name: var_38
type: number
- name: var_39
type: number
- name: var_40
type: number
- name: var_41
type: number
- name: var_42
type: number
- name: var_43
type: number
- name: var_44
type: number
- name: var_45
type: number
- name: var_46
type: number
- name: var_47
type: number
- name: var_48
type: number
- name: var_49
type: number
- name: var_50
type: number
- name: var_51
type: number
- name: var_52
type: number
- name: var_53
type: number
- name: var_54
type: number
- name: var_55
type: number
- name: var_56
type: number
- name: var_57
type: number
- name: var_58
type: number
- name: var_59
type: number
- name: var_60
type: number
- name: var_61
type: number
- name: var_62
type: number
- name: var_63
type: number
- name: var_64
type: number
- name: var_65
type: number
- name: var_66
type: number
- name: var_67
type: number
- name: var_68
type: number
- name: var_69
type: number
- name: var_70
type: number
- name: var_71
type: number
- name: var_72
type: number
- name: var_73
type: number
- name: var_74
type: number
- name: var_75
type: number
- name: var_76
type: number
- name: var_77
type: number
- name: var_78
type: number
- name: var_79
type: number
- name: var_80
type: number
- name: var_81
type: number
- name: var_82
type: number
- name: var_83
type: number
- name: var_84
type: number
- name: var_85
type: number
- name: var_86
type: number
- name: var_87
type: number
- name: var_88
type: number
- name: var_89
type: number
- name: var_90
type: number
- name: var_91
type: number
- name: var_92
type: number
- name: var_93
type: number
- name: var_94
type: number
- name: var_95
type: number
- name: var_96
type: number
- name: var_97
type: number
- name: var_98
type: number
- name: var_99
type: number
- name: var_100
type: number
- name: var_101
type: number
- name: var_102
type: number
- name: var_103
type: number
- name: var_104
type: number
- name: var_105
type: number
- name: var_106
type: number
- name: var_107
type: number
- name: var_108
type: number
- name: var_109
type: number
- name: var_110
type: number
- name: var_111
type: number
- name: var_112
type: number
- name: var_113
type: number
- name: var_114
type: number
- name: var_115
type: number
- name: var_116
type: number
- name: var_117
type: number
- name: var_118
type: number
- name: var_119
type: number
- name: var_120
type: number
- name: var_121
type: number
- name: var_122
type: number
- name: var_123
type: number
- name: var_124
type: number
- name: var_125
type: number
- name: var_126
type: number
- name: var_127
type: number
- name: var_128
type: number
- name: var_129
type: number
- name: var_130
type: number
- name: var_131
type: number
- name: var_132
type: number
- name: var_133
type: number
- name: var_134
type: number
- name: var_135
type: number
- name: var_136
type: number
- name: var_137
type: number
- name: var_138
type: number
- name: var_139
type: number
- name: var_140
type: number
- name: var_141
type: number
- name: var_142
type: number
- name: var_143
type: number
- name: var_144
type: number
- name: var_145
type: number
- name: var_146
type: number
- name: var_147
type: number
- name: var_148
type: number
- name: var_149
type: number
- name: var_150
type: number
- name: var_151
type: number
- name: var_152
type: number
- name: var_153
type: number
- name: var_154
type: number
- name: var_155
type: number
- name: var_156
type: number
- name: var_157
type: number
- name: var_158
type: number
- name: var_159
type: number
- name: var_160
type: number
- name: var_161
type: number
- name: var_162
type: number
- name: var_163
type: number
- name: var_164
type: number
- name: var_165
type: number
- name: var_166
type: number
- name: var_167
type: number
- name: var_168
type: number
- name: var_169
type: number
- name: var_170
type: number
- name: var_171
type: number
- name: var_172
type: number
- name: var_173
type: number
- name: var_174
type: number
- name: var_175
type: number
- name: var_176
type: number
- name: var_177
type: number
- name: var_178
type: number
- name: var_179
type: number
- name: var_180
type: number
- name: var_181
type: number
- name: var_182
type: number
- name: var_183
type: number
- name: var_184
type: number
- name: var_185
type: number
- name: var_186
type: number
- name: var_187
type: number
- name: var_188
type: number
- name: var_189
type: number
- name: var_190
type: number
- name: var_191
type: number
- name: var_192
type: number
- name: var_193
type: number
- name: var_194
type: number
- name: var_195
type: number
- name: var_196
type: number
- name: var_197
type: number
- name: var_198
type: number
- name: var_199
type: number
combiner:
type: tabnet
size: 8 # N_a
output_size: 8 # N_d
sparsity: 0.0 # lambda_sparse
bn_momentum: 0.4 # m_B
num_steps: 3 # N_steps
relaxation_factor: 2.0 # gamma
bn_virtual_bs: 256 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.005
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 20000
decay_rate: 0.95
validation_metric: accuracy
- name: sarcos
goal: minimize
metric: root_mean_squared_error
validation_metric_score: 2.0124664306640625
training_rows: 40036
test_rows: 0
validation_rows: 4448
config:
output_features:
- name: torque_1
type: number
input_features:
- name: position_1
type: number
- name: position_2
type: number
- name: position_3
type: number
- name: position_4
type: number
- name: position_5
type: number
- name: position_6
type: number
- name: position_7
type: number
- name: velocity_1
type: number
- name: velocity_2
type: number
- name: velocity_3
type: number
- name: velocity_4
type: number
- name: velocity_5
type: number
- name: velocity_6
type: number
- name: velocity_7
type: number
- name: acceleration_1
type: number
- name: acceleration_2
type: number
- name: acceleration_3
type: number
- name: acceleration_4
type: number
- name: acceleration_5
type: number
- name: acceleration_6
type: number
- name: acceleration_7
type: number
combiner:
type: tabnet
size: 128 # N_a
output_size: 8 # N_d
sparsity: 0.000001 # lambda_sparse
bn_momentum: 0.02 # m_B
num_steps: 4 # N_steps
relaxation_factor: 1.2 # gamma
bn_virtual_bs: 4096 # B_v
trainer:
batch_size: 256 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.005
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 20000
decay_rate: 0.4
validation_metric: root_mean_squared_error
- name: walmart_recruiting
goal: maximize
metric: accuracy
validation_metric_score: 0.31689465045928955
training_rows: 453154
test_rows: 129276
validation_rows: 64624
config:
output_features:
- name: TripType
type: category
input_features:
- name: VisitNumber
type: number
- name: Weekday
type: category
- name: Upc
type: number
- name: ScanCount
type: number
- name: FinelineNumber
type: number
combiner:
type: tabnet
size: 32 # N_a
output_size: 128 # N_d
sparsity: 0.000001 # lambda_sparse
bn_momentum: 0.4 # m_B
num_steps: 4 # N_steps
relaxation_factor: 1.2 # gamma
bn_virtual_bs: 4096 # B_v
trainer:
batch_size: 8192 # B
eval_batch_size: null # 65536 131072 262144 524288
epochs: 300
early_stop: 30
learning_rate: 0.01
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_steps: 20000
decay_rate: 0.9
validation_metric: accuracy
================================================
FILE: ludwig/automl/defaults/text/bert_config.yaml
================================================
trainer:
  epochs: 10
  learning_rate_scheduler:
    warmup_fraction: 0.1
    decay: linear
  optimizer:
    type: adamw
  use_mixed_precision: true

defaults:
  text:
    encoder:
      type: bert
      trainable: true

hyperopt:
  # goal: maximize
  parameters:
    # This parameter space was updated to be loguniform because of issues merging with the trainer.learning_rate
    # parameter space in ludwig/automl/defaults/combiner/concat_config.yaml. Doing automl on a text feature would
    # create an invalid combination of loguniform and choice parameters.
    # TODO(jeffkinnison): Add a second pass `merge_dicts` to handle parameter spaces
    trainer.learning_rate:
      space: loguniform
      lower: 0.00002
      upper: 0.00003
    trainer.batch_size:
      space: choice
      categories: [16, 32, 64, 128]
================================================
FILE: ludwig/backend/__init__.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import contextlib
import logging
import os

from ludwig.api_annotations import DeveloperAPI
from ludwig.backend.base import Backend, LocalBackend

logger = logging.getLogger(__name__)

# TODO: remove LOCAL_BACKEND as a global constant, replace with singleton LocalBackend.shared_instance().
LOCAL_BACKEND = LocalBackend.shared_instance()

LOCAL = "local"
DASK = "dask"
DEEPSPEED = "deepspeed"
RAY = "ray"

ALL_BACKENDS = [LOCAL, DASK, DEEPSPEED, RAY]


def _has_ray():
    # Temporary workaround to prevent tests from automatically using the Ray backend. Taken from
    # https://stackoverflow.com/questions/25188119/test-if-code-is-executed-from-within-a-py-test-session
    if "PYTEST_CURRENT_TEST" in os.environ:
        return False

    try:
        import ray
    except ImportError:
        return False

    if ray.is_initialized():
        return True

    try:
        ray.init("auto", ignore_reinit_error=True)
        return True
    except Exception:
        return False


def get_local_backend(**kwargs):
    return LocalBackend(**kwargs)


def create_deepspeed_backend(**kwargs):
    from ludwig.backend.deepspeed import DeepSpeedBackend

    return DeepSpeedBackend(**kwargs)


def create_ray_backend(**kwargs):
    from ludwig.backend.ray import RayBackend

    return RayBackend(**kwargs)


backend_registry = {
    LOCAL: get_local_backend,
    DEEPSPEED: create_deepspeed_backend,
    RAY: create_ray_backend,
    None: get_local_backend,
}


@DeveloperAPI
def create_backend(type, **kwargs):
    if isinstance(type, Backend):
        return type

    if type is None and _has_ray():
        type = RAY

    return backend_registry[type](**kwargs)


@DeveloperAPI
def initialize_backend(backend):
    if isinstance(backend, dict):
        backend = create_backend(**backend)
    else:
        backend = create_backend(backend)
    backend.initialize()
    return backend


@contextlib.contextmanager
def provision_preprocessing_workers(backend):
    if backend.BACKEND_TYPE == RAY:
        with backend.provision_preprocessing_workers():
            yield
    else:
        yield
================================================
FILE: ludwig/backend/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from __future__ import annotations
import time
from abc import ABC, abstractmethod
from collections.abc import Callable, Generator
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from typing import Any, TYPE_CHECKING
import numpy as np
import pandas as pd
import psutil
import torch
from tqdm import tqdm
from ludwig.api_annotations import DeveloperAPI
from ludwig.backend.utils.storage import StorageManager
from ludwig.constants import MODEL_LLM
from ludwig.data.cache.manager import CacheManager
from ludwig.data.dataframe.base import DataFrameEngine
from ludwig.data.dataframe.pandas import PANDAS
from ludwig.data.dataset.base import DatasetManager
from ludwig.data.dataset.pandas import PandasDatasetManager
from ludwig.distributed import init_dist_strategy
from ludwig.distributed.base import DistributedStrategy
from ludwig.models.base import BaseModel
from ludwig.schema.trainer import BaseTrainerConfig
from ludwig.types import HyperoptConfigDict
from ludwig.utils.audio_utils import read_audio_from_path
from ludwig.utils.batch_size_tuner import BatchSizeEvaluator
from ludwig.utils.dataframe_utils import from_batches, to_batches
from ludwig.utils.fs_utils import get_bytes_obj_from_path
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.system_utils import Resources
from ludwig.utils.torch_utils import initialize_pytorch
from ludwig.utils.types import DataFrame, Series
if TYPE_CHECKING:
    from ludwig.trainers.base import BaseTrainer
@DeveloperAPI
class Backend(ABC):
    def __init__(
        self,
        dataset_manager: DatasetManager,
        cache_dir: str | None = None,
        credentials: dict[str, dict[str, Any]] | None = None,
    ):
        credentials = credentials or {}
        self._dataset_manager = dataset_manager
        self._storage_manager = StorageManager(**credentials)
        self._cache_manager = CacheManager(self._dataset_manager, cache_dir)

    @property
    def storage(self) -> StorageManager:
        return self._storage_manager

    @property
    def cache(self) -> CacheManager:
        return self._cache_manager

    @property
    def dataset_manager(self) -> DatasetManager:
        return self._dataset_manager

    @abstractmethod
    def initialize(self):
        raise NotImplementedError()

    @abstractmethod
    def initialize_pytorch(self, *args, **kwargs):
        raise NotImplementedError()

    @contextmanager
    @abstractmethod
    def create_trainer(self, config: BaseTrainerConfig, model: BaseModel, **kwargs) -> Generator:
        raise NotImplementedError()

    @abstractmethod
    def sync_model(self, model):
        raise NotImplementedError()

    @abstractmethod
    def broadcast_return(self, fn):
        raise NotImplementedError()

    @abstractmethod
    def is_coordinator(self):
        raise NotImplementedError()

    @property
    @abstractmethod
    def df_engine(self) -> DataFrameEngine:
        raise NotImplementedError()

    @property
    @abstractmethod
    def supports_multiprocessing(self):
        raise NotImplementedError()

    @abstractmethod
    def read_binary_files(self, column: Series, map_fn: Callable | None = None) -> Series:
        raise NotImplementedError()

    @property
    @abstractmethod
    def num_nodes(self) -> int:
        raise NotImplementedError()

    @property
    @abstractmethod
    def num_training_workers(self) -> int:
        raise NotImplementedError()

    @abstractmethod
    def get_available_resources(self) -> Resources:
        raise NotImplementedError()

    @abstractmethod
    def max_concurrent_trials(self, hyperopt_config: HyperoptConfigDict) -> int | None:
        raise NotImplementedError()

    @abstractmethod
    def tune_batch_size(self, evaluator_cls: type[BatchSizeEvaluator], dataset_len: int) -> int:
        """Returns best batch size (measured in samples / s) on the given evaluator.

        The evaluator class will need to be instantiated on each worker in the backend cluster, then call
        `evaluator.select_best_batch_size(dataset_len)`.
        """
        raise NotImplementedError()

    @abstractmethod
    def batch_transform(
        self, df: DataFrame, batch_size: int, transform_fn: Callable, name: str | None = None
    ) -> DataFrame:
        """Applies `transform_fn` to every `batch_size` length batch of `df` and returns the result."""
        raise NotImplementedError()

    def supports_batch_size_tuning(self) -> bool:
        return True
class LocalPreprocessingMixin:
    @property
    def df_engine(self):
        return PANDAS

    @property
    def supports_multiprocessing(self):
        return True

    @staticmethod
    def read_binary_files(column: pd.Series, map_fn: Callable | None = None, file_size: int | None = None) -> pd.Series:
        column = column.fillna(np.nan).replace([np.nan], [None])  # normalize NaNs to None

        sample_fname = column.head(1).values[0]
        with ThreadPoolExecutor() as executor:  # number of threads is inferred
            if isinstance(sample_fname, str):
                if map_fn is read_audio_from_path:  # bypass torchaudio issue that no longer takes in file-like objects
                    result = executor.map(  # type: ignore[misc]
                        lambda path: map_fn(path) if path is not None else path, column.values
                    )
                else:
                    result = executor.map(
                        lambda path: get_bytes_obj_from_path(path) if path is not None else path, column.values
                    )
            else:
                # If the sample path is not a string, assume the paths have already been read in
                result = column.values

            if map_fn is not None and map_fn is not read_audio_from_path:
                result = executor.map(map_fn, result)

        return pd.Series(result, index=column.index, name=column.name)

    @staticmethod
    def batch_transform(df: DataFrame, batch_size: int, transform_fn: Callable, name: str | None = None) -> DataFrame:
        name = name or "Batch Transform"
        batches = to_batches(df, batch_size)
        transform = transform_fn()
        out_batches = [transform(batch.reset_index(drop=True)) for batch in tqdm(batches, desc=name)]
        out_df = from_batches(out_batches).reset_index(drop=True)
        return out_df
class LocalTrainingMixin:
    @staticmethod
    def initialize():
        init_dist_strategy("local")

    @staticmethod
    def initialize_pytorch(*args, **kwargs):
        initialize_pytorch(*args, **kwargs)

    @staticmethod
    def create_predictor(model: BaseModel, **kwargs):
        from ludwig.models.predictor import get_predictor_cls

        return get_predictor_cls(model.type())(model, **kwargs)  # type: ignore[call-arg]

    def sync_model(self, model):
        pass

    @staticmethod
    def broadcast_return(fn):
        return fn()

    @staticmethod
    def is_coordinator() -> bool:
        return True

    @staticmethod
    def tune_batch_size(evaluator_cls: type[BatchSizeEvaluator], dataset_len: int) -> int:
        evaluator = evaluator_cls()
        return evaluator.select_best_batch_size(dataset_len)


class RemoteTrainingMixin:
    def sync_model(self, model):
        pass

    @staticmethod
    def broadcast_return(fn):
        return fn()

    @staticmethod
    def is_coordinator() -> bool:
        return True
@DeveloperAPI
class LocalBackend(LocalPreprocessingMixin, LocalTrainingMixin, Backend):
    BACKEND_TYPE = "local"

    _shared_instance: LocalBackend

    @classmethod
    def shared_instance(cls) -> LocalBackend:
        """Returns a shared singleton LocalBackend instance."""
        if not hasattr(cls, "_shared_instance"):
            cls._shared_instance = cls()
        return cls._shared_instance

    def __init__(self, **kwargs) -> None:
        super().__init__(dataset_manager=PandasDatasetManager(self), **kwargs)

    @property
    def num_nodes(self) -> int:
        return 1

    @property
    def num_training_workers(self) -> int:
        return 1

    def get_available_resources(self) -> Resources:
        return Resources(cpus=psutil.cpu_count(), gpus=torch.cuda.device_count())

    def max_concurrent_trials(self, hyperopt_config: HyperoptConfigDict) -> int | None:
        # Every trial will be run with Pandas and NO Ray Datasets. Allow Ray Tune to use all the
        # trial resources it wants, because there is no Ray Datasets process to compete with it for CPUs.
        return None

    def create_trainer(
        self,
        config: BaseTrainerConfig,
        model: BaseModel,
        **kwargs,
    ) -> BaseTrainer:  # type: ignore[override]
        from ludwig.trainers.registry import get_llm_trainers_registry, get_trainers_registry

        trainer_cls: type
        if model.type() == MODEL_LLM:
            trainer_cls = get_from_registry(config.type, get_llm_trainers_registry())
        else:
            trainer_cls = get_from_registry(model.type(), get_trainers_registry())

        return trainer_cls(config=config, model=model, **kwargs)
@DeveloperAPI
class DataParallelBackend(LocalPreprocessingMixin, Backend, ABC):
    BACKEND_TYPE = "deepspeed"

    def __init__(self, **kwargs):
        super().__init__(dataset_manager=PandasDatasetManager(self), **kwargs)
        self._distributed: DistributedStrategy | None = None

    @abstractmethod
    def initialize(self):
        pass

    def initialize_pytorch(self, *args, **kwargs):
        initialize_pytorch(
            *args, local_rank=self._distributed.local_rank(), local_size=self._distributed.local_size(), **kwargs
        )

    def create_trainer(
        self,
        config: BaseTrainerConfig,
        model: BaseModel,
        **kwargs,
    ) -> BaseTrainer:  # type: ignore[override]
        from ludwig.trainers.trainer import Trainer

        return Trainer(config, model, distributed=self._distributed, **kwargs)

    def create_predictor(self, model: BaseModel, **kwargs):
        from ludwig.models.predictor import get_predictor_cls

        return get_predictor_cls(model.type())(model, distributed=self._distributed, **kwargs)  # type: ignore[call-arg]

    def sync_model(self, model):
        # Model weights are only saved on the coordinator, so broadcast
        # to all other ranks
        self._distributed.sync_model(model)

    def broadcast_return(self, fn):
        """Returns the result of calling `fn` on coordinator, broadcast to all other ranks.

        Specifically, `fn` is only executed on coordinator, but its result is returned by every rank by broadcasting the
        return value from coordinator.
        """
        result = fn() if self.is_coordinator() else None
        if self._distributed:
            name = f"broadcast_return_{int(time.time())}"
            result = self._distributed.broadcast_object(result, name=name)
        return result

    def is_coordinator(self):
        return self._distributed.rank() == 0

    @property
    def num_nodes(self) -> int:
        return self._distributed.size() // self._distributed.local_size()

    @property
    def num_training_workers(self) -> int:
        return self._distributed.size()

    def get_available_resources(self) -> Resources:
        # TODO(travis): this double-counts on the same device, it should use a cross-communicator instead
        cpus = torch.as_tensor([psutil.cpu_count()], dtype=torch.int)
        cpus = self._distributed.allreduce(cpus).item()

        gpus = torch.as_tensor([torch.cuda.device_count()], dtype=torch.int)
        gpus = self._distributed.allreduce(gpus).item()

        return Resources(cpus=cpus, gpus=gpus)

    def max_concurrent_trials(self, hyperopt_config: HyperoptConfigDict) -> int | None:
        # Return None since there is no Ray component
        return None

    def tune_batch_size(self, evaluator_cls: type[BatchSizeEvaluator], dataset_len: int) -> int:
        evaluator = evaluator_cls()
        return evaluator.select_best_batch_size(dataset_len)
================================================
FILE: ludwig/backend/datasource.py
================================================
"""Custom Ray datasource utilities for reading binary files with None handling."""
import logging
from typing import Optional, TYPE_CHECKING

import pandas as pd
import ray
import urllib3

from ludwig.utils.fs_utils import get_bytes_obj_from_http_path, is_http

if TYPE_CHECKING:
    import pyarrow

logger = logging.getLogger(__name__)


def read_binary_files_with_index(
    paths_and_idxs: list[tuple[str | None, int]],
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
) -> "ray.data.Dataset":
    """Read binary files into a Ray Dataset, handling None paths and HTTP URLs.

    Each row in the resulting dataset has columns:
        - "data": the raw bytes of the file (or None if path was None/failed)
        - "idx": the original index for reordering

    Args:
        paths_and_idxs: List of (path, index) tuples. Path can be None.
        filesystem: PyArrow filesystem for reading non-HTTP files.

    Returns:
        A ray.data.Dataset with "data" and "idx" columns.
    """

    def _read_file(path: str | None, idx: int) -> dict:
        if path is None:
            return {"data": None, "idx": idx}
        elif is_http(path):
            try:
                data = get_bytes_obj_from_http_path(path)
            except urllib3.exceptions.HTTPError as e:
                logger.warning(e)
                data = None
            return {"data": data, "idx": idx}
        else:
            try:
                with filesystem.open_input_stream(path) as f:
                    data = f.read()
            except Exception as e:
                logger.warning(f"Failed to read file {path}: {e}")
                data = None
            return {"data": data, "idx": idx}

    # Create a dataset from the paths and indices, then map to read files
    records = [{"path": p, "idx": i} for p, i in paths_and_idxs]
    ds = ray.data.from_items(records)

    def read_batch(batch: pd.DataFrame) -> pd.DataFrame:
        results = []
        for _, row in batch.iterrows():
            result = _read_file(row["path"], row["idx"])
            results.append(result)
        return pd.DataFrame(results)

    ds = ds.map_batches(read_batch, batch_format="pandas")
    return ds
================================================
FILE: ludwig/backend/deepspeed.py
================================================
from typing import Any

import deepspeed

from ludwig.backend.base import DataParallelBackend
from ludwig.constants import FALLBACK_BATCH_SIZE
from ludwig.distributed import init_dist_strategy
from ludwig.utils.batch_size_tuner import BatchSizeEvaluator


class DeepSpeedBackend(DataParallelBackend):
    BACKEND_TYPE = "deepspeed"

    def __init__(
        self,
        zero_optimization: dict[str, Any] | None = None,
        fp16: dict[str, Any] | None = None,
        bf16: dict[str, Any] | None = None,
        compression_training: dict[str, Any] | None = None,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.zero_optimization = zero_optimization
        self.fp16 = fp16
        self.bf16 = bf16
        self.compression_training = compression_training

    def initialize(self):
        # Unlike when we use the Ray backend, we need to initialize the `torch.distributed` context so we can
        # broadcast, allgather, etc. before preparing the model within the trainer.
        deepspeed.init_distributed()
        self._distributed = init_dist_strategy(
            self.BACKEND_TYPE,
            zero_optimization=self.zero_optimization,
            fp16=self.fp16,
            bf16=self.bf16,
            compression_training=self.compression_training,
        )

    def supports_batch_size_tuning(self) -> bool:
        # TODO(travis): need to fix checkpoint saving/loading for DeepSpeed to enable tuning
        return False

    def tune_batch_size(self, evaluator_cls: type[BatchSizeEvaluator], dataset_len: int) -> int:
        return FALLBACK_BATCH_SIZE
================================================
FILE: ludwig/backend/ray.py
================================================
#! /usr/bin/env python
# Copyright (c) 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import contextlib
import copy
import logging
import os
import tempfile
from collections.abc import Callable
from functools import partial
from typing import Any, TYPE_CHECKING
import dask
import numpy as np
import pandas as pd
import ray
import ray.train as rt
import torch
import tqdm
from fsspec.config import conf
from pyarrow.fs import FSSpecHandler, PyFileSystem
from ray import ObjectRef
from ray.train import Checkpoint, RunConfig, ScalingConfig
from ray.train.constants import TRAIN_ENABLE_WORKER_SPREAD_ENV
from ray.train.torch import TorchConfig, TorchTrainer
from ray.util.dask import ray_dask_get
from ray.util.placement_group import placement_group, remove_placement_group
if TYPE_CHECKING:
from ludwig.api import LudwigModel
from ludwig.backend.base import Backend, RemoteTrainingMixin
from ludwig.backend.datasource import read_binary_files_with_index
from ludwig.constants import MODEL_ECD, MODEL_LLM, NAME, PREPROCESSING, PROC_COLUMN, TYPE
from ludwig.data.dataframe.base import DataFrameEngine
try:
from ludwig.data.dataset.ray import (
_SCALAR_TYPES,
cast_as_tensor_dtype,
RayDataset,
RayDatasetManager,
RayDatasetShard,
)
except (ImportError, AttributeError):
_SCALAR_TYPES = cast_as_tensor_dtype = RayDataset = RayDatasetManager = RayDatasetShard = None
from ludwig.models.base import BaseModel
from ludwig.models.ecd import ECD
from ludwig.models.predictor import BasePredictor, get_output_columns, get_predictor_cls
from ludwig.schema.trainer import ECDTrainerConfig
from ludwig.trainers.registry import get_ray_trainers_registry, register_ray_trainer
from ludwig.trainers.trainer import BaseTrainer, RemoteTrainer
from ludwig.utils.data_utils import use_credentials
from ludwig.utils.fs_utils import get_fs_and_path
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.system_utils import Resources
from ludwig.utils.torch_utils import get_torch_device, initialize_pytorch
from ludwig.utils.types import Series
logger = logging.getLogger(__name__)
FIFTEEN_MINS_IN_S = 15 * 60
def _num_nodes() -> int:
node_resources = [node["Resources"] for node in ray.nodes()]
return len(node_resources)
def get_trainer_kwargs(**kwargs) -> dict[str, Any]:
kwargs = copy.deepcopy(kwargs)
# Our goal is to have a worker per resource used for training.
# The priority is GPUs, but can fall back to CPUs if there are no
# GPUs available.
use_gpu = kwargs.get("use_gpu", int(ray.cluster_resources().get("GPU", 0)) > 0)
if use_gpu:
num_workers = int(ray.cluster_resources().get("GPU", 0))
else:
num_workers = _num_nodes()
# Remove nics if present (legacy option)
kwargs.pop("nics", None)
defaults = dict(
backend=TorchConfig(),
num_workers=num_workers,
use_gpu=use_gpu,
resources_per_worker={
"CPU": 0 if use_gpu else 1,
"GPU": 1 if use_gpu else 0,
},
)
return {**defaults, **kwargs}
def _create_dask_engine(**kwargs):
from ludwig.data.dataframe.dask import DaskEngine
return DaskEngine(**kwargs)
def _create_modin_engine(**kwargs):
from ludwig.data.dataframe.modin import ModinEngine
return ModinEngine(**kwargs)
def _create_pandas_engine(**kwargs):
from ludwig.data.dataframe.pandas import PandasEngine
return PandasEngine(**kwargs)
_engine_registry = {
"dask": _create_dask_engine,
"modin": _create_modin_engine,
"pandas": _create_pandas_engine,
}
def _get_df_engine(processor):
    logger.info(f"Ray processor params: {processor}")
    if processor is None:
        # TODO ray: find an informed way to set the parallelism, in practice
        # it looks like Dask handles this well on its own most of the time
        return _create_dask_engine()
    processor_kwargs = processor.copy()
    dtype = processor_kwargs.pop("type", "dask")
    engine_cls = _engine_registry.get(dtype)
    if engine_cls is None:
        # Fail with a clear message instead of "'NoneType' object is not callable".
        raise ValueError(f"Unknown Ray processor type {dtype!r}; expected one of {sorted(_engine_registry)}")
    return engine_cls(**processor_kwargs)
def _make_picklable(obj):
    """Recursively convert defaultdicts (which contain unpicklable lambdas) to regular dicts."""
    if isinstance(obj, dict):
        # Covers defaultdict and other dict subclasses: rebuild as a plain dict.
        return {k: _make_picklable(v) for k, v in obj.items()}
    elif isinstance(obj, tuple) and hasattr(obj, "_fields"):
        # NamedTuple: reconstruct with the same field names
        return type(obj)(**{f: _make_picklable(getattr(obj, f)) for f in obj._fields})
    elif isinstance(obj, list):
        return [_make_picklable(item) for item in obj]
    elif isinstance(obj, tuple):
        return tuple(_make_picklable(item) for item in obj)
    return obj
def train_fn(
executable_kwargs: dict[str, Any] = None,
model_ref: ObjectRef = None, # noqa: F821
training_set_metadata: dict[str, Any] = None,
features: dict[str, dict] = None,
**kwargs,
):
"""Ray Train worker function for distributed training.
Runs inside each Ray worker process. Loads the model from an object ref, wraps dataset shards, trains, and saves
results to a Ray checkpoint so the driver can retrieve them (Ray Train 2.x requires a checkpoint for metrics).
"""
# Pin GPU before loading the model to prevent memory leaking onto other devices
initialize_pytorch()
# Initialize a local distributed strategy so metric modules can sync.
from ludwig.distributed import init_dist_strategy
init_dist_strategy("local")
train_shard = RayDatasetShard(
rt.get_dataset_shard("train"),
features,
training_set_metadata,
)
try:
val_shard = rt.get_dataset_shard("val")
except KeyError:
val_shard = None
if val_shard is not None:
val_shard = RayDatasetShard(
val_shard,
features,
training_set_metadata,
)
try:
test_shard = rt.get_dataset_shard("test")
except KeyError:
test_shard = None
if test_shard is not None:
test_shard = RayDatasetShard(
test_shard,
features,
training_set_metadata,
)
model = ray.get(model_ref)
# Use Ray Train's device assignment which respects use_gpu setting,
# rather than get_torch_device() which always picks CUDA if available.
from ray.train.torch import get_device as ray_get_device
device = ray_get_device()
model = model.to(device)
trainer = RemoteTrainer(model=model, report_tqdm_to_ray=True, **executable_kwargs)
results = trainer.train(train_shard, val_shard, test_shard, **kwargs)
if results is not None:
# only return the model state dict back to the head node.
trained_model, *args = results
results = (trained_model.cpu().state_dict(), *args)
torch.cuda.empty_cache()
# Save results to a checkpoint so the driver can retrieve them.
# In Ray Train 2.x, result.metrics is only populated when a checkpoint is provided.
train_results = results, trainer.validation_field, trainer.validation_metric
# Convert defaultdicts to regular dicts so they can be pickled by torch.save.
train_results = _make_picklable(train_results)
with tempfile.TemporaryDirectory() as tmpdir:
torch.save(train_results, os.path.join(tmpdir, "train_results.pt"))
rt.report(metrics={}, checkpoint=Checkpoint.from_directory(tmpdir))
@ray.remote
def tune_batch_size_fn(
dataset: RayDataset = None,
data_loader_kwargs: dict[str, Any] = None,
executable_kwargs: dict[str, Any] = None,
model: ECD = None, # noqa: F821
ludwig_config: dict[str, Any] = None,
training_set_metadata: dict[str, Any] = None,
features: dict[str, dict] = None,
**kwargs,
) -> int:
# Pin GPU before loading the model to prevent memory leaking onto other devices
initialize_pytorch()
try:
ds = dataset.to_ray_dataset(shuffle=False)
train_shard = RayDatasetShard(
ds,
features,
training_set_metadata,
)
device = get_torch_device()
model = model.to(device)
trainer = RemoteTrainer(model=model, **executable_kwargs)
return trainer.tune_batch_size(ludwig_config, train_shard, **kwargs)
finally:
torch.cuda.empty_cache()
@ray.remote
def tune_learning_rate_fn(
dataset: RayDataset,
config: dict[str, Any],
data_loader_kwargs: dict[str, Any] = None,
executable_kwargs: dict[str, Any] = None,
model: ECD = None, # noqa: F821
training_set_metadata: dict[str, Any] = None,
features: dict[str, dict] = None,
**kwargs,
) -> float:
# Pin GPU before loading the model to prevent memory leaking onto other devices
initialize_pytorch()
try:
ds = dataset.to_ray_dataset(shuffle=False)
train_shard = RayDatasetShard(
ds,
features,
training_set_metadata,
)
device = get_torch_device()
model = model.to(device)
trainer = RemoteTrainer(model=model, **executable_kwargs)
return trainer.tune_learning_rate(config, train_shard, **kwargs)
finally:
torch.cuda.empty_cache()
class TqdmCallback(rt.UserCallback):
    """Custom Ray Train callback that updates tqdm progress bars in the driver process."""
    def __init__(self) -> None:
        """Constructor for TqdmCallback."""
        super().__init__()
        self.progress_bars = {}
    def after_report(self, run_context, metrics: list[dict], checkpoint=None) -> None:
        """Called every time ray.train.report is called from subprocesses.
        In Ray 2.x, metrics is a list of metric dicts (one per worker). We look for progress_bar data from the
        coordinator worker.
        """
        for result in metrics:
            progress_bar_opts = result.get("progress_bar")
            if not progress_bar_opts:
                continue
            # Skip commands received by non-coordinators
            if not progress_bar_opts["is_coordinator"]:
                continue
            _id = progress_bar_opts["id"]
            action = progress_bar_opts.get("action")
            if action == "create":
                progress_bar_config = progress_bar_opts.get("config")
                self.progress_bars[_id] = tqdm.tqdm(**progress_bar_config)
            elif action == "close":
                if _id in self.progress_bars:
                    self.progress_bars[_id].close()
            elif action == "update":
                update_by = progress_bar_opts.get("update_by", 1)
                if _id in self.progress_bars:
                    self.progress_bars[_id].update(update_by)
@contextlib.contextmanager
def spread_env(use_gpu: bool = False, num_workers: int = 1, **kwargs):
if TRAIN_ENABLE_WORKER_SPREAD_ENV in os.environ:
# User set this explicitly, so honor their selection
yield
return
try:
if not use_gpu and num_workers > 1:
# When doing CPU-only training, default to a SPREAD policy to avoid
# packing too many workers on a single machine
os.environ[TRAIN_ENABLE_WORKER_SPREAD_ENV] = "1"
yield
finally:
if TRAIN_ENABLE_WORKER_SPREAD_ENV in os.environ:
del os.environ[TRAIN_ENABLE_WORKER_SPREAD_ENV]
def _build_scaling_config(trainer_kwargs: dict[str, Any]) -> ScalingConfig:
"""Convert legacy trainer kwargs to a Ray ScalingConfig."""
return ScalingConfig(
num_workers=trainer_kwargs.get("num_workers", 1),
use_gpu=trainer_kwargs.get("use_gpu", False),
resources_per_worker=trainer_kwargs.get("resources_per_worker"),
)
def run_train_remote(train_loop, trainer_kwargs: dict[str, Any], callbacks=None, datasets=None, train_loop_config=None):
"""Run a distributed training function using Ray TorchTrainer."""
resolved_kwargs = get_trainer_kwargs(**trainer_kwargs)
scaling_config = _build_scaling_config(resolved_kwargs)
torch_config = resolved_kwargs.get("backend", TorchConfig())
run_config_kwargs = {}
if callbacks:
run_config_kwargs["callbacks"] = callbacks
with spread_env(**resolved_kwargs):
torch_trainer = TorchTrainer(
train_loop_per_worker=train_loop,
train_loop_config=train_loop_config,
torch_config=torch_config,
scaling_config=scaling_config,
run_config=RunConfig(**run_config_kwargs),
datasets=datasets,
)
result = torch_trainer.fit()
return result
@register_ray_trainer(MODEL_ECD, default=True)
class RayTrainerV2(BaseTrainer):
def __init__(
self,
model: BaseModel,
trainer_kwargs: dict[str, Any],
data_loader_kwargs: dict[str, Any],
executable_kwargs: dict[str, Any],
**kwargs,
):
self.model = model.cpu()
self.data_loader_kwargs = data_loader_kwargs
self.executable_kwargs = executable_kwargs
self.trainer_kwargs = trainer_kwargs
self._validation_field = None
self._validation_metric = None
@staticmethod
def get_schema_cls():
return ECDTrainerConfig
def train(
self,
training_set: RayDataset,
validation_set: RayDataset | None = None,
test_set: RayDataset | None = None,
**kwargs,
):
executable_kwargs = self.executable_kwargs
kwargs = {
"training_set_metadata": training_set.training_set_metadata,
"features": training_set.features,
**kwargs,
}
dataset = {"train": training_set.to_ray_dataset(shuffle=True)}
if validation_set is not None:
dataset["val"] = validation_set.to_ray_dataset(shuffle=False)
if test_set is not None:
dataset["test"] = test_set.to_ray_dataset(shuffle=False)
train_loop_config = {"executable_kwargs": executable_kwargs, "model_ref": ray.put(self.model), **kwargs}
def _train_loop(config):
train_fn(**config)
result = run_train_remote(
_train_loop,
trainer_kwargs=self.trainer_kwargs,
callbacks=[TqdmCallback()],
datasets=dataset,
train_loop_config=train_loop_config,
)
# Load training results from the checkpoint saved by train_fn
with result.checkpoint.as_directory() as tmpdir:
train_results = torch.load(os.path.join(tmpdir, "train_results.pt"), weights_only=False)
results, self._validation_field, self._validation_metric = train_results
# load state dict back into the model
state_dict, *args = results
self.model.load_state_dict(state_dict)
results = (self.model, *args)
return results
def train_online(self, *args, **kwargs):
# TODO: When this is implemented we also need to update the
# Tqdm flow to report back the callback
raise NotImplementedError()
def tune_batch_size(
self,
config: dict[str, Any],
training_set: RayDataset,
**kwargs,
) -> int:
return ray.get(
tune_batch_size_fn.options(num_cpus=self.num_cpus, num_gpus=self.num_gpus).remote(
dataset=training_set,
data_loader_kwargs=self.data_loader_kwargs,
executable_kwargs=self.executable_kwargs,
model=ray.put(self.model),
ludwig_config=config,
training_set_metadata=training_set.training_set_metadata,
features=training_set.features,
**kwargs,
)
)
def tune_learning_rate(self, config, training_set: RayDataset, **kwargs) -> float:
return ray.get(
tune_learning_rate_fn.options(num_cpus=self.num_cpus, num_gpus=self.num_gpus).remote(
dataset=training_set,
config=config,
data_loader_kwargs=self.data_loader_kwargs,
executable_kwargs=self.executable_kwargs,
model=ray.put(self.model),
training_set_metadata=training_set.training_set_metadata,
features=training_set.features,
**kwargs,
)
)
@property
def validation_field(self):
return self._validation_field
@property
def validation_metric(self):
return self._validation_metric
@property
def config(self) -> ECDTrainerConfig:
return self.executable_kwargs["config"]
@property
def batch_size(self) -> int:
return self.config.batch_size
@batch_size.setter
def batch_size(self, value: int):
self.config.batch_size = value
@property
def eval_batch_size(self) -> int:
return self.config.eval_batch_size if self.config.eval_batch_size is not None else self.config.batch_size
@eval_batch_size.setter
def eval_batch_size(self, value: int):
self.config.eval_batch_size = value
@property
def resources_per_worker(self) -> dict[str, Any]:
trainer_kwargs = get_trainer_kwargs(**self.trainer_kwargs)
return trainer_kwargs.get("resources_per_worker", {})
@property
def num_cpus(self) -> int:
return self.resources_per_worker.get("CPU", 1)
@property
def num_gpus(self) -> int:
return self.resources_per_worker.get("GPU", 0)
def set_base_learning_rate(self, learning_rate: float):
self.config.learning_rate = learning_rate
def shutdown(self):
pass
def eval_fn(
predictor_kwargs: dict[str, Any] = None,
model_ref: ObjectRef = None, # noqa: F821
training_set_metadata: dict[str, Any] = None,
features: dict[str, dict] = None,
**kwargs,
):
"""Ray Train worker function for distributed evaluation.
Runs inside each Ray worker process. Loads the model from an object ref, wraps the eval dataset shard, runs
prediction and evaluation, and saves results to a Ray checkpoint for driver retrieval.
"""
# Pin GPU before loading the model to prevent memory leaking onto other devices
initialize_pytorch()
# Initialize a local distributed strategy so metric modules can sync.
from ludwig.distributed import init_dist_strategy
init_dist_strategy("local")
try:
eval_shard = RayDatasetShard(
rt.get_dataset_shard("eval"),
features,
training_set_metadata,
)
model = ray.get(model_ref)
# Use Ray Train's device assignment which respects use_gpu setting
from ray.train.torch import get_device as ray_get_device
device = ray_get_device()
model = model.to(device)
predictor_cls = get_predictor_cls(model.type())
predictor = predictor_cls(dist_model=model, model=model, report_tqdm_to_ray=True, **predictor_kwargs)
eval_results = predictor.batch_evaluation(eval_shard, **kwargs)
# Save results to a checkpoint so the driver can retrieve them.
# In Ray Train 2.x, result.metrics is only populated when a checkpoint is provided.
eval_results = _make_picklable(eval_results)
with tempfile.TemporaryDirectory() as tmpdir:
torch.save(eval_results, os.path.join(tmpdir, "eval_results.pt"))
rt.report(metrics={}, checkpoint=Checkpoint.from_directory(tmpdir))
finally:
torch.cuda.empty_cache()
class RayPredictor(BasePredictor):
def __init__(
self, model: BaseModel, df_engine: DataFrameEngine, trainer_kwargs, data_loader_kwargs, **predictor_kwargs
):
self.batch_size = predictor_kwargs["batch_size"]
self.trainer_kwargs = trainer_kwargs
self.data_loader_kwargs = data_loader_kwargs
self.predictor_kwargs = predictor_kwargs
self.actor_handles = []
self.model = model.cpu()
self.df_engine = df_engine
def get_trainer_kwargs(self) -> dict[str, Any]:
return get_trainer_kwargs(**self.trainer_kwargs)
def get_resources_per_worker(self) -> tuple[int, int]:
trainer_kwargs = self.get_trainer_kwargs()
resources_per_worker = trainer_kwargs.get("resources_per_worker", {})
num_gpus = resources_per_worker.get("GPU", 0)
num_cpus = resources_per_worker.get("CPU", (1 if num_gpus == 0 else 0))
return num_cpus, num_gpus
def batch_predict(self, dataset: RayDataset, *args, collect_logits: bool = False, **kwargs):
self._check_dataset(dataset)
predictor_kwargs = self.predictor_kwargs
output_columns = get_output_columns(self.model.output_features, include_logits=collect_logits)
batch_predictor = self.get_batch_infer_model(
self.model,
predictor_kwargs,
output_columns,
dataset.features,
dataset.training_set_metadata,
*args,
collect_logits=collect_logits,
**kwargs,
)
columns = [f.proc_column for f in self.model.input_features.values()]
def to_tensors(df: pd.DataFrame) -> pd.DataFrame:
for c in columns:
df[c] = cast_as_tensor_dtype(df[c])
return df
num_cpus, num_gpus = self.get_resources_per_worker()
predictions = dataset.ds.map_batches(to_tensors, batch_format="pandas").map_batches(
batch_predictor,
batch_size=self.batch_size,
compute=ray.data.ActorPoolStrategy(),
batch_format="pandas",
num_cpus=num_cpus,
num_gpus=num_gpus,
)
predictions = self.df_engine.from_ray_dataset(predictions)
return predictions
def predict_single(self, batch):
raise NotImplementedError("predict_single can only be called on a local predictor")
def batch_evaluation(
self,
dataset: RayDataset,
collect_predictions: bool = False,
collect_logits=False,
**kwargs,
):
# We need to be in a distributed context to collect the aggregated metrics, since it relies on collective
# communication ops. However, distributed training is not suitable for transforming one big dataset to another.
# For that we will use Ray Datasets. Therefore, we break this up into two separate steps, and two passes over
# the dataset. In the future, we can explore ways to combine these into a single step to reduce IO.
# Collect eval metrics by distributing work across nodes / gpus
datasets = {"eval": dataset.to_ray_dataset(shuffle=False)}
predictor_kwargs = {
**self.predictor_kwargs,
"collect_predictions": False,
}
eval_loop_config = {
"predictor_kwargs": predictor_kwargs,
"model_ref": ray.put(self.model),
"training_set_metadata": dataset.training_set_metadata,
"features": dataset.features,
**kwargs,
}
def _eval_loop(config):
eval_fn(**config)
result = run_train_remote(
_eval_loop,
trainer_kwargs=self.trainer_kwargs,
datasets=datasets,
train_loop_config=eval_loop_config,
)
# Load eval results from the checkpoint saved by eval_fn
with result.checkpoint.as_directory() as tmpdir:
eval_stats, _ = torch.load(os.path.join(tmpdir, "eval_results.pt"), weights_only=False)
predictions = None
if collect_predictions:
# Collect eval predictions by using Ray Datasets to transform partitions of the data in parallel
predictions = self.batch_predict(dataset, collect_logits=collect_logits)
return eval_stats, predictions
def batch_collect_activations(self, model, *args, **kwargs):
raise NotImplementedError("Ray backend does not support collecting activations at this time.")
    def _check_dataset(self, dataset):
        if not isinstance(dataset, RayDataset):
            raise RuntimeError(f"Ray backend requires RayDataset for inference, found: {type(dataset)}")
def shutdown(self):
for handle in self.actor_handles:
ray.kill(handle)
self.actor_handles.clear()
def get_batch_infer_model(
self,
model: "LudwigModel", # noqa: F821
predictor_kwargs: dict[str, Any],
output_columns: list[str],
features: dict[str, dict],
training_set_metadata: dict[str, Any],
*args,
**kwargs,
):
model_ref = ray.put(model)
_, num_gpus = self.get_resources_per_worker()
class BatchInferModel:
def __init__(self):
model = ray.get(model_ref)
# Respect the GPU setting from resources_per_worker.
# When num_gpus=0, force CPU even if CUDA is available on the machine,
# to avoid device mismatches between model outputs and targets.
if num_gpus > 0:
device = get_torch_device()
else:
device = "cpu"
self.model = model.to(device)
self.output_columns = output_columns
self.features = features
self.training_set_metadata = training_set_metadata
self.reshape_map = {
f[PROC_COLUMN]: training_set_metadata[f[NAME]].get("reshape") for f in features.values()
}
predictor_cls = get_predictor_cls(self.model.type())
predictor = predictor_cls(dist_model=self.model, model=self.model, **predictor_kwargs)
self.predict = partial(predictor.predict_single, *args, **kwargs)
def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
dataset = self._prepare_batch(df)
predictions = self.predict(batch=dataset).set_index(df.index)
ordered_predictions = predictions[self.output_columns]
return ordered_predictions
def _prepare_batch(self, batch: pd.DataFrame) -> dict[str, np.ndarray]:
res = {}
for c in self.features.keys():
if self.features[c][TYPE] not in _SCALAR_TYPES:
# Ensure columns stacked instead of turned into np.array([np.array, ...], dtype=object) objects
res[c] = np.stack(batch[c].values)
else:
res[c] = batch[c].to_numpy()
for c in self.features.keys():
reshape = self.reshape_map.get(c)
if reshape is not None:
res[c] = res[c].reshape((-1, *reshape))
return res
return BatchInferModel
class RayBackend(RemoteTrainingMixin, Backend):
BACKEND_TYPE = "ray"
def __init__(self, processor=None, trainer=None, loader=None, preprocessor_kwargs=None, **kwargs):
super().__init__(dataset_manager=RayDatasetManager(self), **kwargs)
self._preprocessor_kwargs = preprocessor_kwargs or {}
self._df_engine = _get_df_engine(processor)
self._distributed_kwargs = trainer or {}
self._pytorch_kwargs = {}
self._data_loader_kwargs = loader or {}
self._preprocessor_pg = None
def initialize(self):
initialize_ray()
dask.config.set(scheduler=ray_dask_get)
# Disable placement groups on dask
dask.config.set(annotations={"ray_remote_args": {"placement_group": None}})
# Prevent Dask from converting object-dtype columns to PyArrow strings,
# which corrupts binary data, numpy arrays, and complex Python objects.
dask.config.set({"dataframe.convert-string": False})
    def generate_bundles(self, num_cpu):
        # Ray requires that each bundle be schedulable on a single node, so a single
        # bundle of 320 CPUs would never get scheduled. As a simple heuristic, request
        # one CPU per bundle.
        return [{"CPU": 1} for _ in range(int(num_cpu))]
    @contextlib.contextmanager
    def provision_preprocessing_workers(self):
        num_cpu = self._preprocessor_kwargs.get("num_cpu")
        if not num_cpu:
            logger.info("num_cpu is not set in the backend config; provision_preprocessing_workers() is a no-op.")
            yield
        else:
            bundles = self.generate_bundles(num_cpu)
            logger.info("Requesting bundles of %s for preprocessing", bundles)
            self._preprocessor_pg = placement_group(bundles)
            ready = self._preprocessor_pg.wait(FIFTEEN_MINS_IN_S)
            if not ready:
                # Remove the group and reset the handle so no stale reference is left behind.
                self._release_preprocessing_workers()
                raise TimeoutError(
                    "Ray timed out while provisioning the placement group for preprocessing."
                    f" {num_cpu} CPUs were requested but could not be provisioned."
                )
            logger.info("%s CPUs were requested and successfully provisioned", num_cpu)
            try:
                with dask.config.set(annotations={"ray_remote_args": {"placement_group": self._preprocessor_pg}}):
                    yield
            finally:
                self._release_preprocessing_workers()
def _release_preprocessing_workers(self):
if self._preprocessor_pg is not None:
remove_placement_group(self._preprocessor_pg)
self._preprocessor_pg = None
def initialize_pytorch(self, **kwargs):
# Make sure we don't claim any GPU resources on the head node
initialize_pytorch(gpus=-1)
self._pytorch_kwargs = kwargs
def create_trainer(self, model: BaseModel, **kwargs) -> "BaseTrainer": # noqa: F821
executable_kwargs = {**kwargs, **self._pytorch_kwargs}
if model.type() == MODEL_LLM:
from ludwig.trainers.registry import get_llm_ray_trainers_registry
trainer_config = kwargs.get("config")
trainer_type = trainer_config.type if trainer_config else None
trainer_cls = get_from_registry(trainer_type, get_llm_ray_trainers_registry())
else:
trainer_cls = get_from_registry(model.type(), get_ray_trainers_registry())
# Deep copy to workaround https://github.com/ray-project/ray/issues/24139
all_kwargs = {
"model": model,
"trainer_kwargs": copy.deepcopy(self._distributed_kwargs),
"data_loader_kwargs": self._data_loader_kwargs,
"executable_kwargs": executable_kwargs,
}
all_kwargs.update(kwargs)
return trainer_cls(**all_kwargs)
def create_predictor(self, model: BaseModel, **kwargs):
executable_kwargs = {**kwargs, **self._pytorch_kwargs}
return RayPredictor(
model,
self.df_engine,
copy.deepcopy(self._distributed_kwargs),
self._data_loader_kwargs,
**executable_kwargs,
)
def set_distributed_kwargs(self, **kwargs):
self._distributed_kwargs = kwargs
@property
def df_engine(self):
return self._df_engine
@property
def supports_multiprocessing(self):
return False
def check_lazy_load_supported(self, feature):
if not feature[PREPROCESSING]["in_memory"]:
raise ValueError(
f"RayBackend does not support lazy loading of data files at train time. "
f"Set preprocessing config `in_memory: True` for feature {feature[NAME]}"
)
def read_binary_files(self, column: Series, map_fn: Callable | None = None, file_size: int | None = None) -> Series:
column = column.fillna(np.nan).replace([np.nan], [None]) # normalize NaNs to None
# Assume that the list of filenames is small enough to fit in memory. Should be true unless there
# are literally billions of filenames.
# TODO(travis): determine if there is a performance penalty to passing in individual files instead of
# a directory. If so, we can do some preprocessing to determine if it makes sense to read the full directory
# then filter out files as a postprocessing step (depending on the ratio of included to excluded files in
# the directory). Based on a preliminary look at how Ray handles directory expansion to files, it looks like
# there should not be any difference between providing a directory versus a list of files.
pd_column = self.df_engine.compute(column)
fnames = pd_column.values.tolist()
idxs = pd_column.index.tolist()
# Sample a filename to extract the filesystem info
sample_fname = fnames[0]
if isinstance(sample_fname, str):
fs, _ = get_fs_and_path(sample_fname)
filesystem = PyFileSystem(FSSpecHandler(fs))
paths_and_idxs = list(zip(fnames, idxs))
ds = read_binary_files_with_index(paths_and_idxs, filesystem=filesystem)
# Rename "data" column to "value" for downstream compatibility
ds = ds.rename_columns({"data": "value"})
else:
# Assume the path has already been read in, so just convert directly to a dataset
# Name the column "value" to match the behavior of the above
column_df = column.to_frame(name="value")
column_df["idx"] = column_df.index
ds = self.df_engine.to_ray_dataset(column_df)
# Collect the Ray Dataset to pandas to avoid Arrow's string coercion
# for binary/object columns (to_dask() converts bytes to string[pyarrow],
# corrupting binary data and complex Python objects).
pdf = ds.to_pandas()
if map_fn is not None:
with use_credentials(conf):
pdf["value"] = pdf["value"].map(map_fn)
pdf = pdf.rename(columns={"value": column.name})
if "idx" in pdf.columns:
pdf = pdf.set_index("idx", drop=True)
pdf.index.name = column.index.name
# Convert to Dask for downstream compatibility.
# Note: dataframe.convert-string is disabled globally in RayBackend.initialize()
# to prevent object-dtype columns from being coerced to PyArrow strings.
df = self.df_engine.from_pandas(pdf)
return df[column.name]
@property
def num_nodes(self) -> int:
if not ray.is_initialized():
return 1
return len(ray.nodes())
@property
def num_training_workers(self) -> int:
return self._distributed_kwargs.get("num_workers", 1)
def max_concurrent_trials(self, hyperopt_config) -> int | None:
# Limit concurrency based on available resources to avoid deadlocks between
# Ray Tune trials and the Ray Datasets used internally for distributed training.
resources = self.get_available_resources()
num_cpus_per_trial = self._distributed_kwargs.get("resources_per_worker", {}).get("CPU", 1)
num_workers = self._distributed_kwargs.get("num_workers", 1)
cpus_per_trial = num_cpus_per_trial * num_workers
if cpus_per_trial > 0 and resources.cpus > 0:
return max(1, int(resources.cpus // cpus_per_trial))
return None
def tune_batch_size(self, evaluator_cls, dataset_len: int) -> int:
evaluator = evaluator_cls()
return evaluator.select_best_batch_size(dataset_len)
def batch_transform(self, df, batch_size: int, transform_fn, name: str | None = None):
name = name or "Batch Transform"
import dask.dataframe as dd
from ludwig.utils.dataframe_utils import from_batches, to_batches
# Compute Dask DataFrame to pandas before batching, as Dask-expr
# doesn't support row slicing via integer indexing (df[i:j]).
npartitions = df.npartitions if hasattr(df, "npartitions") else 1
df = self.df_engine.compute(df)
batches = to_batches(df, batch_size)
transform = transform_fn()
out_batches = [transform(batch.reset_index(drop=True)) for batch in batches]
out_df = from_batches(out_batches).reset_index(drop=True)
# Convert back to Dask so downstream code (split, etc.) still works
return dd.from_pandas(out_df, npartitions=max(1, npartitions))
def get_available_resources(self) -> Resources:
resources = ray.cluster_resources()
return Resources(cpus=resources.get("CPU", 0), gpus=resources.get("GPU", 0))
def initialize_ray():
if not ray.is_initialized():
try:
ray.init("auto", ignore_reinit_error=True)
except ConnectionError:
init_ray_local()
def init_ray_local():
logger.info("Initializing new Ray cluster...")
ray.init(ignore_reinit_error=True)
================================================
FILE: ludwig/backend/utils/__init__.py
================================================
================================================
FILE: ludwig/backend/utils/storage.py
================================================
import contextlib
from typing import Any, Optional, Union
from ludwig.utils import data_utils
CredInputs = Optional[Union[str, dict[str, Any]]]
DEFAULTS = "defaults"
ARTIFACTS = "artifacts"
DATASETS = "datasets"
CACHE = "cache"
class Storage:
def __init__(self, creds: dict[str, Any] | None):
self._creds = creds
@contextlib.contextmanager
def use_credentials(self):
with data_utils.use_credentials(self._creds):
yield
@property
def credentials(self) -> dict[str, Any] | None:
return self._creds
class StorageManager:
def __init__(
self,
defaults: CredInputs = None,
artifacts: CredInputs = None,
datasets: CredInputs = None,
cache: CredInputs = None,
):
defaults = load_creds(defaults)
cred_inputs = {
DEFAULTS: defaults,
ARTIFACTS: load_creds(artifacts),
DATASETS: load_creds(datasets),
CACHE: load_creds(cache),
}
self.storages = {k: Storage(v if v is not None else defaults) for k, v in cred_inputs.items()}
@property
def defaults(self) -> Storage:
return self.storages[DEFAULTS]
@property
def artifacts(self) -> Storage:
"""TODO(travis): Currently used for hyperopt, but should be used for all outputs."""
return self.storages[ARTIFACTS]
@property
def datasets(self) -> Storage:
"""TODO(travis): Should be used to read in datasets."""
return self.storages[DATASETS]
@property
def cache(self) -> Storage:
return self.storages[CACHE]
def load_creds(cred: CredInputs) -> dict[str, Any] | None:
    # A string is treated as a path to a JSON file; dicts and None pass through unchanged.
    if isinstance(cred, str):
        cred = data_utils.load_json(cred)
    return cred
================================================
FILE: ludwig/benchmarking/README.md
================================================
# Ludwig Benchmarking
### Some use cases
- Regression testing for ML experiments across releases and PRs.
- Model performance testing for experimenting with new features and hyperparameters.
- Resource usage tracking for the full ML pipeline.
## Ludwig benchmarking CLI and API
To run benchmarks, run the following command from the command line
```
ludwig benchmark --benchmarking_config path/to/benchmarking/config.yaml
```
To use the API
```
from ludwig.benchmarking.benchmark import benchmark
benchmarking_config_path = "path/to/benchmarking/config.yaml"
benchmark(benchmarking_config_path)
```
In what follows, we describe what the benchmarking config looks like for
multiple use cases.
## The benchmarking config
The benchmarking config is where you can specify
1. The datasets you want to run the benchmarks on and their configs.
1. Whether these experiments are hyperopt or regular train and eval experiments.
1. The name of the experiment.
1. A python script to edit the specified Ludwig configs programmatically/on the fly.
1. The export path of these experiments' artifacts (remote or local).
1. Whether to use `LudwigProfiler` to track resource
usage for preprocessing, training, and evaluation of the experiment.
You can find an example of a benchmarking config in the `examples/` directory.
## Basic Usage
Say you implemented a new feature and would like to test it on several datasets.
In this case, this is what the benchmarking config could look like
```
experiment_name: SMOTE_test
hyperopt: false
export:
export_artifacts: true
export_base_path: s3://benchmarking.us-west-2.ludwig.com/bench/ # include the slash at the end.
experiments:
- dataset_name: ames_housing
config_path: /home/ray/configs/ames_housing_SMOTE.yaml
experiment_name: SMOTE_test_with_hyperopt
hyperopt: true
- dataset_name: protein
- ...
...
- dataset_name: mercedes_benz_greener
config_path: /home/ray/configs/mercedes_benz_greener_SMOTE.yaml
```
For each experiment:
- `dataset_name`: name of the dataset in `ludwig.datasets` to run the benchmark on.
- `config_path` (optional): path to Ludwig config. If not specified, this will load
the config corresponding to the dataset only containing `input_features` and
`output_features`.
This will run `LudwigModel.experiment` on the datasets with their specified configs.
If these configs contain a hyperopt section and you'd like to run hyperopt, set
`hyperopt: true`.
You can specify the same dataset multiple times with different configs.
**Exporting artifacts**
Specifying `export_artifacts: true` exports the experiment artifacts
to the `export_base_path`. Once the model is trained and the artifacts are pushed
to the specified path, you will get a message similar to the following:
```
Uploaded metrics report and experiment config to
s3://benchmarking.us-west-2.ludwig.com/bench/ames_housing/SMOTE_test
```
This is the directory structure of the exported artifacts for one of the experiments.
```
s3://benchmarking.us-west-2.ludwig.com/bench/
└── ames_housing
└── SMOTE_test
├── config.yaml
└── experiment_run
├── description.json
├── model
│ ├── logs
│ │ ├── test
│ │ │ └── events.out.tfevents.1663320893.macbook-pro.lan.8043.2
│ │ ├── training
│ │ │ └── events.out.tfevents.1663320893.macbook-pro.lan.8043.0
│ │ └── validation
│ │ └── events.out.tfevents.1663320893.macbook-pro.lan.8043.1
│ ├── model_hyperparameters.json
│ ├── training_progress.json
│ └── training_set_metadata.json
├── test_statistics.json
└── training_statistics.json
```
Note that model checkpoints are not exported. Any other experiments on
the `ames_housing` dataset will also live under
`s3://benchmarking.us-west-2.ludwig.com/bench/ames_housing/`
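The layout above suggests that each experiment's artifacts land under `<export_base_path>/<dataset_name>/<experiment_name>`. As a minimal sketch of that naming rule (the helper `artifact_path` is hypothetical and not part of Ludwig; the layout is inferred from the upload message shown above):

```python
import posixpath


def artifact_path(export_base_path: str, dataset_name: str, experiment_name: str) -> str:
    """Hypothetical helper: build the directory where one experiment's artifacts would live."""
    # posixpath.join always uses forward slashes, which is what URIs such as s3:// expect,
    # and it does not add a second slash when the base path already ends with one.
    return posixpath.join(export_base_path, dataset_name, experiment_name)


path = artifact_path("s3://benchmarking.us-west-2.ludwig.com/bench/", "ames_housing", "SMOTE_test")
print(path)
```

Running several experiments on the same dataset only changes the last path component, so they end up as siblings under the dataset directory.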
**Overriding parameters**
The benchmarking config's global parameters `experiment_name` and `hyperopt` can be overridden
if specified within an experiment.
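The override rule can be pictured as "the experiment's own value wins, otherwise the global value is copied in". The sketch below illustrates that rule only; `resolve_experiments` and `GLOBAL_KEYS` are hypothetical names, and the actual propagation in Ludwig may handle more keys or differ in detail.

```python
import copy

# Assumption for illustration: only the two keys named above are propagated.
GLOBAL_KEYS = ("experiment_name", "hyperopt")


def resolve_experiments(benchmarking_config: dict) -> list[dict]:
    """Return a copy of each experiment with missing global parameters filled in."""
    resolved = []
    for experiment in benchmarking_config["experiments"]:
        merged = copy.deepcopy(experiment)
        for key in GLOBAL_KEYS:
            if key in benchmarking_config:
                # setdefault keeps a value set on the experiment and fills it in otherwise.
                merged.setdefault(key, benchmarking_config[key])
        resolved.append(merged)
    return resolved


config = {
    "experiment_name": "SMOTE_test",
    "hyperopt": False,
    "experiments": [
        {"dataset_name": "ames_housing", "experiment_name": "SMOTE_test_with_hyperopt", "hyperopt": True},
        {"dataset_name": "protein"},
    ],
}
resolved = resolve_experiments(config)
for experiment in resolved:
    print(experiment)
```

Here the `ames_housing` experiment keeps its own name and runs hyperopt, while `protein` inherits `SMOTE_test` and `hyperopt: false` from the global section.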
## Programmatically editing Ludwig configs
To apply some changes to multiple Ludwig configs, you can specify a path to a python script
that does this without the need to do manual modifications across many configs. Example:
```
experiment_name: logistic_regression_hyperopt
hyperopt: true
process_config_file_path: /home/ray/process_config.py
export:
export_artifacts: true
export_base_path: s3://benchmarking.us-west-2.ludwig.com/bench/ # include the slash at the end.
experiments:
- dataset_name: ames_housing
config_path: /home/ray/configs/ames_housing_SMOTE.yaml
...
```
In `/home/ray/process_config.py`, define the following function and add custom code to modify
Ludwig configs:
```
def process_config(ludwig_config: dict, experiment_dict: dict) -> dict:
"""Modify a Ludwig config.
:param ludwig_config: a Ludwig config.
:param experiment_dict: a benchmarking config experiment dictionary.
returns: a modified Ludwig config.
"""
# code to modify the Ludwig config.
return ludwig_config
```
View the `examples/` folder for an example `process_config.py`.
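As a small illustration of the kind of edit such a script might make, the sketch below caps the number of epochs for every dataset and changes the batch size for one of them. The specific values and the per-dataset rule are made up for this example.

```python
def process_config(ludwig_config: dict, experiment_dict: dict) -> dict:
    """Example edit: shorten training everywhere and tweak one dataset."""
    # Create the trainer section if the config does not have one yet.
    trainer = ludwig_config.setdefault("trainer", {})
    trainer["epochs"] = 5  # illustrative value

    # A change that depends on the experiment being run.
    if experiment_dict["dataset_name"] == "ames_housing":
        trainer["batch_size"] = 64  # illustrative value

    # The modified config must be returned.
    return ludwig_config


base_config = {
    "input_features": [{"name": "LotArea", "type": "number"}],
    "output_features": [{"name": "SalePrice", "type": "number"}],
}
modified = process_config(base_config, {"dataset_name": "ames_housing"})
print(modified["trainer"])
```

Because the function receives the experiment dictionary, one script can cover every dataset in the benchmarking config.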
## Benchmarking the resource usage with `LudwigProfiler`
To benchmark the resource usage of the preprocessing, training, and evaluation
steps of `LudwigModel.experiment`, add the following to the benchmarking config's
global parameters:
```
profiler:
enable: true
use_torch_profiler: false
logging_interval: 0.1
```
- `enable: true` will run benchmarking with `LudwigProfiler`.
- `use_torch_profiler: false` will skip using the torch profiler.
- `logging_interval: 0.1` will instruct `LudwigProfiler` to collect
resource usage information every 0.1 seconds.
Note that profiling is only enabled in the case where `hyperopt: false`.
`LudwigProfiler` is passed in to `LudwigModel` callbacks. The specific
callbacks that will be called are:
- `on_preprocess_(start/end)`
- `on_train_(start/end)`
- `on_evaluation_(start/end)`
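To make the start/end pairing concrete, here is a self-contained stand-in that only measures wall-clock time per phase. `PhaseTimer` is a hypothetical class written for this README; it does not use `LudwigProfiler`, and the real callback hooks receive arguments that are ignored here.

```python
import time


class PhaseTimer:
    """Hypothetical stand-in for a profiler callback: records seconds spent in each phase."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}
        self._started: dict[str, float] = {}

    def _start(self, phase: str) -> None:
        self._started[phase] = time.perf_counter()

    def _end(self, phase: str) -> None:
        self.durations[phase] = time.perf_counter() - self._started.pop(phase)

    # Hook names follow the list above; arguments are accepted and ignored.
    def on_preprocess_start(self, *args, **kwargs) -> None:
        self._start("preprocessing")

    def on_preprocess_end(self, *args, **kwargs) -> None:
        self._end("preprocessing")

    def on_train_start(self, *args, **kwargs) -> None:
        self._start("training")

    def on_train_end(self, *args, **kwargs) -> None:
        self._end("training")

    def on_evaluation_start(self, *args, **kwargs) -> None:
        self._start("evaluation")

    def on_evaluation_end(self, *args, **kwargs) -> None:
        self._end("evaluation")


timer = PhaseTimer()
timer.on_train_start()
time.sleep(0.01)  # stands in for the training work
timer.on_train_end()
print(timer.durations)
```

Each `*_start` hook opens a measurement and the matching `*_end` hook closes it, which is why the output is organized by phase.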
This is an example directory output when using the profiler:
```
full_bench_with_profiler_with_torch
├── config.yaml
├── experiment_run
├── system_resource_usage
│ ├── evaluation
│ │ └── run_0.json
│ ├── preprocessing
│ │ └── run_0.json
│ └── training
│ └── run_0.json
└── torch_ops_resource_usage
├── evaluation
│ └── run_0.json
├── preprocessing
│ └── run_0.json
└── training
└── run_0.json
```
The only difference from the export layout shown earlier is the addition of the `system_resource_usage` and `torch_ops_resource_usage` directories.
The difference between these two outputs can be found in the `LudwigProfiler` README.
## Parameters and defaults
Each of these parameters can also be specified in the experiments section to override the global value.
If not specified, the value of the global parameter will be propagated to the experiments.
- `experiment_name` (required): name of the benchmarking run.
- `export` (required): dictionary specifying whether to export the experiment artifacts and the export path.
- `hyperopt` (optional): whether this is a hyperopt run or `LudwigModel.experiment`.
- `process_config_file_path` (optional): path to python script that will modify configs.
- `profiler` (optional): dictionary specifying whether to use the profiler and its parameters.
## Comparing experiments
You can summarize the exported artifacts of two experiments on multiple datasets.
For example, if you ran two experiments on the dataset `ames_housing` called
`small_batch_size` and `big_batch_size` where you varied the batch size,
you can create a diff summary of the model performance and resource usage of the two
experiments. This is how:
```
from ludwig.benchmarking.summarize import summarize_metrics
dataset_list, metric_diffs, resource_usage_diffs = summarize_metrics(
bench_config_path = "path/to/benchmarking_config.yaml",
base_experiment = "small_batch_size",
experimental_experiment = "big_batch_size",
download_base_path = "s3://benchmarking.us-west-2.ludwig.com/bench/")
```
This will print
```
Model performance metrics for *small_batch_size* vs. *big_batch_size* on dataset *ames_housing*
Output Feature Name Metric Name small_batch_size big_batch_size Diff Diff Percentage
SalePrice mean_absolute_error 180551.609 180425.109 -126.5 -0.07
SalePrice mean_squared_error 38668763136.0 38618021888.0 -50741248.0 -0.131
SalePrice r2 -5.399 -5.391 0.008 -0.156
SalePrice root_mean_squared_error 196643.75 196514.688 -129.062 -0.066
SalePrice root_mean_squared_percentage_error 1.001 1.001 -0.001 -0.07
Exported a CSV report to summarize_output/performance_metrics/ames_housing/small_batch_size-big_batch_size.csv
Resource usage for *small_batch_size* vs. *big_batch_size* on *training* of dataset *ames_housing*
Metric Name small_batch_size big_batch_size Diff Diff Percentage
average_cpu_memory_usage 106.96 Mb 109.43 Mb 2.48 Mb 2.315
average_cpu_utilization 1.2966666666666666 1.345 0.04833333333333334 3.728
average_global_cpu_memory_available 3.46 Gb 3.46 Gb -1.10 Mb -0.031
average_global_cpu_utilization 37.43333333333334 40.49 3.056666666666665 8.166
disk_footprint 372736 413696 40960 10.989
max_cpu_memory_usage 107.50 Mb 111.93 Mb 4.43 Mb 4.117
max_cpu_utilization 1.44 1.67 0.22999999999999998 15.972
max_global_cpu_utilization 54.1 60.9 6.799999999999997 12.569
min_global_cpu_memory_available 3.46 Gb 3.46 Gb -712.00 Kb -0.02
num_cpu 10 10 0 0.0
num_oom_events 0 0 0 inf
num_runs 1 1 0 0.0
torch_cpu_average_memory_used 81.44 Kb 381.15 Kb 299.70 Kb 367.992
torch_cpu_max_memory_used 334.26 Kb 2.65 Mb 2.32 Mb 711.877
torch_cpu_time 57.400ms 130.199ms 72.799ms 126.828
torch_cuda_time 0.000us 0.000us 0.000us inf
total_cpu_memory_size 32.00 Gb 32.00 Gb 0 b 0.0
total_execution_time 334.502ms 1.114s 779.024ms 232.891
Exported a CSV report to summarize_output/resource_usage_metrics/ames_housing/training-small_batch_size-big_batch_size.csv
Resource usage for *small_batch_size* vs. *big_batch_size* on *evaluation* of dataset *ames_housing*
...
Resource usage for *small_batch_size* vs. *big_batch_size* on *preprocessing* of dataset *ames_housing*
...
```
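The `Diff` and `Diff Percentage` columns in the sample output are consistent with "experimental minus base" and "diff relative to base, in percent", with `inf` reported when the base value is 0. The sketch below reproduces that arithmetic; it is inferred from the printed table, and `diff_and_percentage` is a hypothetical helper rather than the function used by `summarize_metrics`.

```python
import math


def diff_and_percentage(base: float, experimental: float) -> tuple[float, float]:
    """Return (diff, diff percentage) of an experimental value relative to a base value."""
    diff = experimental - base
    if base == 0:
        # Assumption: mirror the `inf` shown in the table for rows whose base value is 0.
        return diff, math.inf
    return diff, 100 * diff / base


# disk_footprint row from the training table: 372736 vs. 413696
diff, percentage = diff_and_percentage(372736, 413696)
print(diff, round(percentage, 3))
```

A positive percentage means the experimental run used more of the resource (or scored higher on the metric) than the base run.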
================================================
FILE: ludwig/benchmarking/__init__.py
================================================
================================================
FILE: ludwig/benchmarking/artifacts.py
================================================
import os
from dataclasses import dataclass
from typing import Any
from ludwig.globals import MODEL_FILE_NAME
from ludwig.types import ModelConfigDict, TrainingSetMetadataDict
from ludwig.utils.data_utils import load_json, load_yaml
@dataclass
class BenchmarkingResult:
# The Ludwig benchmarking config.
benchmarking_config: dict[str, Any]
# The config for one experiment.
experiment_config: dict[str, Any]
# The Ludwig config used to run the experiment.
ludwig_config: ModelConfigDict
# The python script that is used to process the config before being used.
process_config_file: str
# Loaded `description.json` file.
description: dict[str, Any]
# Loaded `test_statistics.json` file.
test_statistics: dict[str, Any]
# Loaded `training_statistics.json` file.
training_statistics: dict[str, Any]
# Loaded `model_hyperparameters.json` file.
model_hyperparameters: dict[str, Any]
# Loaded `training_progress.json` file.
training_progress: dict[str, Any]
# Loaded `training_set_metadata.json` file.
training_set_metadata: TrainingSetMetadataDict
def build_benchmarking_result(benchmarking_config: dict, experiment_idx: int):
experiment_config = benchmarking_config["experiments"][experiment_idx]
process_config_file = ""
if experiment_config["process_config_file_path"]:
with open(experiment_config["process_config_file_path"]) as f:
process_config_file = "".join(f.readlines())
experiment_run_path = os.path.join(experiment_config["experiment_name"], "experiment_run")
return BenchmarkingResult(
benchmarking_config=benchmarking_config,
experiment_config=experiment_config,
ludwig_config=load_yaml(experiment_config["config_path"]),
process_config_file=process_config_file,
description=load_json(os.path.join(experiment_run_path, "description.json")),
test_statistics=load_json(os.path.join(experiment_run_path, "test_statistics.json")),
training_statistics=load_json(os.path.join(experiment_run_path, "training_statistics.json")),
model_hyperparameters=load_json(
os.path.join(experiment_run_path, MODEL_FILE_NAME, "model_hyperparameters.json")
),
training_progress=load_json(os.path.join(experiment_run_path, MODEL_FILE_NAME, "training_progress.json")),
training_set_metadata=load_json(
os.path.join(experiment_run_path, MODEL_FILE_NAME, "training_set_metadata.json")
),
)
================================================
FILE: ludwig/benchmarking/benchmark.py
================================================
import argparse
import importlib.util
import logging
import os
import shutil
from typing import Any
import ludwig.datasets
from ludwig.api import LudwigModel
from ludwig.benchmarking.artifacts import BenchmarkingResult, build_benchmarking_result
from ludwig.benchmarking.profiler_callbacks import LudwigProfilerCallback
from ludwig.benchmarking.utils import (
create_default_config,
delete_hyperopt_outputs,
delete_model_checkpoints,
export_artifacts,
load_from_module,
populate_benchmarking_config_with_defaults,
propagate_global_parameters,
save_yaml,
validate_benchmarking_config,
)
from ludwig.contrib import add_contrib_callback_args
from ludwig.hyperopt.run import hyperopt
from ludwig.utils.data_utils import load_yaml
logger = logging.getLogger()
def setup_experiment(experiment: dict[str, str]) -> dict[Any, Any]:
"""Set up the backend and load the Ludwig config.
Args:
experiment: dictionary containing the dataset name, config path, and experiment name.
Returns a Ludwig config.
"""
shutil.rmtree(os.path.join(experiment["experiment_name"]), ignore_errors=True)
if "config_path" not in experiment:
experiment["config_path"] = create_default_config(experiment)
model_config = load_yaml(experiment["config_path"])
if experiment["process_config_file_path"]:
process_config_spec = importlib.util.spec_from_file_location(
"process_config_file_path.py", experiment["process_config_file_path"]
)
process_module = importlib.util.module_from_spec(process_config_spec)
process_config_spec.loader.exec_module(process_module)
model_config = process_module.process_config(model_config, experiment)
experiment["config_path"] = experiment["config_path"].replace(
".yaml", "-" + experiment["experiment_name"] + "-modified.yaml"
)
save_yaml(experiment["config_path"], model_config)
return model_config
def benchmark_one(experiment: dict[str, str | dict[str, str]]) -> None:
"""Run a Ludwig experiment and track metrics given a dataset name.
Args:
experiment: dictionary containing the dataset name, config path, and experiment name.
"""
logger.info(f"\nRunning experiment *{experiment['experiment_name']}* on dataset *{experiment['dataset_name']}*")
# configuring backend and paths
model_config = setup_experiment(experiment)
# loading dataset
# dataset_module = importlib.import_module(f"ludwig.datasets.{experiment['dataset_name']}")
dataset_module = ludwig.datasets.get_dataset(experiment["dataset_name"])
dataset = load_from_module(dataset_module, model_config["output_features"][0])
if experiment["hyperopt"]:
# run hyperopt
hyperopt(
config=model_config,
dataset=dataset,
output_directory=experiment["experiment_name"],
skip_save_model=True,
skip_save_training_statistics=True,
skip_save_progress=True,
skip_save_log=True,
skip_save_processed_input=True,
skip_save_unprocessed_output=True,
skip_save_predictions=True,
skip_save_training_description=True,
hyperopt_log_verbosity=0,
)
delete_hyperopt_outputs(experiment["experiment_name"])
else:
backend = None
ludwig_profiler_callbacks = None
if experiment["profiler"]["enable"]:
ludwig_profiler_callbacks = [LudwigProfilerCallback(experiment)]
# Currently, only local backend is supported with LudwigProfiler.
backend = "local"
logger.info("Currently, only local backend is supported with LudwigProfiler.")
# run model and capture metrics
model = LudwigModel(
config=model_config, callbacks=ludwig_profiler_callbacks, logging_level=logging.ERROR, backend=backend
)
model.experiment(
dataset=dataset,
output_directory=experiment["experiment_name"],
skip_save_processed_input=True,
skip_save_unprocessed_output=True,
skip_save_predictions=True,
skip_collect_predictions=True,
)
delete_model_checkpoints(experiment["experiment_name"])
def benchmark(
benchmarking_config: dict[str, Any] | str,
) -> dict[str, tuple[BenchmarkingResult | None, Exception | None]]:
"""Launch benchmarking suite from a benchmarking config.
Args:
benchmarking_config: config or config path for the benchmarking tool. Specifies datasets and their
corresponding Ludwig configs, as well as export options.
"""
if isinstance(benchmarking_config, str):
benchmarking_config = load_yaml(benchmarking_config)
validate_benchmarking_config(benchmarking_config)
benchmarking_config = populate_benchmarking_config_with_defaults(benchmarking_config)
benchmarking_config = propagate_global_parameters(benchmarking_config)
experiment_artifacts = {}
for experiment_idx, experiment in enumerate(benchmarking_config["experiments"]):
dataset_name = experiment["dataset_name"]
try:
benchmark_one(experiment)
experiment_artifacts[dataset_name] = (build_benchmarking_result(benchmarking_config, experiment_idx), None)
except Exception as e:
logger.exception(
f"Experiment *{experiment['experiment_name']}* on dataset *{experiment['dataset_name']}* failed"
)
experiment_artifacts[dataset_name] = (None, e)
finally:
if benchmarking_config["export"]["export_artifacts"]:
export_base_path = benchmarking_config["export"]["export_base_path"]
export_artifacts(experiment, experiment["experiment_name"], export_base_path)
return experiment_artifacts
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script runs a ludwig experiment on datasets specified in the benchmark config and exports "
"the experiment artifact for each of the datasets following the export parameters specified in "
"the benchmarking config.",
prog="ludwig benchmark",
usage="%(prog)s [options]",
)
parser.add_argument("--benchmarking_config", type=str, help="The benchmarking config.")
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
benchmark(args.benchmarking_config)
================================================
FILE: ludwig/benchmarking/examples/benchmarking_config.yaml
================================================
experiment_name: example_benchmarking_run
hyperopt: false
process_config_file_path: /home/ray/process_config.py
profiler:
enable: true
use_torch_profiler: false
logging_interval: 0.1
export:
export_artifacts: true
export_base_path: s3://benchmarking.us-west-2.ludwig.com/bench/ # include the slash at the end.
experiments:
- dataset_name: ames_housing
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/ames_housing.yaml
- dataset_name: protein
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/protein.yaml
- dataset_name: mercedes_benz_greener
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/mercedes_benz_greener.yaml
- dataset_name: santander_customer_satisfaction
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/santander_customer_satisfaction.yaml
- dataset_name: connect4
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/connect4.yaml
- dataset_name: otto_group_product
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/otto_group_product.yaml
- dataset_name: bnp_claims_management
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/bnp_claims_management.yaml
- dataset_name: santander_customer_transaction
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/santander_customer_transaction.yaml
- dataset_name: allstate_claims_severity
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/allstate_claims_severity.yaml
- dataset_name: naval
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/naval.yaml
- dataset_name: sarcos
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/sarcos.yaml
- dataset_name: walmart_recruiting
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/walmart_recruiting.yaml
- dataset_name: numerai28pt6
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/numerai28pt6.yaml
- dataset_name: adult_census_income
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/adult_census_income.yaml
- dataset_name: amazon_employee_access_challenge
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/amazon_employee_access_challenge.yaml
- dataset_name: forest_cover
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/forest_cover.yaml
- dataset_name: mushroom_edibility
config_path: /home/ray/anaconda3/lib/python3.12/site-packages/ludwig/benchmarking/configs/mushroom_edibility.yaml
================================================
FILE: ludwig/benchmarking/examples/process_config.py
================================================
"""This function will take in a Ludwig config, strip away all its parameters except input and output features, and
add some other parameters to run logistic regression hyperopt."""
def process_config(ludwig_config: dict, experiment_dict: dict) -> dict:
"""Modify a Ludwig config by programmatically adding elements to the config dictionary.
The purpose is to apply changes for all datasets that are the same or are based on the
attributes of `experiment_dict` (e.g. dataset_name) removing the need to manually apply
small changes to configs on many datasets.
:param ludwig_config: a Ludwig config.
:param experiment_dict: a benchmarking config experiment dictionary.
Returns: a modified Ludwig config.
"""
# only keep input_features and output_features
main_config_keys = list(ludwig_config.keys())
for key in main_config_keys:
if key not in ["input_features", "output_features"]:
del ludwig_config[key]
temp = {
"preprocessing": {"split": {"type": "fixed"}},
"trainer": {"epochs": 1024, "early_stop": 7, "eval_batch_size": 16384, "evaluate_training_set": False},
"hyperopt": {
"goal": "maximize",
"output_feature": None,
"metric": None,
"split": "validation",
"parameters": {
"defaults.number.preprocessing.normalization": {"space": "choice", "categories": ["zscore", None]},
"defaults.number.preprocessing.missing_value_strategy": {
"space": "choice",
"categories": ["fill_with_const", "fill_with_mean"],
},
"combiner.type": {"space": "choice", "categories": ["tabnet", "concat"]},
"trainer.learning_rate_scheduler.decay": {"space": "choice", "categories": [True, False]},
"trainer.learning_rate": {"space": "loguniform", "lower": 0.0001, "upper": 0.1},
"trainer.learning_rate_scheduler.decay_rate": {"space": "uniform", "lower": 0.4, "upper": 0.96},
"trainer.batch_size": {"space": "randint", "lower": 32, "upper": 2048},
},
"search_alg": {"type": "variant_generator"},
"executor": {"type": "ray", "num_samples": 1000},
"scheduler": {"type": "bohb", "reduction_factor": 2},
},
}
# add config parameters from temp
for key, value in temp.items():
ludwig_config[key] = value
dataset_name_to_metric = {
"ames_housing": "r2",
"mercedes_benz_greener": "r2",
"mushroom_edibility": "accuracy",
"amazon_employee_access_challenge": "roc_auc",
"naval": "r2",
"sarcos": "r2",
"protein": "r2",
"adult_census_income": "accuracy",
"otto_group_product": "accuracy",
"santander_customer_satisfaction": "accuracy",
"amazon_employee_access": "roc_auc",
"numerai28pt6": "accuracy",
"bnp_claims_management": "accuracy",
"allstate_claims_severity": "r2",
"santander_customer_transaction": "accuracy",
"connect4": "accuracy",
"forest_cover": "accuracy",
"ieee_fraud": "accuracy",
"porto_seguro_safe_driver": "accuracy",
"walmart_recruiting": "accuracy",
"poker_hand": "accuracy",
"higgs": "accuracy",
}
# add hyperopt output feature and metric.
dataset_name = experiment_dict["dataset_name"]
ludwig_config["hyperopt"]["metric"] = dataset_name_to_metric[dataset_name]
ludwig_config["hyperopt"]["output_feature"] = ludwig_config["output_features"][0]["name"]
# use sparse encoder for categorical features to mimic logistic regression.
for i, feature in enumerate(ludwig_config["input_features"]):
if feature["type"] == "category":
ludwig_config["input_features"][i]["encoder"] = "sparse"
for i, feature in enumerate(ludwig_config["output_features"]):
if feature["type"] == "category":
ludwig_config["output_features"][i]["encoder"] = "sparse"
# make sure to return the ludwig_config
return ludwig_config
================================================
FILE: ludwig/benchmarking/profiler.py
================================================
import contextlib
import glob
import logging
import os
import shutil
import threading
import time
from queue import Empty as EmptyQueueException
from queue import Queue
from subprocess import PIPE, Popen
from typing import Any
from xml.etree.ElementTree import fromstring
import psutil
import torch
from cpuinfo import get_cpu_info
from gpustat.core import GPUStatCollection
from ludwig.benchmarking.profiler_dataclasses import profiler_dataclass_to_flat_dict, TorchProfilerMetrics
from ludwig.benchmarking.reporting import get_metrics_from_system_usage_profiler, get_metrics_from_torch_profiler
from ludwig.constants import LUDWIG_TAG
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.data_utils import save_json
STOP_MESSAGE = "stop"
logger = logging.getLogger()
def get_gpu_info():
"""Gathers general hardware information about an nvidia GPU.
This function was copied from `experiment_impact_tracker` to get around a Pandas 2.0 breaking change impacting the
package.
https://github.com/Breakend/experiment-impact-tracker/blob/master/experiment_impact_tracker/gpu/nvidia.py#L48-L73
"""
p = Popen(["nvidia-smi", "-q", "-x"], stdout=PIPE)
outs, errors = p.communicate()
xml = fromstring(outs)
data = []
driver_version = xml.findall("driver_version")[0].text
cuda_version = xml.findall("cuda_version")[0].text
for gpu_id, gpu in enumerate(xml.iter("gpu")):
gpu_data = {}
name = [x for x in gpu.iter("product_name")][0].text
memory_usage = gpu.findall("fb_memory_usage")[0]
total_memory = memory_usage.findall("total")[0].text
gpu_data["name"] = name
gpu_data["total_memory"] = total_memory
gpu_data["driver_version"] = driver_version
gpu_data["cuda_version"] = cuda_version
data.append(gpu_data)
return data
def monitor(queue: Queue, info: dict[str, Any], logging_interval: float, cuda_is_available: bool) -> None:
"""Monitors hardware resource use.
Collects system specific metrics (CPU/CUDA, CPU/CUDA memory) at a `logging_interval` interval and pushes
results back to the parent process.
Args:
queue: queue from which we can push and retrieve messages sent to the function targeted by the thread.
info: dictionary containing system resource usage information about the running process.
logging_interval: time interval at which we will poll the system for usage metrics.
cuda_is_available: stores torch.cuda.is_available().
"""
info["global_cpu_memory_available"] = [psutil.virtual_memory().available]
info["global_cpu_utilization"] = [psutil.cpu_percent()]
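# track the current process; the monitor runs in a thread of the process being profiled.
tracked_process = psutil.Process(os.getpid())
# prime cpu_percent: this call blocks for `logging_interval` seconds, so that later calls with the default
# `interval=None` report utilization since the previous call instead of a meaningless 0.
tracked_process.cpu_percent(interval=logging_interval)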
with tracked_process.oneshot():
info["cpu_utilization"] = [tracked_process.cpu_percent() / info["num_cpu"]]
info["cpu_memory_usage"] = [tracked_process.memory_full_info().uss]
try:
info["num_accessible_cpus"] = len(tracked_process.cpu_affinity())
except Exception:
pass
while True:
try:
message = queue.get(block=False)
if isinstance(message, str):
if message == STOP_MESSAGE:
# synchronize CUDA to get accurate timing for jobs running on GPU.
if cuda_is_available:
torch.cuda.synchronize()
queue.put(info)
return
else:
queue.put(message)
except EmptyQueueException:
pass
if cuda_is_available:
gpu_infos = GPUStatCollection.new_query()
for i, gpu_info in enumerate(gpu_infos):
gpu_key = f"cuda_{i}"
info[f"{gpu_key}_memory_used"].append(gpu_info.memory_used)
with tracked_process.oneshot():
info["cpu_utilization"].append(tracked_process.cpu_percent() / info["num_cpu"])
info["cpu_memory_usage"].append(tracked_process.memory_full_info().uss)
info["global_cpu_memory_available"].append(psutil.virtual_memory().available)
info["global_cpu_utilization"].append(psutil.cpu_percent())
time.sleep(logging_interval)
class LudwigProfiler(contextlib.ContextDecorator):
"""Track system resource (hardware and software) usage.
Warning: If `use_torch_profiler=True` while profiling on CUDA, it's not possible to benchmark DataLoaders
with `num_workers > 0` due to CUDA multiprocessing limitations. See warning under `profile` class
definition: https://github.com/pytorch/pytorch/blob/master/torch/autograd/profiler.py
Attributes:
tag: a string tag describing the code block/function that we're tracking.
(e.g trainer.train, preprocessing, etc.)
output_dir: path where metrics are saved.
logging_interval: time interval in seconds at which system is polled for resource usage.
"""
def __init__(self, tag: str, use_torch_profiler: bool, output_dir: str, logging_interval: float = 0.1) -> None:
self.tag = tag
self._tag = LUDWIG_TAG + self.tag
self.use_torch_profiler = use_torch_profiler
self.output_dir = output_dir
self.logging_interval = logging_interval
self.cuda_is_available = torch.cuda.is_available()
self.launched = False
if self.use_torch_profiler:
self.profiler_activities = [torch.profiler.ProfilerActivity.CPU]
if self.cuda_is_available:
self.profiler_activities.append(torch.profiler.ProfilerActivity.CUDA)
os.makedirs(os.path.join(self.output_dir), exist_ok=True)
def _init_tracker_info(self):
"""Initialize new self.info, self.torch_profiler, and self.torch_record_function instances.
Important to call this in __enter__ if the user decides not to create a new class instance and therefore
__init__ wouldn't be called.
"""
self.info = {"code_block_tag": self.tag}
if self.use_torch_profiler:
self.torch_profiler = torch.profiler.profile(activities=self.profiler_activities, profile_memory=True)
self.torch_record_function = torch.profiler.record_function(self._tag)
def _populate_static_information(self) -> None:
"""Populate the report with static software and hardware information."""
self.info["ludwig_version"] = LUDWIG_VERSION
self.info["start_disk_usage"] = shutil.disk_usage(os.path.expanduser("~")).used
# CPU information
cpu_info = get_cpu_info()
self.info["cpu_architecture"] = cpu_info["arch"]
self.info["num_cpu"] = psutil.cpu_count()
self.info["cpu_name"] = cpu_info.get("brand_raw", "unknown")
self.info["total_cpu_memory_size"] = psutil.virtual_memory().total
# GPU information
if self.cuda_is_available:
gpu_infos = get_gpu_info()
gpu_usage = GPUStatCollection.new_query()
for i, gpu_info in enumerate(gpu_infos):
gpu_key = f"cuda_{i}"
self.info[f"{gpu_key}_memory_used"] = [gpu_usage[i].memory_used]
self.info[f"{gpu_key}_name"] = gpu_info["name"]
self.info[f"{gpu_key}_total_memory"] = gpu_info["total_memory"]
self.info[f"{gpu_key}_driver_version"] = gpu_info["driver_version"]
self.info[f"{gpu_key}_cuda_version"] = gpu_info["cuda_version"]
# recording in microseconds to be in line with torch profiler time recording.
self.info["start_time"] = time.perf_counter_ns() / 1000
def __enter__(self):
"""Populate static information and monitors resource usage."""
if self.launched:
raise RuntimeError("LudwigProfiler already launched. You can't use the same instance.")
self._init_tracker_info()
self._populate_static_information()
if self.use_torch_profiler:
# contextlib.ExitStack gracefully handles situations where __enter__ or __exit__ calls throw exceptions.
with contextlib.ExitStack() as ctx_exit_stack:
try:
# Launch torch.profiler to track PyTorch operators.
ctx_exit_stack.enter_context(self.torch_profiler)
except RuntimeError:
# PyTorch profiler is already enabled on this thread.
# Using the running PyTorch profiler to track events.
self.torch_profiler = None
ctx_exit_stack.enter_context(self.torch_record_function)
self._ctx_exit_stack = ctx_exit_stack.pop_all()
try:
# Starting thread to monitor system resource usage.
self.queue = Queue()
self.t = threading.Thread(
target=monitor,
args=(
self.queue,
self.info,
self.logging_interval,
self.cuda_is_available,
),
)
self.t.start()
self.launched = True
except Exception:
self.launched = False
logger.exception("Encountered exception when launching tracker thread.")
return self
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
"""Stop profiling, postprocess and export resource usage metrics."""
try:
self.queue.put(STOP_MESSAGE)
self.t.join()
result = self.queue.get()
# If monitor thread crashed, result may be a string instead of dict
if isinstance(result, dict):
self.info = result
# recording in microseconds to be in line with torch profiler time recording.
self.info["end_time"] = time.perf_counter_ns() / 1000
self.info["end_disk_usage"] = shutil.disk_usage(os.path.expanduser("~")).used
self.launched = False
except Exception:
logger.exception("Encountered exception when joining tracker thread.")
finally:
if self.use_torch_profiler:
self._ctx_exit_stack.close()
self._export_torch_metrics()
self._export_system_usage_metrics()
def _export_system_usage_metrics(self):
"""Export system resource usage metrics (no torch operators)."""
system_usage_metrics = get_metrics_from_system_usage_profiler(self.info)
output_subdir = os.path.join(self.output_dir, "system_resource_usage", system_usage_metrics.code_block_tag)
os.makedirs(output_subdir, exist_ok=True)
num_prev_runs = len(glob.glob(os.path.join(output_subdir, "run_*.json")))
file_name = os.path.join(output_subdir, f"run_{num_prev_runs}.json")
save_json(file_name, profiler_dataclass_to_flat_dict(system_usage_metrics))
def _reformat_torch_usage_metrics_tags(
self, torch_usage_metrics: dict[str, Any]
) -> dict[str, list[TorchProfilerMetrics]]:
reformatted_dict = {}
for key, value in torch_usage_metrics.items():
assert key.startswith(LUDWIG_TAG)
reformatted_key = key[len(LUDWIG_TAG) :]
reformatted_dict[reformatted_key] = value
return reformatted_dict
def _export_torch_metrics(self):
"""Export resource usage metrics of torch operators."""
if self.torch_profiler:
torch_usage_metrics = get_metrics_from_torch_profiler(self.torch_profiler)
torch_usage_metrics = self._reformat_torch_usage_metrics_tags(torch_usage_metrics)
for tag, runs in torch_usage_metrics.items():
temp_dir = os.path.join(self.output_dir, "torch_ops_resource_usage", tag)
os.makedirs(temp_dir, exist_ok=True)
for run in runs:
num_prev_runs = len(glob.glob(os.path.join(temp_dir, "run_*.json")))
save_json(os.path.join(temp_dir, f"run_{num_prev_runs}.json"), profiler_dataclass_to_flat_dict(run))
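The start/stop handshake between `__enter__`, `__exit__`, and the monitor thread can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Ludwig's `monitor` implementation (which is not shown in this file): `MiniProfiler`, `sample_monitor`, and the `"stop"` sentinel value are hypothetical stand-ins.

```python
import threading
import time
from queue import Empty, Queue

STOP_MESSAGE = "stop"  # hypothetical sentinel value, chosen for this sketch


def sample_monitor(queue, info, interval):
    """Record a timestamp every `interval` seconds until the stop message arrives."""
    while True:
        try:
            message = queue.get(timeout=interval)
        except Empty:
            # No stop request yet: record one sample (in microseconds) and keep polling.
            info["samples"].append(time.perf_counter_ns() / 1000)
            continue
        if message == STOP_MESSAGE:
            # Hand the collected data back to the main thread and exit.
            queue.put(info)
            return


class MiniProfiler:
    """Hypothetical, stripped-down stand-in for the context-manager protocol above."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.info = {"samples": []}

    def __enter__(self):
        self.queue = Queue()
        self.thread = threading.Thread(target=sample_monitor, args=(self.queue, self.info, self.interval))
        self.thread.start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.queue.put(STOP_MESSAGE)
        self.thread.join()
        result = self.queue.get()
        # Only accept the result if the monitor returned a dictionary.
        if isinstance(result, dict):
            self.info = result


with MiniProfiler() as profiler:
    time.sleep(0.05)  # the profiled code block
```

Joining the thread before reading from the queue guarantees that the only item left in the queue is the monitor's reply, never the stop message itself.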
================================================
FILE: ludwig/benchmarking/profiler_callbacks.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.benchmarking.profiler import LudwigProfiler
from ludwig.callbacks import Callback
from ludwig.constants import EVALUATION, PREPROCESSING, TRAINING
# TODO: Change annotation to PublicAPI once Ludwig 0.7 is released
@DeveloperAPI
class LudwigProfilerCallback(Callback):
"""Callback that profiles the preprocessing, training, and evaluation phases with `LudwigProfiler`."""
def __init__(self, experiment: dict[str, Any]):
self.experiment_name = experiment["experiment_name"]
self.use_torch_profiler = experiment["profiler"]["use_torch_profiler"]
self.logging_interval = experiment["profiler"]["logging_interval"]
self.preprocess_profiler = None
self.train_profiler = None
self.evaluation_profiler = None
def on_preprocess_start(self, *args, **kwargs):
self.preprocess_profiler = LudwigProfiler(
tag=PREPROCESSING,
output_dir=self.experiment_name,
use_torch_profiler=self.use_torch_profiler,
logging_interval=self.logging_interval,
)
self.preprocess_profiler.__enter__()
def on_preprocess_end(self, *args, **kwargs):
self.preprocess_profiler.__exit__(None, None, None)
del self.preprocess_profiler
def on_train_start(self, *args, **kwargs):
self.train_profiler = LudwigProfiler(
tag=TRAINING,
output_dir=self.experiment_name,
use_torch_profiler=self.use_torch_profiler,
logging_interval=self.logging_interval,
)
self.train_profiler.__enter__()
def on_train_end(self, *args, **kwargs):
self.train_profiler.__exit__(None, None, None)
del self.train_profiler
def on_evaluation_start(self):
self.evaluation_profiler = LudwigProfiler(
tag=EVALUATION,
output_dir=self.experiment_name,
use_torch_profiler=self.use_torch_profiler,
logging_interval=self.logging_interval,
)
self.evaluation_profiler.__enter__()
def on_evaluation_end(self):
self.evaluation_profiler.__exit__(None, None, None)
del self.evaluation_profiler
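The callback above opens a profiler in each `*_start` hook and closes it in the matching `*_end` hook by calling `__enter__` and `__exit__` by hand, because the profiled region spans two separate method calls. A minimal sketch of that pattern follows; `PhaseTimer` and `TimingCallback` are hypothetical stand-ins that only measure wall-clock time, not Ludwig classes.

```python
import time


class PhaseTimer:
    """Hypothetical stand-in for a profiler: a context manager that times a block."""

    def __init__(self, tag):
        self.tag = tag
        self.elapsed_us = None

    def __enter__(self):
        self._start = time.perf_counter_ns()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Microseconds, mirroring the unit used by the profiler above.
        self.elapsed_us = (time.perf_counter_ns() - self._start) / 1000


class TimingCallback:
    """Opens a timer in the start hook and closes it in the matching end hook."""

    def __init__(self):
        self.results = {}
        self._active = {}

    def on_train_start(self, *args, **kwargs):
        timer = PhaseTimer("training")
        timer.__enter__()  # a `with` block cannot span two method calls
        self._active["training"] = timer

    def on_train_end(self, *args, **kwargs):
        timer = self._active.pop("training")
        timer.__exit__(None, None, None)
        self.results["training"] = timer.elapsed_us


callback = TimingCallback()
callback.on_train_start()
time.sleep(0.01)  # stands in for the training loop
callback.on_train_end()
```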
================================================
FILE: ludwig/benchmarking/profiler_dataclasses.py
================================================
import dataclasses
from dataclasses import dataclass
from ludwig.utils.data_utils import flatten_dict
@dataclass
class DeviceUsageMetrics:
# Max CUDA memory utilization of the code block.
max_memory_used: float
# Average CUDA memory utilization of the code block.
average_memory_used: float
@dataclass
class SystemResourceMetrics:
# Name of the code block/function to be profiled.
code_block_tag: str
# Name of the CPU that the code ran on.
cpu_name: str
# CPU architecture that the code ran on.
cpu_architecture: str
# Number of CPUs on the machine.
num_cpu: int
# Total CPU memory size.
total_cpu_memory_size: float
# Ludwig version in the environment.
ludwig_version: str
# Total execution time of the code block.
total_execution_time: float
# The change in disk usage before and after the code block ran.
disk_footprint: float
# Max CPU utilization of the code block.
max_cpu_utilization: float
# Max CPU memory (RAM) utilization of the code block.
max_cpu_memory_usage: float
# Min system-wide CPU memory available (how much physical memory is left).
min_global_cpu_memory_available: float
# Max system-wide CPU utilization.
max_global_cpu_utilization: float
# Average CPU utilization of the code block.
average_cpu_utilization: float
# Average CPU memory (RAM) utilization of the code block.
average_cpu_memory_usage: float
# Average system-wide CPU memory available (how much physical memory is left).
average_global_cpu_memory_available: float
# Average system-wide CPU utilization.
average_global_cpu_utilization: float
# Per device usage. Dictionary containing max and average memory used per device.
device_usage: dict[str, DeviceUsageMetrics]
@dataclass
class TorchProfilerMetrics:
# Time taken by torch ops to execute on the CPU.
torch_cpu_time: float
# Time taken by torch ops to execute on CUDA devices.
torch_cuda_time: float
# Number of out of memory events.
num_oom_events: int
# Per device usage by torch ops. Dictionary containing max and average memory used per device.
device_usage: dict[str, DeviceUsageMetrics]
def profiler_dataclass_to_flat_dict(data: SystemResourceMetrics | TorchProfilerMetrics) -> dict:
"""Returns a flat dictionary representation, with the device_usage key removed."""
nested_dict = dataclasses.asdict(data)
nested_dict[""] = nested_dict.pop("device_usage")
return flatten_dict(nested_dict, sep="")
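The effect of renaming `device_usage` to the empty key and flattening with an empty separator can be illustrated as follows. This is a sketch, assuming `flatten_dict` joins nested keys with `sep` in the usual recursive way; the `flatten` helper and the two small dataclasses below are hypothetical stand-ins, not the Ludwig implementations.

```python
from dataclasses import asdict, dataclass


@dataclass
class DeviceUsage:
    max_memory_used: float
    average_memory_used: float


@dataclass
class Metrics:
    torch_cpu_time: float
    device_usage: dict


def flatten(nested, parent_key="", sep=""):
    """Hypothetical stand-in for `flatten_dict`: join nested keys with `sep`."""
    flat = {}
    for key, value in nested.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


metrics = Metrics(
    torch_cpu_time=1.5,
    device_usage={"torch_cuda_0_": DeviceUsage(max_memory_used=10, average_memory_used=5.0)},
)

nested = asdict(metrics)  # recurses into the dataclass values of the dict
nested[""] = nested.pop("device_usage")  # empty key, so no "device_usage" prefix survives
flat = flatten(nested, sep="")
# flat == {"torch_cpu_time": 1.5,
#          "torch_cuda_0_max_memory_used": 10,
#          "torch_cuda_0_average_memory_used": 5.0}
```

Because the device keys already end in an underscore, the empty separator yields readable names such as `torch_cuda_0_max_memory_used`.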
================================================
FILE: ludwig/benchmarking/reporting.py
================================================
from collections import Counter, defaultdict
from statistics import mean
from typing import Any
import torch
from torch._C._autograd import _KinetoEvent
from torch.autograd import DeviceType, profiler_util
from ludwig.benchmarking.profiler_dataclasses import DeviceUsageMetrics, SystemResourceMetrics, TorchProfilerMetrics
from ludwig.constants import LUDWIG_TAG
def initialize_stats_dict(main_function_events: list[profiler_util.FunctionEvent]) -> dict[str, list]:
"""Initialize dictionary which stores resource usage information per tagged code block.
:param main_function_events: list of main function events.
"""
info = {}
for event_name in [evt.name for evt in main_function_events]:
info[event_name] = []
return info
def get_memory_details(kineto_event: _KinetoEvent) -> tuple[str, int]:
"""Get device name and number of bytes (de)allocated during an event.
:param kineto_event: a Kineto event instance.
"""
if kineto_event.device_type() in [DeviceType.CPU, DeviceType.MKLDNN, DeviceType.IDEEP]:
return "cpu", kineto_event.nbytes()
elif kineto_event.device_type() in [DeviceType.CUDA, DeviceType.HIP]:
return f"cuda_{kineto_event.device_index()}", kineto_event.nbytes()
else:
raise ValueError(f"Device {kineto_event.device_type()} is not valid.")
def get_device_memory_usage(
kineto_event: _KinetoEvent, memory_events: list[list[_KinetoEvent | bool]]
) -> dict[str, DeviceUsageMetrics]:
"""Get CPU and CUDA memory usage for an event.
:param kineto_event: a Kineto event instance.
:param memory_events: list of memory events.
"""
mem_records_acc = profiler_util.MemRecordsAcc(memory_events)
start_us = kineto_event.start_ns() / 1000
end_us = start_us + kineto_event.duration_ns() / 1000
records_in_interval = mem_records_acc.in_interval(start_us, end_us)
memory_so_far = defaultdict(int)
count_so_far = defaultdict(int)
average_so_far = defaultdict(float)
max_so_far = defaultdict(int)
for mem_record in records_in_interval:
device, nbytes = get_memory_details(mem_record[0])
memory_so_far[device] += nbytes
max_so_far[device] = max(max_so_far[device], memory_so_far[device])
average_so_far[device] = (memory_so_far[device] + (average_so_far[device] * count_so_far[device])) / (
count_so_far[device] + 1
)
count_so_far[device] += 1
memory_info_per_device = {}
for device in count_so_far:
memory_info_per_device[f"torch_{device}_"] = DeviceUsageMetrics(
max_memory_used=max_so_far[device], average_memory_used=average_so_far[device]
)
return memory_info_per_device
def get_torch_op_time(events: list[profiler_util.FunctionEvent], attr: str) -> int | float:
"""Get time torch operators spent executing for a list of events.
:param events: list of events.
:param attr: a FunctionEvent attribute. Expecting one of "cpu_time_total", "device_time_total".
"""
if attr not in ["cpu_time_total", "device_time_total"]:
return -1
total = 0
for e in events:
# Possible trace_names are torch ops, or tagged code blocks by LudwigProfiler (which are
# prepended with LUDWIG_TAG).
if LUDWIG_TAG not in e.trace_name:
total += getattr(e, attr)
else:
total += get_torch_op_time(e.cpu_children, attr)
return total
def get_device_run_durations(function_event: profiler_util.FunctionEvent) -> tuple[float, float]:
"""Get CPU and device run durations for an event.
:param function_event: a function event instance.
"""
torch_cpu_time = get_torch_op_time(function_event.cpu_children, "cpu_time_total")
torch_device_time = get_torch_op_time(function_event.cpu_children, "device_time_total")
return torch_cpu_time, torch_device_time
def get_num_oom_events(kineto_event: _KinetoEvent, out_of_memory_events: list[list[_KinetoEvent | bool]]) -> int:
oom_records_acc = profiler_util.MemRecordsAcc(out_of_memory_events)
start_us = kineto_event.start_ns() / 1000
end_us = start_us + kineto_event.duration_ns() / 1000
records_in_interval = oom_records_acc.in_interval(start_us, end_us)
return len(list(records_in_interval))
def get_resource_usage_report(
main_kineto_events: list[_KinetoEvent],
main_function_events: list[profiler_util.FunctionEvent],
memory_events: list[list[_KinetoEvent | bool]],
out_of_memory_events: list[list[_KinetoEvent | bool]],
info: dict[str, Any],
) -> dict[str, list[TorchProfilerMetrics]]:
"""Get relevant information from Kineto events and function events exported by the profiler.
:param main_kineto_events: list of main Kineto events.
:param main_function_events: list of main function events.
:param memory_events: list of memory events.
:param out_of_memory_events: list of out of memory events.
:param info: dictionary used to record resource usage metrics.
"""
main_kineto_events = sorted(
(evt for evt in main_kineto_events if LUDWIG_TAG in evt.name()), key=lambda x: x.correlation_id()
)
main_function_events = sorted((evt for evt in main_function_events if LUDWIG_TAG in evt.name), key=lambda x: x.id)
for kineto_event, function_event in zip(main_kineto_events, main_function_events):
# Two different instances of `function_event` can have the same name if the same
# tagged code block/function was executed more than once.
memory_info_per_device = get_device_memory_usage(kineto_event, memory_events)
torch_cpu_time, torch_cuda_time = get_device_run_durations(function_event)
num_oom_events = get_num_oom_events(kineto_event, out_of_memory_events)
torch_profiler_metrics = TorchProfilerMetrics(
torch_cpu_time=torch_cpu_time,
torch_cuda_time=torch_cuda_time,
num_oom_events=num_oom_events,
device_usage=memory_info_per_device,
)
info[function_event.name].append(torch_profiler_metrics)
return info
def get_all_events(kineto_events: list[_KinetoEvent], function_events: profiler_util.EventList) -> tuple[
list[_KinetoEvent],
list[profiler_util.FunctionEvent],
list[list[_KinetoEvent | bool]],
list[list[_KinetoEvent | bool]],
]:
"""Return main Kineto and function events, memory and OOM events for functions/code blocks tagged in
LudwigProfiler.
:param kineto_events: list of Kineto Events.
:param function_events: list of function events.
"""
# LUDWIG_TAG is prepended to LudwigProfiler tags. This edited tag is passed in to `torch.profiler.record_function`
# so we can easily retrieve events for code blocks wrapped with LudwigProfiler.
main_function_events = [evt for evt in function_events if LUDWIG_TAG in evt.name]
main_kineto_events = [event for event in kineto_events if LUDWIG_TAG in event.name()]
memory_events = [[event, False] for event in kineto_events if profiler_util.MEMORY_EVENT_NAME in event.name()]
# profiler_util.OUT_OF_MEMORY_EVENT_NAME seems to only be in newer versions of torch.
out_of_memory_events = [[event, False] for event in kineto_events if "[OutOfMemory]" in event.name()]
return main_kineto_events, main_function_events, memory_events, out_of_memory_events
def get_metrics_from_torch_profiler(profile: torch.profiler.profiler.profile) -> dict[str, list[TorchProfilerMetrics]]:
"""Export time and resource usage metrics (CPU and CUDA) from a PyTorch profiler.
The profiler keeps track of *torch operations* being executed in C++. It keeps track
of what device they're executed on, their execution time, and memory usage.
We only track the aforementioned metrics, but the torch profiler can keep track of
the stack trace, FLOPs, and torch modules. Tracking each additional item adds overhead.
The torch profiler surfaces these metrics that are tracked under the hood by `libkineto`.
More on the Kineto project: https://github.com/pytorch/kineto
:param profile: profiler object that contains all the events that
were registered during the execution of the wrapped code block.
"""
# events in both of these lists are in chronological order.
kineto_events = profile.profiler.kineto_results.events()
function_events = profile.profiler.function_events
main_kineto_events, main_function_events, memory_events, out_of_memory_events = get_all_events(
kineto_events, function_events
)
assert Counter([event.name for event in main_function_events]) == Counter(
[event.name() for event in main_kineto_events]
)
info = initialize_stats_dict(main_function_events)
info = get_resource_usage_report(
main_kineto_events, main_function_events, memory_events, out_of_memory_events, info
)
return info
def get_metrics_from_system_usage_profiler(system_usage_info: dict) -> SystemResourceMetrics:
"""Package system resource usage metrics (no torch operators) in a dataclass.
:param system_usage_info: dictionary containing resource usage information.
"""
device_usage_dict: dict[str, DeviceUsageMetrics] = {}
for key in system_usage_info:
if "cuda_" in key and "_memory_used" in key:
cuda_device_name = "_".join(key.split("_")[:2]) + "_"
max_memory_used = max(system_usage_info[key], default=0)
average_memory_used = mean(system_usage_info.get(key, [0]))
device_usage_dict[cuda_device_name] = DeviceUsageMetrics(
max_memory_used=max_memory_used, average_memory_used=average_memory_used
)
return SystemResourceMetrics(
code_block_tag=system_usage_info["code_block_tag"],
cpu_name=system_usage_info.get("cpu_name", "unknown"),
cpu_architecture=system_usage_info["cpu_architecture"],
num_cpu=system_usage_info["num_cpu"],
total_cpu_memory_size=system_usage_info["total_cpu_memory_size"],
ludwig_version=system_usage_info["ludwig_version"],
total_execution_time=system_usage_info["end_time"] - system_usage_info["start_time"],
disk_footprint=system_usage_info["end_disk_usage"] - system_usage_info["start_disk_usage"],
max_cpu_utilization=max(system_usage_info["cpu_utilization"], default=0),
max_cpu_memory_usage=max(system_usage_info["cpu_memory_usage"], default=0),
min_global_cpu_memory_available=min(system_usage_info["global_cpu_memory_available"], default=0),
max_global_cpu_utilization=max(system_usage_info["global_cpu_utilization"], default=0),
average_cpu_utilization=mean(system_usage_info.get("cpu_utilization", [0])),
average_cpu_memory_usage=mean(system_usage_info.get("cpu_memory_usage", [0])),
average_global_cpu_memory_available=mean(system_usage_info.get("global_cpu_memory_available", [0])),
average_global_cpu_utilization=mean(system_usage_info.get("global_cpu_utilization", [0])),
device_usage=device_usage_dict,
)
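The running statistics kept by `get_device_memory_usage` (cumulative bytes, peak, and an incrementally updated mean) can be checked on plain numbers. The sketch below reuses the same update rule on `(device, nbytes)` pairs; `summarize_memory` and the sample records are hypothetical and stand in for the Kineto memory events.

```python
from collections import defaultdict


def summarize_memory(records):
    """Return {device: (peak_bytes, mean_bytes)} for a stream of (de)allocations."""
    memory_so_far = defaultdict(int)
    count_so_far = defaultdict(int)
    average_so_far = defaultdict(float)
    max_so_far = defaultdict(int)
    for device, nbytes in records:
        # Positive nbytes is an allocation, negative nbytes a deallocation.
        memory_so_far[device] += nbytes
        max_so_far[device] = max(max_so_far[device], memory_so_far[device])
        # Incremental mean of the cumulative usage observed so far.
        average_so_far[device] = (memory_so_far[device] + average_so_far[device] * count_so_far[device]) / (
            count_so_far[device] + 1
        )
        count_so_far[device] += 1
    return {device: (max_so_far[device], average_so_far[device]) for device in count_so_far}


records = [("cuda_0", 100), ("cuda_0", 50), ("cuda_0", -120), ("cuda_0", 30), ("cpu", 8)]
usage = summarize_memory(records)
# cuda_0 cumulative usage is 100, 150, 30, 60: peak 150, mean 85 (up to float rounding).
```

Note that the peak starts at zero, so a block that only frees memory reports a peak of 0 rather than a negative value.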
================================================
FILE: ludwig/benchmarking/summarize.py
================================================
import argparse
import logging
import os
import shutil
from ludwig.benchmarking.summary_dataclasses import (
build_metrics_diff,
build_resource_usage_diff,
export_metrics_diff_to_csv,
export_resource_usage_diff_to_csv,
MetricsDiff,
ResourceUsageDiff,
)
from ludwig.benchmarking.utils import download_artifacts
logger = logging.getLogger()
def summarize_metrics(
bench_config_path: str, base_experiment: str, experimental_experiment: str, download_base_path: str
) -> tuple[list[str], list[MetricsDiff], list[list[ResourceUsageDiff]]]:
"""Build metric and resource usage diffs from experiment artifacts.
:param bench_config_path: bench config file path. Can be the same one that was used to run
these experiments.
:param base_experiment: name of the experiment we're comparing against.
:param experimental_experiment: name of the experiment we're comparing.
:param download_base_path: base path under which the stored artifacts of
the benchmarking experiments live.
"""
local_dir, dataset_list = download_artifacts(
bench_config_path, base_experiment, experimental_experiment, download_base_path
)
metric_diffs, resource_usage_diffs = [], []
for dataset_name in dataset_list:
try:
metric_diff = build_metrics_diff(dataset_name, base_experiment, experimental_experiment, local_dir)
metric_diffs.append(metric_diff)
base_path = os.path.join(local_dir, dataset_name, base_experiment)
experimental_path = os.path.join(local_dir, dataset_name, experimental_experiment)
resource_usage_diff = build_resource_usage_diff(
base_path, experimental_path, base_experiment, experimental_experiment
)
resource_usage_diffs.append(resource_usage_diff)
except Exception:
logger.exception(f"Exception encountered while creating diff summary for {dataset_name}.")
shutil.rmtree(local_dir, ignore_errors=True)
export_and_print(dataset_list, metric_diffs, resource_usage_diffs)
return dataset_list, metric_diffs, resource_usage_diffs
def export_and_print(
dataset_list: list[str], metric_diffs: list[MetricsDiff], resource_usage_diffs: list[list[ResourceUsageDiff]]
) -> None:
"""Export to CSV and print a diff of performance and resource usage metrics of two experiments.
:param dataset_list: list of datasets for which to print the diffs.
:param metric_diffs: Diffs for the performance metrics by dataset.
:param resource_usage_diffs: Diffs for the resource usage metrics per dataset per LudwigProfiler tag.
"""
for dataset_name, experiment_metric_diff in zip(dataset_list, metric_diffs):
output_path = os.path.join("summarize_output", "performance_metrics", dataset_name)
os.makedirs(output_path, exist_ok=True)
logger.info(
"Model performance metrics for *{}* vs. *{}* on dataset *{}*".format(
experiment_metric_diff.base_experiment_name,
experiment_metric_diff.experimental_experiment_name,
experiment_metric_diff.dataset_name,
)
)
logger.info(experiment_metric_diff.to_string())
filename = (
"-".join([experiment_metric_diff.base_experiment_name, experiment_metric_diff.experimental_experiment_name])
+ ".csv"
)
export_metrics_diff_to_csv(experiment_metric_diff, os.path.join(output_path, filename))
for dataset_name, experiment_resource_diff in zip(dataset_list, resource_usage_diffs):
output_path = os.path.join("summarize_output", "resource_usage_metrics", dataset_name)
os.makedirs(output_path, exist_ok=True)
for tag_diff in experiment_resource_diff:
logger.info(
"Resource usage for *{}* vs. *{}* on *{}* of dataset *{}*".format(
tag_diff.base_experiment_name,
tag_diff.experimental_experiment_name,
tag_diff.code_block_tag,
dataset_name,
)
)
logger.info(tag_diff.to_string())
filename = (
"-".join(
[tag_diff.code_block_tag, tag_diff.base_experiment_name, tag_diff.experimental_experiment_name]
)
+ ".csv"
)
export_resource_usage_diff_to_csv(tag_diff, os.path.join(output_path, filename))
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Summarize the model performance metrics and resource usage metrics of two experiments.",
prog="python summarize.py",
usage="%(prog)s [options]",
)
parser.add_argument("--benchmarking_config", type=str, help="The benchmarking config.")
parser.add_argument("--base_experiment", type=str, help="The name of the first experiment.")
parser.add_argument("--experimental_experiment", type=str, help="The name of the second experiment.")
parser.add_argument("--download_base_path", type=str, help="The base path to download experiment artifacts from.")
args = parser.parse_args()
summarize_metrics(
args.benchmarking_config, args.base_experiment, args.experimental_experiment, args.download_base_path
)
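For reference, the command-line interface above takes four string flags. The sketch below parses a hypothetical invocation and builds the CSV file name the same way `export_and_print` does; the config path, experiment names, and bucket URL are made-up values for illustration only.

```python
import argparse

parser = argparse.ArgumentParser(prog="python summarize.py")
for flag in ("--benchmarking_config", "--base_experiment", "--experimental_experiment", "--download_base_path"):
    parser.add_argument(flag, type=str)

# Hypothetical argument values; pass a list so the sketch does not read sys.argv.
args = parser.parse_args(
    [
        "--benchmarking_config", "configs/bench.yaml",
        "--base_experiment", "baseline",
        "--experimental_experiment", "candidate",
        "--download_base_path", "s3://example-bucket/benchmarks",
    ]
)

# Performance-metric reports are named "<base>-<experimental>.csv".
filename = "-".join([args.base_experiment, args.experimental_experiment]) + ".csv"
```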
================================================
FILE: ludwig/benchmarking/summary_dataclasses.py
================================================
import csv
import logging
import os
from dataclasses import dataclass
from statistics import mean
import ludwig.modules.metric_modules # noqa: F401
from ludwig.benchmarking.utils import format_memory, format_time
from ludwig.globals import MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME
from ludwig.modules.metric_registry import get_metric_classes, metric_feature_type_registry # noqa: F401
from ludwig.types import ModelConfigDict
from ludwig.utils.data_utils import load_json
logger = logging.getLogger()
@dataclass
class MetricDiff:
"""Diffs for a metric."""
# Name of the metric.
name: str
# Value of the metric in base experiment (the one we benchmark against).
base_value: float
# Value of the metric in the experimental experiment.
experimental_value: float
# experimental_value - base_value.
diff: float
# Percentage change of the metric with respect to base_value.
diff_percentage: float | str
def __post_init__(self):
"""Add human-readable string representations of the value fields."""
if "memory" in self.name:
self.base_value_str = format_memory(self.base_value)
self.experimental_value_str = format_memory(self.experimental_value)
self.diff_str = format_memory(self.diff)
elif "time" in self.name:
self.base_value_str = format_time(self.base_value)
self.experimental_value_str = format_time(self.experimental_value)
self.diff_str = format_time(self.diff)
else:
self.base_value_str = str(self.base_value)
self.experimental_value_str = str(self.experimental_value)
self.diff_str = str(self.diff)
def build_diff(name: str, base_value: float, experimental_value: float) -> MetricDiff:
"""Build a diff between any type of metric.
:param name: name assigned to the metric to be diff-ed.
:param base_value: base value of the metric.
:param experimental_value: experimental value of the metric.
"""
diff = experimental_value - base_value
diff_percentage = 100 * diff / base_value if base_value != 0 else "inf"
return MetricDiff(
name=name,
base_value=base_value,
experimental_value=experimental_value,
diff=diff,
diff_percentage=diff_percentage,
)
#######################
# Metrics Dataclasses #
#######################
@dataclass
class MetricsSummary:
"""Summary of metrics from one experiment."""
# Path containing the artifacts for the experiment.
experiment_local_directory: str
# Full Ludwig config.
config: ModelConfigDict
# LudwigModel output feature type.
output_feature_type: str
# LudwigModel output feature name.
output_feature_name: str
# Dictionary that maps from metric name to their values.
metric_to_values: dict[str, float | int]
# Names of metrics for the output feature.
metric_names: set[str]
@dataclass
class MetricsDiff:
"""Store diffs for two experiments."""
# Dataset the two experiments are being compared on.
dataset_name: str
# Name of the base experiment (the one we benchmark against).
base_experiment_name: str
# Name of the experimental experiment.
experimental_experiment_name: str
# Path under which all artifacts live on the local machine.
local_directory: str
# `MetricsSummary` of the base_experiment.
base_summary: MetricsSummary
# `MetricsSummary` of the experimental_experiment.
experimental_summary: MetricsSummary
# `list[MetricDiff]` containing the per-metric diffs between the two experiments.
metrics: list[MetricDiff]
def to_string(self):
ret = []
spacing_str = "{:<20} {:<33} {:<13} {:<13} {:<13} {:<5}"
ret.append(
spacing_str.format(
"Output Feature Name",
"Metric Name",
self.base_experiment_name,
self.experimental_experiment_name,
"Diff",
"Diff Percentage",
)
)
for metric in sorted(self.metrics, key=lambda m: m.name):
output_feature_name = self.base_summary.output_feature_name
metric_name = metric.name
experiment1_val = round(metric.base_value, 3)
experiment2_val = round(metric.experimental_value, 3)
diff = round(metric.diff, 3)
diff_percentage = metric.diff_percentage
if isinstance(diff_percentage, float):
diff_percentage = round(metric.diff_percentage, 3)
ret.append(
spacing_str.format(
output_feature_name,
metric_name,
experiment1_val,
experiment2_val,
diff,
diff_percentage,
)
)
return "\n".join(ret)
def export_metrics_diff_to_csv(metrics_diff: MetricsDiff, path: str):
"""Export metrics report to .csv.
:param metrics_diff: MetricsDiff object containing the diff for two experiments on a dataset.
:param path: file name of the exported csv.
"""
with open(path, "w", newline="") as f:
writer = csv.DictWriter(
f,
fieldnames=[
"Dataset Name",
"Output Feature Name",
"Metric Name",
metrics_diff.base_experiment_name,
metrics_diff.experimental_experiment_name,
"Diff",
"Diff Percentage",
],
)
writer.writeheader()
for metric in sorted(metrics_diff.metrics, key=lambda m: m.name):
output_feature_name = metrics_diff.base_summary.output_feature_name
metric_name = metric.name
experiment1_val = round(metric.base_value, 3)
experiment2_val = round(metric.experimental_value, 3)
diff = round(metric.diff, 3)
diff_percentage = metric.diff_percentage
if isinstance(diff_percentage, float):
diff_percentage = round(metric.diff_percentage, 3)
writer.writerow(
{
"Dataset Name": metrics_diff.dataset_name,
"Output Feature Name": output_feature_name,
"Metric Name": metric_name,
metrics_diff.base_experiment_name: experiment1_val,
metrics_diff.experimental_experiment_name: experiment2_val,
"Diff": diff,
"Diff Percentage": diff_percentage,
}
)
logger.info(f"Exported a CSV report to {path}\n")
def build_metrics_summary(experiment_local_directory: str) -> MetricsSummary:
"""Build a metrics summary for an experiment.
:param experiment_local_directory: directory where the experiment artifacts live.
e.g. local_experiment_repo/ames_housing/some_experiment/
"""
config = load_json(
os.path.join(experiment_local_directory, "experiment_run", MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME)
)
report = load_json(os.path.join(experiment_local_directory, "experiment_run", "test_statistics.json"))
output_feature_type: str = config["output_features"][0]["type"]
output_feature_name: str = config["output_features"][0]["name"]
metric_dict = report[output_feature_name]
full_metric_names = get_metric_classes(output_feature_type)
metric_to_values: dict[str, float | int] = {
metric_name: metric_dict[metric_name] for metric_name in full_metric_names if metric_name in metric_dict
}
metric_names: set[str] = set(metric_to_values)
return MetricsSummary(
experiment_local_directory=experiment_local_directory,
config=config,
output_feature_name=output_feature_name,
output_feature_type=output_feature_type,
metric_to_values=metric_to_values,
metric_names=metric_names,
)
def build_metrics_diff(
dataset_name: str, base_experiment_name: str, experimental_experiment_name: str, local_directory: str
) -> MetricsDiff:
"""Build a MetricsDiff object between two experiments on a dataset.
:param dataset_name: the name of the Ludwig dataset.
:param base_experiment_name: the name of the base experiment.
:param experimental_experiment_name: the name of the experimental experiment.
:param local_directory: the local directory where the experiment artifacts are downloaded.
"""
base_summary: MetricsSummary = build_metrics_summary(
os.path.join(local_directory, dataset_name, base_experiment_name)
)
experimental_summary: MetricsSummary = build_metrics_summary(
os.path.join(local_directory, dataset_name, experimental_experiment_name)
)
metrics_in_common = set(base_summary.metric_names).intersection(set(experimental_summary.metric_names))
metrics: list[MetricDiff] = [
build_diff(name, base_summary.metric_to_values[name], experimental_summary.metric_to_values[name])
for name in metrics_in_common
]
return MetricsDiff(
dataset_name=dataset_name,
base_experiment_name=base_experiment_name,
experimental_experiment_name=experimental_experiment_name,
local_directory=local_directory,
base_summary=base_summary,
experimental_summary=experimental_summary,
metrics=metrics,
)
##############################
# Resource Usage Dataclasses #
##############################
@dataclass
class ResourceUsageSummary:
"""Summary of resource usage metrics from one experiment."""
# The tag with which the code block/function is labeled.
code_block_tag: str
# Dictionary that maps from metric name to their values.
metric_to_values: dict[str, float | int]
# Names of the resource usage metrics.
metric_names: set[str]
@dataclass
class ResourceUsageDiff:
"""Store resource usage diffs for two experiments."""
# The tag with which the code block/function is labeled.
code_block_tag: str
# Name of the base experiment (the one we benchmark against).
base_experiment_name: str
# Name of the experimental experiment.
experimental_experiment_name: str
# `list[MetricDiff]` containing the per-metric diffs between the two experiments.
metrics: list[MetricDiff]
def to_string(self):
ret = []
spacing_str = "{:<36} {:<20} {:<20} {:<20} {:<5}"
ret.append(
spacing_str.format(
"Metric Name",
self.base_experiment_name,
self.experimental_experiment_name,
"Diff",
"Diff Percentage",
)
)
for metric in sorted(self.metrics, key=lambda m: m.name):
diff_percentage = metric.diff_percentage
if isinstance(metric.diff_percentage, float):
diff_percentage = round(metric.diff_percentage, 3)
ret.append(
spacing_str.format(
metric.name,
metric.base_value_str,
metric.experimental_value_str,
metric.diff_str,
diff_percentage,
)
)
return "\n".join(ret)
def export_resource_usage_diff_to_csv(resource_usage_diff: ResourceUsageDiff, path: str):
"""Export resource usage metrics report to .csv.
:param resource_usage_diff: ResourceUsageDiff object containing the diff for two experiments on a dataset.
:param path: file name of the exported csv.
"""
with open(path, "w", newline="") as f:
writer = csv.DictWriter(
f,
fieldnames=[
"Code Block Tag",
"Metric Name",
resource_usage_diff.base_experiment_name,
resource_usage_diff.experimental_experiment_name,
"Diff",
"Diff Percentage",
],
)
writer.writeheader()
for metric in sorted(resource_usage_diff.metrics, key=lambda m: m.name):
diff_percentage = metric.diff_percentage
if isinstance(metric.diff_percentage, float):
diff_percentage = round(metric.diff_percentage, 3)
writer.writerow(
{
"Code Block Tag": resource_usage_diff.code_block_tag,
"Metric Name": metric.name,
resource_usage_diff.base_experiment_name: metric.base_value_str,
resource_usage_diff.experimental_experiment_name: metric.experimental_value_str,
"Diff": metric.diff_str,
"Diff Percentage": diff_percentage,
}
)
logger.info(f"Exported a CSV report to {path}\n")
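# Illustrative sketch (not part of Ludwig): writing a report whose column names come from runtime
# values, as `export_resource_usage_diff_to_csv` does with the two experiment names.
# `write_diff_csv` and the sample rows are hypothetical.

```python
import csv
import io


def write_diff_csv(stream, base_name: str, experimental_name: str, rows: list[tuple]) -> None:
    """Write (metric, base value, experimental value, diff) rows sorted by metric name."""
    fieldnames = ["Metric Name", base_name, experimental_name, "Diff"]
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for name, base_value, experimental_value, diff in sorted(rows):
        writer.writerow(
            {
                "Metric Name": name,
                base_name: base_value,
                experimental_name: experimental_value,
                "Diff": diff,
            }
        )


# newline="" leaves line endings to the csv module, matching open(path, "w", newline="").
buffer = io.StringIO(newline="")
write_diff_csv(buffer, "v1", "v2", [("time", "10s", "8s", "-2s"), ("memory", "1 Gb", "2 Gb", "1 Gb")])
csv_text = buffer.getvalue()
```

# Because the experiment names are used as dictionary keys, passing the same name for both
# experiments would make the second value overwrite the first in every row.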
def average_runs(path_to_runs_dir: str) -> dict[str, int | float]:
"""Return average metrics from code blocks/function that ran more than once.
Metrics for code blocks/functions that were executed exactly once will be returned as is.
:param path_to_runs_dir: path to where metrics specific to a tag are stored.
e.g. resource_usage_out_dir/torch_ops_resource_usage/LudwigModel.evaluate/
This directory will contain JSON files with the following pattern run_*.json
"""
runs = [load_json(os.path.join(path_to_runs_dir, run)) for run in os.listdir(path_to_runs_dir)]
# asserting that keys to each of the dictionaries are consistent throughout the runs.
assert len(runs) == 1 or all(runs[i].keys() == runs[i + 1].keys() for i in range(len(runs) - 1))
runs_average = {"num_runs": len(runs)}
for key in runs[0]:
if isinstance(runs[0][key], (int, float)):
runs_average[key] = mean([run[key] for run in runs])
return runs_average
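# Illustrative sketch (not part of Ludwig): averaging numeric metrics over per-run JSON files, in
# the spirit of `average_runs`. It uses only the standard library; the metric names and the
# `average_runs_sketch` helper are assumptions made for the example.

```python
import json
import os
import tempfile
from statistics import mean


def average_runs_sketch(path_to_runs_dir: str) -> dict:
    """Average the numeric metrics found in every JSON file of a directory."""
    runs = []
    for file_name in sorted(os.listdir(path_to_runs_dir)):
        with open(os.path.join(path_to_runs_dir, file_name)) as f:
            runs.append(json.load(f))
    averaged = {"num_runs": len(runs)}
    for key, value in runs[0].items():
        # Non-numeric entries (such as a tag string) are skipped.
        if isinstance(value, (int, float)):
            averaged[key] = mean(run[key] for run in runs)
    return averaged


def demo_average_runs() -> dict:
    """Write two fake runs to a temporary directory and average them."""
    fake_runs = [
        {"code_block_tag": "train", "total_execution_time": 10},
        {"code_block_tag": "train", "total_execution_time": 20},
    ]
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, run in enumerate(fake_runs):
            with open(os.path.join(tmp_dir, f"run_{i}.json"), "w") as f:
                json.dump(run, f)
        return average_runs_sketch(tmp_dir)


averaged = demo_average_runs()
```

# Unlike `average_runs`, this sketch does not assert that all runs share the same keys, so a run
# with a missing metric would raise a KeyError.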
def summarize_resource_usage(path: str, tags: list[str] | None = None) -> list[ResourceUsageSummary]:
"""Create resource usage summaries for each code block/function that was decorated with ResourceUsageTracker.
Each entry of the list corresponds to the metrics collected from a code block/function run.
Important: code blocks that ran more than once are averaged.
:param path: corresponds to the `output_dir` argument in a ResourceUsageTracker run.
:param tags: (optional) list of tags to create summary for. If None, metrics from all tags will be summarized.
"""
summary = dict()
# metric types: system_resource_usage, torch_ops_resource_usage.
all_metric_types = {"system_resource_usage", "torch_ops_resource_usage"}
for metric_type in all_metric_types.intersection(os.listdir(path)):
metric_type_path = os.path.join(path, metric_type)
# code block tags correspond to the `tag` argument in ResourceUsageTracker.
for code_block_tag in os.listdir(metric_type_path):
if tags and code_block_tag not in tags:
continue
if code_block_tag not in summary:
summary[code_block_tag] = {}
run_path = os.path.join(metric_type_path, code_block_tag)
# Metrics from code blocks/functions that ran more than once are averaged.
summary[code_block_tag][metric_type] = average_runs(run_path)
summary_list = []
for code_block_tag, metric_type_dicts in summary.items():
merged_summary: dict[str, float | int] = {}
for metrics in metric_type_dicts.values():
assert "num_runs" in metrics
assert "num_runs" not in merged_summary or metrics["num_runs"] == merged_summary["num_runs"]
merged_summary.update(metrics)
summary_list.append(
ResourceUsageSummary(
code_block_tag=code_block_tag, metric_to_values=merged_summary, metric_names=set(merged_summary)
)
)
return summary_list
def build_resource_usage_diff(
base_path: str,
experimental_path: str,
base_experiment_name: str | None = None,
experimental_experiment_name: str | None = None,
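) -> list[ResourceUsageDiff]:
"""Build and return a list of ResourceUsageDiff objects, one for each code block tag found in both experiments.
:param base_path: corresponds to the `output_dir` argument in the base ResourceUsageTracker run.
:param experimental_path: corresponds to the `output_dir` argument in the experimental ResourceUsageTracker run.
:param base_experiment_name: (optional) name of the base experiment. Defaults to "experiment_1".
:param experimental_experiment_name: (optional) name of the experimental experiment. Defaults to "experiment_2".
"""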
base_summary_list = summarize_resource_usage(base_path)
experimental_summary_list = summarize_resource_usage(experimental_path)
summaries_list = []
for base_summary in base_summary_list:
for experimental_summary in experimental_summary_list:
if base_summary.code_block_tag == experimental_summary.code_block_tag:
summaries_list.append((base_summary, experimental_summary))
diffs = []
for base_summary, experimental_summary in summaries_list:
metrics_in_common = set(base_summary.metric_names).intersection(set(experimental_summary.metric_names))
metrics: list[MetricDiff] = [
build_diff(name, base_summary.metric_to_values[name], experimental_summary.metric_to_values[name])
for name in metrics_in_common
]
diff = ResourceUsageDiff(
code_block_tag=base_summary.code_block_tag,
base_experiment_name=base_experiment_name if base_experiment_name else "experiment_1",
experimental_experiment_name=(
experimental_experiment_name if experimental_experiment_name else "experiment_2"
),
metrics=metrics,
)
diffs.append(diff)
return diffs
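# Illustrative sketch (not part of Ludwig): pairing two experiments by code block tag and diffing
# only the metrics they have in common, as `build_resource_usage_diff` does. Plain dicts stand in
# for the dataclasses, and the percentage formula (relative to the base value, "inf" for a zero
# base) is an assumption, since `build_diff` is defined elsewhere.

```python
def diff_metrics_in_common(base: dict[str, float], experimental: dict[str, float]) -> dict[str, dict]:
    """Diff every metric that appears in both experiments."""
    diffs = {}
    for name in sorted(set(base) & set(experimental)):
        diff = experimental[name] - base[name]
        # Assumption: the percentage is relative to the base value.
        diff_percentage = 100 * diff / base[name] if base[name] != 0 else "inf"
        diffs[name] = {"diff": diff, "diff_percentage": diff_percentage}
    return diffs


def pair_by_tag(base_runs: dict[str, dict], experimental_runs: dict[str, dict]) -> dict[str, dict]:
    """Keep only the tags present in both experiments and diff their metrics."""
    return {
        tag: diff_metrics_in_common(base_runs[tag], experimental_runs[tag])
        for tag in base_runs
        if tag in experimental_runs
    }


base_runs = {"train": {"time": 10.0, "memory": 4.0}, "evaluate": {"time": 1.0}}
experimental_runs = {"train": {"time": 8.0, "flops": 3.0}}
tag_diffs = pair_by_tag(base_runs, experimental_runs)
```

# Tags and metrics that exist in only one experiment ("evaluate", "memory", "flops" here) are
# silently dropped from the result.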
================================================
FILE: ludwig/benchmarking/utils.py
================================================
import asyncio
import functools
import logging
import os
import shutil
import uuid
from concurrent.futures import ThreadPoolExecutor
from types import ModuleType
from typing import Any
import fsspec
import pandas as pd
import yaml
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, CATEGORY
from ludwig.datasets import model_configs_for_dataset
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
from ludwig.globals import CONFIG_YAML, MODEL_FILE_NAME, MODEL_WEIGHTS_FILE_NAME
from ludwig.utils.data_utils import load_yaml
from ludwig.utils.dataset_utils import get_repeatable_train_val_test_split
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import get_fs_and_path
HYPEROPT_OUTDIR_RETAINED_FILES = [
"hyperopt_statistics.json",
"params.json",
"stderr",
"stdout",
"result.json",
"error.txt",
]
logger = logging.getLogger()
def load_from_module(
dataset_module: DatasetLoader | ModuleType, output_feature: dict[str, str], subsample_frac: float = 1
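) -> pd.DataFrame:
"""Load the Ludwig dataset, optionally subsample it, and return a repeatable split. A stratified split is
used for classification datasets.
Args:
dataset_module: ludwig datasets module (e.g. ludwig.datasets.sst2, ludwig.datasets.ames_housing, etc.)
output_feature: output feature dict. Its "type" decides whether the split is stratified, and its "name" is
the column to stratify on.
subsample_frac: fraction of the total dataset to load. Values below 1 subsample without replacement.
"""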
dataset = dataset_module.load(split=False)
if subsample_frac < 1:
dataset = dataset.sample(frac=subsample_frac, replace=False, random_state=default_random_seed)
if output_feature["type"] in [CATEGORY, BINARY]:
return get_repeatable_train_val_test_split(
dataset,
stratify_colname=output_feature["name"],
random_seed=default_random_seed,
)
else:
return get_repeatable_train_val_test_split(dataset, random_seed=default_random_seed)
def export_artifacts(experiment: dict[str, str], experiment_output_directory: str, export_base_path: str):
"""Save the experiment artifacts under `export_base_path`.
Args:
experiment: experiment dict that contains "dataset_name" (e.g. ames_housing),
"experiment_name" (specified by user), and "config_path" (path to the experiment config,
relative to ludwig/benchmarks/configs).
experiment_output_directory: path where the model, data, and logs of the experiment are saved.
export_base_path: remote or local path (directory) where artifacts are
exported. (e.g. s3://benchmarking.us-west-2.ludwig.com/bench/ or your/local/bench/)
"""
protocol, _ = fsspec.core.split_protocol(export_base_path)
fs, _ = get_fs_and_path(export_base_path)
try:
export_full_path = os.path.join(export_base_path, experiment["dataset_name"], experiment["experiment_name"])
# override previous experiment with the same name
if fs.exists(export_full_path):
fs.rm(export_full_path, recursive=True)
fs.put(experiment_output_directory, export_full_path, recursive=True)
fs.put(
os.path.join(experiment["config_path"]),
os.path.join(export_full_path, CONFIG_YAML),
)
logger.info(f"Uploaded experiment artifact to\n\t{export_full_path}")
except Exception:
logger.exception(
f"Failed to upload experiment artifacts for experiment *{experiment['experiment_name']}* on "
f"dataset {experiment['dataset_name']}"
)
def download_artifacts(
bench_config_path: str,
base_experiment: str,
experimental_experiment: str,
download_base_path: str,
local_dir: str = "benchmarking_summaries",
) -> tuple[str, list[str]]:
"""Download benchmarking artifacts for two experiments.
Args:
bench_config_path: bench config file path. Can be the same one that was used to run
these experiments.
base_experiment: name of the experiment we're comparing against.
experimental_experiment: name of the experiment we're comparing.
download_base_path: base path under which the artifacts of the benchmarking
experiments are stored.
local_dir: local directory under which the artifacts will be downloaded.
"""
bench_config = load_yaml(bench_config_path)
protocol, _ = fsspec.core.split_protocol(download_base_path)
fs, _ = get_fs_and_path(download_base_path)
os.makedirs(local_dir, exist_ok=True)
coroutines = []
for experiment in bench_config["experiments"]:
dataset_name = experiment["dataset_name"]
for experiment_name in [base_experiment, experimental_experiment]:
coroutines.append(download_one(fs, download_base_path, dataset_name, experiment_name, local_dir))
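# asyncio.run() requires a coroutine, but asyncio.gather() returns a Future, so the gather call is wrapped.
async def _download_all():
    return await asyncio.gather(*coroutines, return_exceptions=True)
downloaded_names = asyncio.run(_download_all())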
dataset_names = [experiment_tuple[0] for experiment_tuple in set(downloaded_names) if experiment_tuple[0]]
assert (
len({experiment_tuple[1] for experiment_tuple in downloaded_names}) == 1 and downloaded_names[0][1] == local_dir
), "Experiments not downloaded to the same path"
return local_dir, dataset_names
@DeveloperAPI
async def download_one(
fs, download_base_path: str, dataset_name: str, experiment_name: str, local_dir: str
) -> tuple[str, str]:
"""Download `config.yaml` and `report.json` for an experiment.
Args:
fs: filesystem to use to download.
download_base_path: base path under which the artifacts of the benchmarking
experiments are stored.
dataset_name: name of the dataset we ran the experiments on.
experiment_name: name of the experiment (e.g. `v0.5.3_with_bert`)
local_dir: local directory under which the artifacts will be downloaded.
"""
loop = asyncio.get_running_loop()
local_experiment_dir = os.path.join(local_dir, dataset_name, experiment_name)
remote_experiment_directory = os.path.join(download_base_path, dataset_name, experiment_name)
os.makedirs(local_experiment_dir, exist_ok=True)
try:
with ThreadPoolExecutor() as pool:
func = functools.partial(
fs.get,
remote_experiment_directory,
local_experiment_dir,
recursive=True,
)
await loop.run_in_executor(pool, func)
except Exception:
logger.exception(f"Couldn't download experiment *{experiment_name}* of dataset *{dataset_name}*.")
return "", local_dir
return dataset_name, local_dir
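# Illustrative sketch (not part of Ludwig): running blocking calls concurrently from asyncio by
# handing them to a thread pool, the pattern `download_one` uses for `fs.get`. `blocking_fetch`
# is a hypothetical stand-in for the blocking filesystem call.

```python
import asyncio
import functools
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(name: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a blocking call such as a recursive download."""
    time.sleep(delay)
    return name


async def fetch_one(pool: ThreadPoolExecutor, name: str) -> tuple[str, bool]:
    """Run one blocking fetch in the pool and report whether it succeeded."""
    loop = asyncio.get_running_loop()
    try:
        await loop.run_in_executor(pool, functools.partial(blocking_fetch, name))
    except Exception:
        return name, False
    return name, True


async def fetch_all(names: list[str]) -> list[tuple[str, bool]]:
    """Fetch every name concurrently; gather returns results in input order."""
    with ThreadPoolExecutor() as pool:
        return await asyncio.gather(*(fetch_one(pool, name) for name in names))


results = asyncio.run(fetch_all(["ames_housing", "sst2"]))
```

# The sketch shares one pool across all fetches and wraps `asyncio.gather` in a coroutine,
# because `asyncio.run` accepts a coroutine rather than the Future that `gather` returns.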
def validate_benchmarking_config(benchmarking_config: dict[str, Any]) -> None:
"""Validates the parameters of the benchmarking config.
Args:
benchmarking_config: benchmarking config dictionary.
Raises:
ValueError if any of the expected parameters is not there.
"""
# Check for "experiments" first: the experiment name check below indexes into it.
if "experiments" not in benchmarking_config:
raise ValueError("You must specify a list of experiments.")
if "experiment_name" not in benchmarking_config and not all(
"experiment_name" in experiment for experiment in benchmarking_config["experiments"]
):
raise ValueError("You must either specify a global experiment name or an experiment name for each experiment.")
if "export" not in benchmarking_config:
raise ValueError("""You must specify export parameters. Example:
export:
export_artifacts: true
export_base_path: s3://benchmarking.us-west-2.ludwig.com/bench/ # include the slash at the end.
""")
for experiment in benchmarking_config["experiments"]:
if "dataset_name" not in experiment:
raise ValueError("A Ludwig dataset must be specified.")
def populate_benchmarking_config_with_defaults(benchmarking_config: dict[str, Any]) -> dict[str, Any]:
"""Populates the parameters of the benchmarking config with defaults.
Args:
benchmarking_config: benchmarking config dictionary.
"""
if "hyperopt" not in benchmarking_config:
benchmarking_config["hyperopt"] = False
if "process_config_file_path" not in benchmarking_config:
benchmarking_config["process_config_file_path"] = None
if "profiler" not in benchmarking_config:
benchmarking_config["profiler"] = {"enable": False, "use_torch_profiler": False, "logging_interval": None}
return benchmarking_config
def propagate_global_parameters(benchmarking_config: dict[str, Any]) -> dict[str, Any]:
"""Propagate the global parameters of the benchmarking config to local experiments.
Args:
benchmarking_config: benchmarking config dictionary.
"""
for experiment in benchmarking_config["experiments"]:
if "experiment_name" not in experiment:
experiment["experiment_name"] = benchmarking_config["experiment_name"]
if "export" not in experiment:
experiment["export"] = benchmarking_config["export"]
if "hyperopt" not in experiment:
experiment["hyperopt"] = benchmarking_config["hyperopt"]
if "process_config_file_path" not in experiment:
experiment["process_config_file_path"] = benchmarking_config["process_config_file_path"]
if "profiler" not in experiment:
experiment["profiler"] = benchmarking_config["profiler"]
return benchmarking_config
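# Illustrative sketch (not part of Ludwig): global settings act as defaults that each experiment
# may override, which is the idea behind `propagate_global_parameters`. `propagate_globals` is a
# hypothetical helper and the config below is made up.

```python
import copy

GLOBAL_KEYS = ("experiment_name", "export", "hyperopt", "process_config_file_path", "profiler")


def propagate_globals(config: dict) -> dict:
    """Return a copy of the config where each experiment inherits missing global keys."""
    config = copy.deepcopy(config)
    for experiment in config["experiments"]:
        for key in GLOBAL_KEYS:
            if key in config:
                # setdefault keeps a value the experiment already defines.
                experiment.setdefault(key, config[key])
    return config


original = {
    "experiment_name": "baseline",
    "hyperopt": False,
    "experiments": [
        {"dataset_name": "ames_housing"},
        {"dataset_name": "sst2", "experiment_name": "custom"},
    ],
}
propagated = propagate_globals(original)
```

# Two deliberate differences from `propagate_global_parameters`: the sketch works on a deep copy
# instead of mutating its argument, and it skips global keys that are absent instead of raising
# a KeyError.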
def create_default_config(experiment: dict[str, Any]) -> str:
"""Create a Ludwig config that only contains input and output features.
Args:
experiment: experiment dictionary.
Returns:
path where the default config is saved.
"""
model_config = model_configs_for_dataset(experiment["dataset_name"])["default"]
# only keep input_features and output_features
main_config_keys = list(model_config.keys())
for key in main_config_keys:
if key not in ["input_features", "output_features"]:
del model_config[key]
config_path = f"{experiment['dataset_name']}-{uuid.uuid4().hex}.yaml"
save_yaml(config_path, model_config)
return config_path
def delete_model_checkpoints(output_directory: str):
"""Deletes the training checkpoints and model weights, which we don't want to save with the artifacts.
Args:
output_directory: output directory of the experiment run.
"""
shutil.rmtree(os.path.join(output_directory, MODEL_FILE_NAME, "training_checkpoints"), ignore_errors=True)
if os.path.isfile(os.path.join(output_directory, MODEL_FILE_NAME, MODEL_WEIGHTS_FILE_NAME)):
os.remove(os.path.join(output_directory, MODEL_FILE_NAME, MODEL_WEIGHTS_FILE_NAME))
def delete_hyperopt_outputs(output_directory: str):
"""Deletes outputs of the hyperopt run that we don't want to save with the artifacts.
Args:
output_directory: output directory of the hyperopt run.
"""
for path, currentDirectory, files in os.walk(output_directory):
for file in files:
filename = os.path.join(path, file)
if file not in HYPEROPT_OUTDIR_RETAINED_FILES:
os.remove(filename)
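# Illustrative sketch (not part of Ludwig): pruning a directory tree down to an allow-list of
# file names with `os.walk`, as `delete_hyperopt_outputs` does. `prune_directory` and the file
# names are hypothetical, and the demo runs inside a temporary directory.

```python
import os
import tempfile

RETAINED = {"params.json", "result.json"}


def prune_directory(root: str, retained: set[str]) -> list[str]:
    """Remove every file under root whose base name is not retained; return removed names."""
    removed = []
    for dir_path, _dir_names, file_names in os.walk(root):
        for file_name in file_names:
            if file_name not in retained:
                os.remove(os.path.join(dir_path, file_name))
                removed.append(file_name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as tmp_dir:
    trial_dir = os.path.join(tmp_dir, "trial_0")
    os.makedirs(trial_dir)
    for name in ("params.json", "result.json", "checkpoint.bin"):
        open(os.path.join(trial_dir, name), "w").close()
    removed = prune_directory(tmp_dir, RETAINED)
    remaining = sorted(os.listdir(trial_dir))
```

# Matching is done on the base name only, so a retained name is kept at any depth of the tree,
# and directories emptied by the pruning are left in place.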
def save_yaml(filename, dictionary):
with open(filename, "w") as f:
yaml.dump(dictionary, f, default_flow_style=False)
def format_time(time_us):
"""Defines how to format time in FunctionEvent.
from https://github.com/pytorch/pytorch/blob/master/torch/autograd/profiler_util.py
"""
US_IN_SECOND = 1000.0 * 1000.0
US_IN_MS = 1000.0
if time_us >= US_IN_SECOND:
return f"{time_us / US_IN_SECOND:.3f}s"
if time_us >= US_IN_MS:
return f"{time_us / US_IN_MS:.3f}ms"
return f"{time_us:.3f}us"
def format_memory(nbytes):
"""Returns a formatted memory size string.
from https://github.com/pytorch/pytorch/blob/master/torch/autograd/profiler_util.py
"""
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
if abs(nbytes) >= GB:
return f"{nbytes * 1.0 / GB:.2f} Gb"
elif abs(nbytes) >= MB:
return f"{nbytes * 1.0 / MB:.2f} Mb"
elif abs(nbytes) >= KB:
return f"{nbytes * 1.0 / KB:.2f} Kb"
else:
return str(nbytes) + " b"
================================================
FILE: ludwig/callbacks.py
================================================
# !/usr/bin/env python
# Copyright (c) 2021 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from abc import ABC
from collections.abc import Callable
from typing import Any
from ludwig.api_annotations import PublicAPI
from ludwig.types import HyperoptConfigDict, ModelConfigDict, TrainingSetMetadataDict
@PublicAPI
class Callback(ABC):
def on_cmdline(self, cmd: str, *args: list[str]):
"""Called when Ludwig is run on the command line with the callback enabled.
:param cmd: The Ludwig subcommand being run, ex. "train", "evaluate", "predict", ...
:param args: The full list of command-line arguments (sys.argv).
"""
def on_preprocess_start(self, config: ModelConfigDict, **kwargs):
"""Called before preprocessing starts.
:param config: The config dictionary.
"""
def on_preprocess_end(
self,
training_set,
validation_set,
test_set,
training_set_metadata: TrainingSetMetadataDict,
**kwargs,
):
"""Called after preprocessing ends.
:param training_set: The training set.
:type training_set: ludwig.dataset.base.Dataset
:param validation_set: The validation set.
:type validation_set: ludwig.dataset.base.Dataset
:param test_set: The test set.
:type test_set: ludwig.dataset.base.Dataset
:param training_set_metadata: Values inferred from the training set, including preprocessing settings,
vocabularies, feature statistics, etc. Same as training_set_metadata.json.
"""
def on_hyperopt_init(self, experiment_name: str, **kwargs):
"""Called to initialize state before hyperparameter optimization begins.
:param experiment_name: The name of the current experiment.
"""
def on_hyperopt_preprocessing_start(self, experiment_name: str, **kwargs):
"""Called before data preprocessing for hyperparameter optimization begins.
:param experiment_name: The name of the current experiment.
"""
def on_hyperopt_preprocessing_end(self, experiment_name: str, **kwargs):
"""Called after data preprocessing for hyperparameter optimization is completed.
:param experiment_name: The name of the current experiment.
"""
def on_hyperopt_start(self, experiment_name: str, **kwargs):
"""Called before any hyperparameter optimization trials are started.
:param experiment_name: The name of the current experiment.
"""
def on_hyperopt_end(self, experiment_name: str, **kwargs):
"""Called after all hyperparameter optimization trials are completed.
:param experiment_name: The name of the current experiment.
"""
def on_hyperopt_finish(self, experiment_name: str, **kwargs):
"""Deprecated.
Use on_hyperopt_end instead.
"""
# TODO(travis): remove in favor of on_hyperopt_end for naming consistency
def on_hyperopt_trial_start(self, parameters: HyperoptConfigDict, **kwargs):
"""Called before the start of each hyperparameter optimization trial.
:param parameters: The complete dictionary of parameters for this hyperparameter optimization experiment.
"""
def on_hyperopt_trial_end(self, parameters: HyperoptConfigDict, **kwargs):
"""Called after the end of each hyperparameter optimization trial.
:param parameters: The complete dictionary of parameters for this hyperparameter optimization experiment.
"""
def should_stop_hyperopt(self):
"""Returns true if the entire hyperopt run (all trials) should be stopped.
See: https://docs.ray.io/en/latest/tune/api_docs/stoppers.html#ray.tune.Stopper
"""
return False
def on_resume_training(self, is_coordinator: bool, **kwargs):
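"""Called when training is resumed from a previous run.
:param is_coordinator: Is this trainer the coordinator.
"""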
def on_train_init(
self,
base_config: ModelConfigDict,
experiment_directory: str,
experiment_name: str,
model_name: str,
output_directory: str,
resume_directory: str | None,
**kwargs,
):
"""Called after preprocessing, but before the creation of the model and trainer objects.
:param base_config: The user-specified config, before the insertion of defaults or inferred values.
:param experiment_directory: The experiment directory, same as output_directory if no experiment specified.
:param experiment_name: The experiment name.
:param model_name: The model name.
:param output_directory: file path to where training results are stored.
:param resume_directory: model directory to resume training from, or None.
"""
def on_train_start(
self,
model,
config: ModelConfigDict,
config_fp: str | None,
**kwargs,
):
"""Called after creation of trainer, before the start of training.
:param model: The ludwig model.
:type model: ludwig.utils.torch_utils.LudwigModule
:param config: The config dictionary.
:param config_fp: The file path to the config, or None if the config was passed via stdin.
"""
def on_train_end(self, output_directory: str, **kwargs):
"""Called at the end of training, before the model is saved.
:param output_directory: file path to where training results are stored.
"""
def on_trainer_train_setup(self, trainer, save_path: str, is_coordinator: bool, **kwargs):
"""Called in every trainer (distributed or local) before training starts.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param save_path: The path to the directory model is saved in.
:param is_coordinator: Is this trainer the coordinator.
"""
def on_trainer_train_teardown(self, trainer, progress_tracker, save_path: str, is_coordinator: bool, **kwargs):
"""Called in every trainer (distributed or local) after training completes.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
:param is_coordinator: Is this trainer the coordinator.
"""
def on_batch_start(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator only before each batch.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_batch_end(self, trainer, progress_tracker, save_path: str, sync_step: bool = True, **kwargs):
"""Called on coordinator only after each batch.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
:param sync_step: Whether the model params were updated and synced in this step.
"""
def on_eval_start(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator at the start of evaluation.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_eval_end(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator at the end of evaluation.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_epoch_start(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator only before the start of each epoch.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_epoch_end(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator only after the end of each epoch.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_validation_start(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator before validation starts.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_validation_end(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator after validation is complete.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_test_start(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator before testing starts.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def on_test_end(self, trainer, progress_tracker, save_path: str, **kwargs):
"""Called on coordinator after testing ends.
:param trainer: The trainer instance.
:type trainer: ludwig.models.trainer.Trainer
:param progress_tracker: An object which tracks training progress.
:type progress_tracker: ludwig.utils.trainer_utils.ProgressTracker
:param save_path: The path to the directory model is saved in.
"""
def should_early_stop(self, trainer, progress_tracker, is_coordinator, **kwargs):
# Triggers early stopping if any callback on any worker returns True
return False
def on_checkpoint(self, trainer, progress_tracker, **kwargs):
"""Called after each checkpoint is passed, regardless of whether the model was evaluated or saved at that
checkpoint."""
def on_save_best_checkpoint(self, trainer, progress_tracker, save_path, **kwargs):
"""Called on every worker immediately after a new best model is checkpointed."""
def on_build_metadata_start(self, df, mode: str, **kwargs):
"""Called before building metadata for dataset.
:param df: The dataset.
:type df: pd.DataFrame
:param mode: "prediction", "training", or None.
"""
def on_build_metadata_end(self, df, mode, **kwargs):
"""Called after building dataset metadata.
:param df: The dataset.
:type df: pd.DataFrame
:param mode: "prediction", "training", or None.
"""
def on_build_data_start(self, df, mode, **kwargs):
"""Called before build_data, which does preprocessing, handling missing values, adding metadata to
training_set_metadata.
:param df: The dataset.
:type df: pd.DataFrame
:param mode: "prediction", "training", or None.
"""
def on_build_data_end(self, df, mode, **kwargs):
"""Called after build_data completes.
:param df: The dataset.
:type df: pd.DataFrame
:param mode: "prediction", "training", or None.
"""
def on_evaluation_start(self, **kwargs):
"""Called before preprocessing for evaluation."""
def on_evaluation_end(self, **kwargs):
"""Called after evaluation is complete."""
def on_visualize_figure(self, fig, **kwargs):
"""Called after a visualization is generated.
:param fig: The figure.
:type fig: matplotlib.figure.Figure
"""
def on_ludwig_end(self, **kwargs):
"""Convenience method for any cleanup.
Not yet implemented.
"""
def prepare_ray_tune(
self,
train_fn: Callable,
tune_config: dict[str, Any],
tune_callbacks: list[Callable],
**kwargs,
):
"""Configures Ray Tune callback and config.
:param train_fn: The function which runs the experiment trial.
:param tune_config: The ray tune configuration dictionary.
:param tune_callbacks: List of callbacks (not used yet).
:returns: Tuple[Callable, Dict] The train_fn and tune_config, which will be passed to ray tune.
"""
return train_fn, tune_config
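# Illustrative sketch (not part of Ludwig): a callback overrides only the hooks it needs and
# inherits no-op defaults for the rest. To run without Ludwig installed, a stand-in `Callback`
# with two of the hooks above is defined here; `EpochBudgetCallback` is hypothetical.

```python
class Callback:
    """Stand-in for `ludwig.callbacks.Callback`, reduced to the two hooks used below."""

    def on_epoch_end(self, trainer, progress_tracker, save_path: str, **kwargs):
        pass

    def should_early_stop(self, trainer, progress_tracker, is_coordinator, **kwargs):
        return False


class EpochBudgetCallback(Callback):
    """Hypothetical callback that requests early stopping after a fixed number of epochs."""

    def __init__(self, max_epochs: int):
        self.max_epochs = max_epochs
        self.epochs_seen = 0

    def on_epoch_end(self, trainer, progress_tracker, save_path: str, **kwargs):
        self.epochs_seen += 1

    def should_early_stop(self, trainer, progress_tracker, is_coordinator, **kwargs):
        return self.epochs_seen >= self.max_epochs


callback = EpochBudgetCallback(max_epochs=2)
decisions = []
for _ in range(3):
    # A real trainer would pass its own objects; None is only a placeholder here.
    callback.on_epoch_end(trainer=None, progress_tracker=None, save_path="")
    decisions.append(callback.should_early_stop(None, None, is_coordinator=True))
```

# Accepting `**kwargs` in every hook keeps the subclass compatible if more keyword arguments are
# passed later. Since `on_epoch_end` is documented as coordinator-only, a counter like this one
# would only advance on the coordinator.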
================================================
FILE: ludwig/check.py
================================================
import argparse
import logging
import tempfile
from ludwig.api import LudwigModel
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import INPUT_FEATURES, OUTPUT_FEATURES, TRAINER
from ludwig.data.dataset_synthesizer import build_synthetic_dataset_df
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
NUM_EXAMPLES = 100
@DeveloperAPI
def check_install(logging_level: int = logging.INFO, **kwargs):
config = {
INPUT_FEATURES: [
{"name": "in1", "type": "text"},
{"name": "in2", "type": "category"},
{"name": "in3", "type": "number"},
],
OUTPUT_FEATURES: [{"name": "out1", "type": "binary"}],
TRAINER: {"epochs": 2, "batch_size": 8},
}
try:
df = build_synthetic_dataset_df(NUM_EXAMPLES, config)
model = LudwigModel(config, logging_level=logging_level)
with tempfile.TemporaryDirectory() as tmpdir:
model.train(dataset=df, output_directory=tmpdir)
except Exception:
print("=== CHECK INSTALL COMPLETE... FAILURE ===")
raise
print("=== CHECK INSTALL COMPLETE... SUCCESS ===")
@DeveloperAPI
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This command checks Ludwig installation on a synthetic dataset.",
prog="ludwig check_install",
usage="%(prog)s [options]",
)
parser.add_argument(
"-l",
"--logging_level",
default="warning",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
args = parser.parse_args(sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.check")
print_ludwig("Check Install", LUDWIG_VERSION)
check_install(**vars(args))
================================================
FILE: ludwig/cli.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import sys
import ludwig.contrib
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logo
class CLI:
"""CLI describes a command line interface for interacting with Ludwig.
Functions are described below.
"""
def __init__(self):
parser = argparse.ArgumentParser(
description="ludwig cli runner",
usage=f"""\n{get_logo("ludwig cli", LUDWIG_VERSION)}
ludwig []
Available sub-commands:
train Trains a model
predict Predicts using a pretrained model
evaluate Evaluate a pretrained model's performance
forecast Forecast the next n data points in a timeseries using a pretrained model
experiment Runs a full experiment training a model and evaluating it
hyperopt Perform hyperparameter optimization
benchmark Run and track experiments on a number of datasets and configs, and export experiment artifacts.
serve Serves a pretrained model
visualize Visualizes experimental results
collect_summary Prints names of weights and layer activations to use with other collect commands
collect_weights Collects tensors containing a pretrained model's weights
collect_activations Collects tensors for each datapoint using a pretrained model
datasets Downloads and lists Ludwig-ready datasets
export_torchscript Exports Ludwig models to Torchscript
export_triton Exports Ludwig models to Triton
export_mlflow Exports Ludwig models to MLflow
export_schema Exports the Ludwig config JSON schema
preprocess Preprocesses data and saves it into HDF5 and JSON format
synthesize_dataset Creates synthetic data for testing purposes
init_config Initialize a user config from a dataset and targets
render_config Renders the fully populated config with all defaults set
check_install Runs a quick training run on synthetic data to verify installation status
upload Push trained model artifacts to a registry (e.g., Predibase, HuggingFace Hub)
""",
)
parser.add_argument("command", help="Subcommand to run")
# parse_args defaults to [1:] for args, but you need to
# exclude the rest of the args too, or validation will fail
args = parser.parse_args(sys.argv[1:2])
if not hasattr(self, args.command):
print("Unrecognized command")
parser.print_help()
sys.exit(1)
# use dispatch pattern to invoke method with same name
getattr(self, args.command)()
def train(self):
from ludwig import train
train.cli(sys.argv[2:])
def predict(self):
from ludwig import predict
predict.cli(sys.argv[2:])
def evaluate(self):
from ludwig import evaluate
evaluate.cli(sys.argv[2:])
def forecast(self):
from ludwig import forecast
forecast.cli(sys.argv[2:])
def experiment(self):
from ludwig import experiment
experiment.cli(sys.argv[2:])
def hyperopt(self):
from ludwig import hyperopt_cli
hyperopt_cli.cli(sys.argv[2:])
def benchmark(self):
from ludwig.benchmarking import benchmark
benchmark.cli(sys.argv[2:])
def serve(self):
from ludwig import serve
serve.cli(sys.argv[2:])
def visualize(self):
from ludwig import visualize
visualize.cli(sys.argv[2:])
def collect_summary(self):
from ludwig import collect
collect.cli_collect_summary(sys.argv[2:])
def collect_weights(self):
from ludwig import collect
collect.cli_collect_weights(sys.argv[2:])
def collect_activations(self):
from ludwig import collect
collect.cli_collect_activations(sys.argv[2:])
def export_torchscript(self):
from ludwig import export
export.cli_export_torchscript(sys.argv[2:])
def export_triton(self):
from ludwig import export
export.cli_export_triton(sys.argv[2:])
def export_mlflow(self):
from ludwig import export
export.cli_export_mlflow(sys.argv[2:])
def export_schema(self):
from ludwig.schema.export_schema import main as export_schema_main
export_schema_main(sys.argv[2:])
def preprocess(self):
from ludwig import preprocess
preprocess.cli(sys.argv[2:])
def synthesize_dataset(self):
from ludwig.data import dataset_synthesizer
dataset_synthesizer.cli(sys.argv[2:])
def init_config(self):
from ludwig import automl
automl.cli_init_config(sys.argv[2:])
def render_config(self):
from ludwig.utils import defaults
defaults.cli_render_config(sys.argv[2:])
def check_install(self):
from ludwig import check
check.cli(sys.argv[2:])
def datasets(self):
from ludwig import datasets
datasets.cli(sys.argv[2:])
def upload(self):
from ludwig import upload
upload.cli(sys.argv[2:])
def main():
ludwig.contrib.preload(sys.argv)
CLI()
if __name__ == "__main__":
main()
================================================
FILE: ludwig/collect.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import importlib
import logging
import os
import sys
import numpy as np
import torch
import torchinfo
from ludwig.api import LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend
from ludwig.callbacks import Callback
from ludwig.constants import FULL, TEST, TRAINING, VALIDATION
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_boxed, print_ludwig
from ludwig.utils.strings_utils import make_safe_filename
logger = logging.getLogger(__name__)
def collect_activations(
model_path: str,
layers: list[str],
dataset: str,
data_format: str = None,
split: str = FULL,
batch_size: int = 128,
output_directory: str = "results",
gpus: list[str] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
backend: Backend | str = None,
**kwargs,
) -> list[str]:
"""Uses the pretrained model to collect the tensors corresponding to a datapoint in the dataset. Saves the
tensors to the experiment directory.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param layers: (List[str]) list of strings for layer names in the model
to collect activations.
:param dataset: (str) source
containing the data to make predictions.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param split: (str, default: `full`) split on which
to perform predictions. Valid values are `'training'`, `'validation'`,
`'test'` and `'full'`.
:param batch_size: (int, default `128`) size of batches for processing.
:param output_directory: (str, default: `'results'`) the directory where
the collected activations will be stored.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
# Return
:return: (List[str]) list of filepaths to `*.npy` files containing
the activations.
"""
logger.info(f"Dataset path: {dataset}")
logger.info(f"Model path: {model_path}")
logger.info(f"Output path: {output_directory}")
logger.info("\n")
model = LudwigModel.load(
model_path,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
backend=backend,
)
# collect activations
print_boxed("COLLECT ACTIVATIONS")
collected_tensors = model.collect_activations(
layers, dataset, data_format=data_format, split=split, batch_size=batch_size
)
# saving
os.makedirs(output_directory, exist_ok=True)
saved_filenames = save_tensors(collected_tensors, output_directory)
logger.info(f"Saved to: {output_directory}")
return saved_filenames
def collect_weights(model_path: str, tensors: list[str], output_directory: str = "results", **kwargs) -> list[str]:
"""Loads a pretrained model and collects weights.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param tensors: (List[str]) list of names of the tensors whose weights
are collected.
:param output_directory: (str, default: `'results'`) the directory where
collected weights will be stored.
# Return
:return: (List[str]) list of filepaths to `*.npy` files containing
the weights.
"""
logger.info(f"Model path: {model_path}")
logger.info(f"Output path: {output_directory}")
logger.info("\n")
model = LudwigModel.load(model_path)
# collect weights
print_boxed("COLLECT WEIGHTS")
collected_tensors = model.collect_weights(tensors)
# saving
os.makedirs(output_directory, exist_ok=True)
saved_filenames = save_tensors(collected_tensors, output_directory)
logger.info(f"Saved to: {output_directory}")
return saved_filenames
def save_tensors(collected_tensors, output_directory):
filenames = []
for tensor_name, tensor_value in collected_tensors:
np_filename = os.path.join(output_directory, make_safe_filename(tensor_name) + ".npy")
if isinstance(tensor_value, torch.Tensor):
# Skip non-tensor collected artifacts, e.g. used_tokens.
np.save(np_filename, tensor_value.detach().cpu().numpy())
filenames.append(np_filename)
return filenames
def print_model_summary(model_path: str, **kwargs) -> None:
"""Loads a pretrained model and prints names of weights and layers activations.
# Inputs
:param model_path: (str) filepath to pre-trained model.
# Return
:return: (`None`)
"""
model = LudwigModel.load(model_path)
# Move model to CPU for torchinfo summary to avoid device mismatch issues.
model.model.cpu()
logger.info(torchinfo.summary(model.model, input_data=[model.model.get_model_inputs()], depth=20))
logger.info("\nModules:\n")
for name, _ in model.model.named_children():
logger.info(name)
logger.info("\nParameters:\n")
for name, _ in model.model.named_parameters():
logger.info(name)
def pretrained_summary(pretrained_model: str, **kwargs) -> None:
"""Loads a pretrained model from Huggingface or Torchvision models and prints names of layers.
# Inputs
:param pretrained_model: (str) name of model to load (case sensitive).
# Return
:return: (`None`)
"""
from transformers import AutoConfig, AutoModel
model = None
# get access token if available
token = os.getenv("HUGGING_FACE_HUB_TOKEN")
if token is None:
logger.info("No token provided. Continuing loading without token access.")
elif not token:
raise ValueError("Invalid token provided. Exiting.")
else:
logger.info("Valid token provided. Proceeding with token access.")
# Try to load from transformers/HF
# TODO -> Fix OOM on large models e.g. llama 3 8B
try:
config = AutoConfig.from_pretrained(pretrained_model, token=token, low_cpu_mem_usage=True)
model = AutoModel.from_config(config=config)
logger.info(f"Loaded {pretrained_model} from Hugging Face Transformers.")
except Exception as e:
logger.error(f"Failed to load {pretrained_model} from Hugging Face Transformers: {e}")
# Try and load from torchvision-models
if model is None:
try:
module = importlib.import_module("torchvision.models")
model = getattr(module, pretrained_model)(weights=None)
except AttributeError:
logger.error(f"{pretrained_model} is not a valid torchvision model.")
if model:
for name, _ in model.named_parameters():
logger.info(name)
else:
logger.error(f"Unable to load the model {pretrained_model} from any known source.")
def cli_collect_activations(sys_argv):
"""Command-line interface for collecting activation tensors. The following options can be specified when
calling this function:
--dataset: Filepath of the input dataset (required)
--data_format: Format of the input data, defaults to auto
-s, --split: Split of the data to use, one of: training, validation, test, full
-m, --model_path: Model to load (required)
-lyr, --layers: Names of the layers whose activations are collected (required)
-od, --output_directory: Directory where the results are saved, defaults to results
-bs, --batch_size: Batch size, defaults to 128
-g, --gpus: GPUs to use
-gml, --gpu_memory_limit: Maximum fraction of each GPU's memory to allocate
-dpt, --disable_parallel_threads: Disable PyTorch multithreading for reproducibility
-b, --backend: Backend to use for parallel / distributed execution
-l, --logging_level: Logging level, defaults to info
"""
parser = argparse.ArgumentParser(
description="This script loads a pretrained model and uses it to collect "
"tensors for each datapoint in the dataset.",
prog="ludwig collect_activations",
usage="%(prog)s [options]",
)
# ---------------
# Data parameters
# ---------------
parser.add_argument("--dataset", help="input data file path", required=True)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
parser.add_argument(
"-s",
"--split",
default=FULL,
choices=[TRAINING, VALIDATION, TEST, FULL],
help="the split to obtain the model activations from",
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument("-lyr", "--layers", help="tensors to collect", nargs="+", required=True)
# -------------------------
# Output results parameters
# -------------------------
parser.add_argument(
"-od", "--output_directory", type=str, default="results", help="directory that contains the results"
)
# ------------------
# Generic parameters
# ------------------
parser.add_argument("-bs", "--batch_size", type=int, default=128, help="size of batches")
# ------------------
# Runtime parameters
# ------------------
parser.add_argument("-g", "--gpus", type=int, default=0, help="list of GPUs to use")
parser.add_argument(
"-gml",
"--gpu_memory_limit",
type=float,
default=None,
help="maximum memory fraction [0, 1] allowed to allocate per GPU device",
)
parser.add_argument(
"-dpt",
"--disable_parallel_threads",
action="store_false",
dest="allow_parallel_threads",
help="disable PyTorch from using multithreading for reproducibility",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, " "defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("collect_activations", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.collect")
print_ludwig("Collect Activations", LUDWIG_VERSION)
collect_activations(**vars(args))
def cli_collect_weights(sys_argv):
"""Command-line interface for collecting the weights of a model.
-m, --model_path: Model to load (required)
-t, --tensors: Names of the tensors to collect (required)
-od, --output_directory: Directory where the results are saved, defaults to results
-l, --logging_level: Logging level, defaults to info
"""
parser = argparse.ArgumentParser(
description="This script loads a pretrained model " "and uses it to collect weights.",
prog="ludwig collect_weights",
usage="%(prog)s [options]",
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument("-t", "--tensors", help="tensors to collect", nargs="+", required=True)
# -------------------------
# Output results parameters
# -------------------------
parser.add_argument(
"-od", "--output_directory", type=str, default="results", help="directory that contains the results"
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("collect_weights", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.collect")
print_ludwig("Collect Weights", LUDWIG_VERSION)
collect_weights(**vars(args))
def cli_collect_summary(sys_argv):
"""Command-line interface for printing a summary of the model's layers and weights.
-m, --model_path: Ludwig model to load and summarize
-pm, --pretrained_model: Name of a Hugging Face or torchvision model to summarize
-l, --logging_level: Logging level, defaults to info
"""
parser = argparse.ArgumentParser(
description="This script loads a pretrained model "
"and prints names of weights and layers activations "
"to use with other collect commands",
prog="ludwig collect_summary",
usage="%(prog)s [options]",
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=False)
parser.add_argument(
"-pm", "--pretrained_model", help="pretrained model to summarize (torchvision and huggingface)", required=False
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("collect_summary", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.collect")
print_ludwig("Collect Summary", LUDWIG_VERSION)
if args.model_path:
print_model_summary(**vars(args))
elif args.pretrained_model and not args.model_path:
pretrained_summary(**vars(args))
if __name__ == "__main__":
if len(sys.argv) > 1:
if sys.argv[1] == "activations":
cli_collect_activations(sys.argv[2:])
elif sys.argv[1] == "weights":
cli_collect_weights(sys.argv[2:])
elif sys.argv[1] == "names":
cli_collect_summary(sys.argv[2:])
else:
print("Unrecognized command")
else:
print("Unrecognized command")
================================================
FILE: ludwig/combiners/__init__.py
================================================
================================================
FILE: ludwig/combiners/combiners.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from abc import ABC
from dataclasses import dataclass
from functools import lru_cache
import torch
from torch.nn import Linear, ModuleList
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, ENCODER_OUTPUT, NUMBER
from ludwig.encoders.registry import get_sequence_encoder_registry
from ludwig.features.base_feature import InputFeature
from ludwig.modules.attention_modules import TransformerStack
from ludwig.modules.embedding_modules import Embed
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.modules.reduction_modules import SequenceReducer
from ludwig.modules.tabnet_modules import TabNet
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.comparator import ComparatorCombinerConfig
from ludwig.schema.combiners.concat import ConcatCombinerConfig
from ludwig.schema.combiners.project_aggregate import ProjectAggregateCombinerConfig
from ludwig.schema.combiners.sequence import SequenceCombinerConfig
from ludwig.schema.combiners.sequence_concat import SequenceConcatCombinerConfig
from ludwig.schema.combiners.tab_transformer import TabTransformerCombinerConfig
from ludwig.schema.combiners.tabnet import TabNetCombinerConfig
from ludwig.schema.combiners.transformer import TransformerCombinerConfig
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.registry import Registry
from ludwig.utils.torch_utils import LudwigModule, sequence_length_3D
from ludwig.utils.torch_utils import sequence_mask as torch_sequence_mask
logger = logging.getLogger(__name__)
@dataclass
class Handle:
"""This class provides an opaque handle to the input features, preventing them from being registered as state.
This is important because we already reference the `input_features` as an attribute of ECD, so we don't need it to
appear twice in the state_dict. Furthermore, DeepSpeed will get terribly confused if we have the input features set as
an attribute of the combiner, which leads to shape mismatch errors when we go to load a saved checkpoint.
"""
input_features: dict[str, "InputFeature"]
@DeveloperAPI
class Combiner(LudwigModule, ABC):
"""Base class for combiners, which implements common properties.
Subclasses will usually override: __init__() to set properties and allocate resources. Should call
super().__init__(input_features). forward() performs the forward pass given a dictionary of encoder
outputs. get_schema_cls() must return the class of the corresponding schema for the combiner type.
"""
def __init__(self, input_features: dict[str, "InputFeature"]):
super().__init__()
self.handle = Handle(input_features)
@property
def concatenated_shape(self) -> torch.Size:
# compute the size of the last dimension for the incoming encoder outputs
# this is required to setup the fully connected layer
shapes = [
torch.prod(torch.Tensor([*self.handle.input_features.get(k).output_shape]))
for k in self.handle.input_features
]
return torch.Size([torch.sum(torch.Tensor(shapes)).type(torch.int32)])
@property
def input_shape(self) -> dict:
# input to combiner is a dictionary of the input features encoder
# outputs, this property returns dictionary of output shapes for each
# input feature's encoder output shapes.
return {k: self.handle.input_features.get(k).output_shape for k in self.handle.input_features}
@property
@lru_cache(maxsize=1)
def output_shape(self) -> torch.Size:
pseudo_input = {}
for k in self.handle.input_features:
pseudo_input[k] = {
ENCODER_OUTPUT: torch.rand(
2, *self.handle.input_features.get(k).output_shape, dtype=self.input_dtype, device=self.device
)
}
output_tensor = self.forward(pseudo_input)
return output_tensor["combiner_output"].size()[1:]
combiner_impl_registry = Registry[type[Combiner]]()
def register_combiner(config_cls: type[BaseCombinerConfig]):
def wrap(cls: type[Combiner]):
combiner_impl_registry[config_cls] = cls
return cls
return wrap
def create_combiner(config: BaseCombinerConfig, **kwargs) -> Combiner:
return combiner_impl_registry[type(config)](config=config, **kwargs)
@register_combiner(ConcatCombinerConfig)
class ConcatCombiner(Combiner):
def __init__(self, input_features: dict[str, "InputFeature"] = None, config: ConcatCombinerConfig = None, **kwargs):
super().__init__(input_features)
self.name = "ConcatCombiner"
logger.debug(f" {self.name}")
self.flatten_inputs = config.flatten_inputs
self.fc_stack = None
# todo future: this may be redundant, check
fc_layers = config.fc_layers
if fc_layers is None:
fc_layers = []
for i in range(config.num_fc_layers):
fc_layers.append({"output_size": config.output_size})
self.fc_layers = fc_layers
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=self.concatenated_shape[-1],
layers=config.fc_layers,
num_layers=config.num_fc_layers,
default_output_size=config.output_size,
default_use_bias=config.use_bias,
default_weights_initializer=config.weights_initializer,
default_bias_initializer=config.bias_initializer,
default_norm=config.norm,
default_norm_params=config.norm_params,
default_activation=config.activation,
default_dropout=config.dropout,
residual=config.residual,
)
if input_features and len(input_features) == 1 and self.fc_layers is None:
self.supports_masking = True
def forward(self, inputs: dict) -> dict: # encoder outputs
encoder_outputs = [inputs[k][ENCODER_OUTPUT] for k in inputs]
# ================ Flatten ================
if self.flatten_inputs:
batch_size = encoder_outputs[0].shape[0]
encoder_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in encoder_outputs]
# ================ Concat ================
if len(encoder_outputs) > 1:
hidden = torch.cat(encoder_outputs, 1)
else:
hidden = list(encoder_outputs)[0]
# ================ Fully Connected ================
hidden = self.fc_stack(hidden)
return_data = {"combiner_output": hidden}
if len(inputs) == 1:
# Workaround for including additional tensors from output of input encoders for
# potential use in decoders, e.g. LSTM state for seq2seq.
# TODO(Justin): Think about how to make this communication work for multi-sequence
# features. Other combiners.
for key, value in [d for d in inputs.values()][0].items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
@register_combiner(SequenceConcatCombinerConfig)
class SequenceConcatCombiner(Combiner):
def __init__(
self, input_features: dict[str, "InputFeature"], config: SequenceConcatCombinerConfig = None, **kwargs
):
super().__init__(input_features)
self.name = "SequenceConcatCombiner"
logger.debug(f" {self.name}")
self.reduce_output = config.reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=config.reduce_output,
max_sequence_length=self.concatenated_shape[0],
encoding_size=self.concatenated_shape[1],
)
if self.reduce_output is None:
self.supports_masking = True
self.main_sequence_feature = config.main_sequence_feature
@property
def concatenated_shape(self) -> torch.Size:
# computes the effective shape of the input tensor after combining
# all the encoder outputs
# determine max sequence length by finding the first sequence tensor
# assume all the sequences are of the same size, if not true
# this will be caught during processing
seq_size = None
for k in self.handle.input_features:
# dim-2 output_shape implies a sequence [seq_size, hidden]
if len(self.handle.input_features.get(k).output_shape) == 2:
seq_size = self.handle.input_features.get(k).output_shape[0]
break
# collect the size of the last dimension for all input feature
# encoder outputs
shapes = [
self.handle.input_features.get(k).output_shape[-1] for k in self.handle.input_features
] # output shape not input shape
return torch.Size([seq_size, sum(shapes)])
def forward(self, inputs: dict) -> dict: # encoder outputs
if self.main_sequence_feature is None or self.main_sequence_feature not in inputs:
for if_name, if_outputs in inputs.items():
# todo: when https://github.com/ludwig-ai/ludwig/issues/810 is closed
# convert following test from using shape to use explicit
# if_outputs[TYPE] values for sequence features
if len(if_outputs[ENCODER_OUTPUT].shape) == 3:
self.main_sequence_feature = if_name
break
if self.main_sequence_feature is None:
raise Exception("No sequence feature available for sequence combiner")
main_sequence_feature_encoding = inputs[self.main_sequence_feature]
representation = main_sequence_feature_encoding[ENCODER_OUTPUT]
representations = [representation]
sequence_max_length = representation.shape[1]
sequence_length = sequence_length_3D(representation)
# ================ Concat ================
for if_name, if_outputs in inputs.items():
if if_name != self.main_sequence_feature:
if_representation = if_outputs[ENCODER_OUTPUT]
if len(if_representation.shape) == 3:
# The following check makes sense when
# both representations have a specified
# sequence length dimension. If they do not,
# then this check is simply checking if None == None
# and will not catch discrepancies in the different
# feature length dimension. Those errors will show up
# at training time. A possible solution is
# to enforce a fixed length for the second dimension in
# sequential feature placeholders, but that
# does not work with BucketedBatcher that requires
# the second dimension to be undefined in order to be
# able to trim the data points and speed up computation.
# So for now we are keeping things like this, make sure
# to write in the documentation that training time
# dimensions mismatch may occur if the sequential
# features have different lengths for some data points.
if if_representation.shape[1] != representation.shape[1]:
raise ValueError(
"The sequence length of the input feature {} "
"is {} and is different from the sequence "
"length of the main sequence feature {} which "
"is {}.\n Shape of {}: {}, shape of {}: {}.\n"
"Sequence lengths of all sequential features "
"must be the same in order to be concatenated "
"by the sequence concat combiner. "
"Try to impose the same max sequence length "
"as a preprocessing parameter to both features "
"or to reduce the output of {}.".format(
if_name,
if_representation.shape[1],
self.main_sequence_feature,
representation.shape[1],
if_name,
if_representation.shape,
self.main_sequence_feature,
representation.shape,
if_name,
)
)
# this assumes all sequence representations have the
# same sequence length, 2nd dimension
representations.append(if_representation)
elif len(if_representation.shape) == 2:
multipliers = (1, sequence_max_length, 1)
tiled_representation = torch.tile(torch.unsqueeze(if_representation, 1), multipliers)
representations.append(tiled_representation)
else:
raise ValueError(
"The representation of {} has rank {} and cannot be"
" concatenated by a sequence concat combiner. "
"Only rank 2 and rank 3 tensors are supported.".format(if_name, len(if_representation.shape))
)
hidden = torch.cat(representations, 2)
logger.debug(f" concat_hidden: {hidden}")
# ================ Mask ================
sequence_mask = torch_sequence_mask(sequence_length, sequence_max_length)
hidden = torch.multiply(hidden, torch.unsqueeze(sequence_mask, -1).type(torch.float32))
# ================ Reduce ================
hidden = self.reduce_sequence(hidden)
return_data = {"combiner_output": hidden}
if len(inputs) == 1:
for key, value in [d for d in inputs.values()][0].items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
@register_combiner(SequenceCombinerConfig)
class SequenceCombiner(Combiner):
def __init__(self, input_features: dict[str, "InputFeature"], config: SequenceCombinerConfig = None, **kwargs):
super().__init__(input_features)
self.name = "SequenceCombiner"
logger.debug(f" {self.name}")
self.combiner = SequenceConcatCombiner(
input_features,
config=SequenceConcatCombinerConfig(reduce_output=None, main_sequence_feature=config.main_sequence_feature),
)
logger.debug(
f"combiner input shape {self.combiner.concatenated_shape}, " f"output shape {self.combiner.output_shape}"
)
self.encoder_obj = get_from_registry(config.encoder.type, get_sequence_encoder_registry())(
should_embed=False,
reduce_output=config.reduce_output,
embedding_size=self.combiner.output_shape[1],
max_sequence_length=self.combiner.output_shape[0],
**kwargs,
)
if hasattr(self.encoder_obj, "supports_masking") and self.encoder_obj.supports_masking:
self.supports_masking = True
@property
def concatenated_shape(self) -> torch.Size:
# computes the effective shape of the input tensor after combining
# all the encoder outputs
# determine max sequence length by finding the first sequence tensor
# assume all the sequences are of the same size, if not true
# this will be caught during processing
seq_size = None
for k in self.handle.input_features:
# dim-2 output_shape implies a sequence [seq_size, hidden]
if len(self.handle.input_features.get(k).output_shape) == 2:
seq_size = self.handle.input_features.get(k).output_shape[0]
break
# collect the size of the last dimension for all input feature
# encoder outputs
shapes = [
self.handle.input_features.get(k).output_shape[-1] for k in self.handle.input_features
] # output shape not input shape
return torch.Size([seq_size, sum(shapes)])
def forward(self, inputs: dict) -> dict: # encoder outputs
# ================ Concat ================
hidden = self.combiner(inputs)
# ================ Sequence encoding ================
hidden = self.encoder_obj(hidden["combiner_output"])
return_data = {"combiner_output": hidden[ENCODER_OUTPUT]}
for key, value in hidden.items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
@register_combiner(TabNetCombinerConfig)
class TabNetCombiner(Combiner):
def __init__(
self, input_features: dict[str, "InputFeature"], config: TabNetCombinerConfig = None, **kwargs
) -> None:
super().__init__(input_features)
self.name = "TabNetCombiner"
logger.debug(f" {self.name}")
self.tabnet = TabNet(
self.concatenated_shape[-1],
config.size,
config.output_size,
num_steps=config.num_steps,
num_total_blocks=config.num_total_blocks,
num_shared_blocks=config.num_shared_blocks,
relaxation_factor=config.relaxation_factor,
bn_epsilon=config.bn_epsilon,
bn_momentum=config.bn_momentum,
bn_virtual_bs=config.bn_virtual_bs,
sparsity=config.sparsity,
entmax_mode=config.entmax_mode,
entmax_alpha=config.entmax_alpha,
)
if config.dropout > 0:
self.dropout = torch.nn.Dropout(config.dropout)
else:
self.dropout = None
@property
def concatenated_shape(self) -> torch.Size:
# compute the total flattened size of the incoming encoder outputs;
# this is required to set up the input dimension of the TabNet module
shapes = [
torch.prod(torch.Tensor([*self.handle.input_features.get(k).output_shape]))
for k in self.handle.input_features
]
return torch.Size([torch.sum(torch.Tensor(shapes)).type(torch.int32)])
def forward(
self,
inputs: dict, # encoder outputs
) -> dict:
encoder_outputs = [inputs[k][ENCODER_OUTPUT] for k in inputs]
# ================ Flatten ================
batch_size = encoder_outputs[0].shape[0]
encoder_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in encoder_outputs]
# ================ Concat ================
if len(encoder_outputs) > 1:
hidden = torch.cat(encoder_outputs, 1)
else:
hidden = list(encoder_outputs)[0]
# ================ TabNet ================
hidden, aggregated_mask, masks = self.tabnet(hidden)
if self.dropout:
hidden = self.dropout(hidden)
return_data = {
"combiner_output": hidden,
"aggregated_attention_masks": aggregated_mask,
"attention_masks": masks,
}
if len(inputs) == 1:
for key, value in [d for d in inputs.values()][0].items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
@property
def output_shape(self) -> torch.Size:
return self.tabnet.output_shape
@register_combiner(TransformerCombinerConfig)
class TransformerCombiner(Combiner):
def __init__(
self, input_features: dict[str, "InputFeature"] = None, config: TransformerCombinerConfig = None, **kwargs
):
super().__init__(input_features)
self.name = "TransformerCombiner"
logger.debug(f" {self.name}")
self.reduce_output = config.reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=config.reduce_output,
max_sequence_length=len(input_features),
encoding_size=config.hidden_size,
)
if self.reduce_output is None:
self.supports_masking = True
# max sequence length for Transformer layer is number of input features
self.max_sequence_length = len(input_features)
logger.debug(" Projectors")
self.projectors = ModuleList(
# regardless of rank-2 or rank-3 input, torch.prod() calculates size
# after flattening the encoder output tensor
[
Linear(
torch.prod(torch.Tensor([*input_features.get(inp).output_shape])).type(torch.int32),
config.hidden_size,
)
for inp in input_features
]
)
logger.debug(" TransformerStack")
self.transformer_stack = TransformerStack(
input_size=config.hidden_size,
max_sequence_length=self.max_sequence_length,
hidden_size=config.hidden_size,
num_heads=config.num_heads,
output_size=config.transformer_output_size,
num_layers=config.num_layers,
dropout=config.dropout,
)
if self.reduce_output is not None:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.transformer_stack.output_shape[-1],
layers=config.fc_layers,
num_layers=config.num_fc_layers,
default_output_size=config.output_size,
default_use_bias=config.use_bias,
default_weights_initializer=config.weights_initializer,
default_bias_initializer=config.bias_initializer,
default_norm=config.norm,
default_norm_params=config.norm_params,
default_activation=config.fc_activation,
default_dropout=config.fc_dropout,
fc_residual=config.fc_residual,
)
def forward(
self,
inputs, # encoder outputs
) -> dict:
encoder_outputs = [inputs[k][ENCODER_OUTPUT] for k in inputs]
# ================ Flatten ================
batch_size = encoder_outputs[0].shape[0]
encoder_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in encoder_outputs]
# ================ Project & Concat ================
projected = [self.projectors[i](eo) for i, eo in enumerate(encoder_outputs)]
hidden = torch.stack(projected) # shape [num_eo, bs, h]
hidden = torch.permute(hidden, (1, 0, 2)) # shape [bs, num_eo, h]
# ================ Transformer Layers ================
hidden = self.transformer_stack(hidden)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden)
return_data = {"combiner_output": hidden}
if len(inputs) == 1:
for key, value in [d for d in inputs.values()][0].items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
@register_combiner(TabTransformerCombinerConfig)
class TabTransformerCombiner(Combiner):
def __init__(
self, input_features: dict[str, "InputFeature"] = None, config: TabTransformerCombinerConfig = None, **kwargs
):
super().__init__(input_features)
self.name = "TabTransformerCombiner"
logger.debug(f"Initializing {self.name}")
self.reduce_output = config.reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=config.reduce_output, max_sequence_length=len(input_features), encoding_size=config.hidden_size
)
self.supports_masking = True
self.embed_input_feature_name = config.embed_input_feature_name
if self.embed_input_feature_name:
vocab = [
i_f
for i_f in input_features
if input_features.get(i_f).type() not in {NUMBER, BINARY}
]
if self.embed_input_feature_name == "add":
self.embed_i_f_name_layer = Embed(vocab, config.hidden_size, force_embedding_size=True)
projector_size = config.hidden_size
elif isinstance(self.embed_input_feature_name, int):
if self.embed_input_feature_name > config.hidden_size:
raise ValueError(
"TabTransformer parameter "
"`embed_input_feature_name` "
"specified integer value ({}) "
"must not be larger than "
"`hidden_size` ({}).".format(self.embed_input_feature_name, config.hidden_size)
)
self.embed_i_f_name_layer = Embed(
vocab,
self.embed_input_feature_name,
force_embedding_size=True,
)
projector_size = config.hidden_size - self.embed_input_feature_name
else:
raise ValueError(
"TabTransformer parameter "
"`embed_input_feature_name` "
"should be either None, an integer or `add`, "
"the current value is "
"{}".format(self.embed_input_feature_name)
)
else:
projector_size = config.hidden_size
logger.debug(" Projectors")
self.unembeddable_features = []
self.embeddable_features = []
for i_f in input_features:
if input_features.get(i_f).type() in {NUMBER, BINARY}:
self.unembeddable_features.append(i_f)
else:
self.embeddable_features.append(i_f)
self.projectors = ModuleList()
for i_f in self.embeddable_features:
flatten_size = self.get_flatten_size(input_features.get(i_f).output_shape)
self.projectors.append(Linear(flatten_size[0], projector_size))
# input to layer_norm are the encoder outputs for unembeddable features,
# which are number or binary features. These should be 2-dim
# tensors. Size should be concatenation of these tensors.
concatenated_unembeddable_encoders_size = 0
for i_f in self.unembeddable_features:
concatenated_unembeddable_encoders_size += input_features.get(i_f).output_shape[0]
# Skip LayerNorm when normalizing a single value — LayerNorm(1) always
# outputs zero which kills gradients for all downstream parameters.
if concatenated_unembeddable_encoders_size > 1:
self.layer_norm = torch.nn.LayerNorm(concatenated_unembeddable_encoders_size)
else:
self.layer_norm = torch.nn.Identity()
logger.debug(" TransformerStack")
self.transformer_stack = TransformerStack(
input_size=config.hidden_size,
max_sequence_length=len(self.embeddable_features),
hidden_size=config.hidden_size,
# todo: can we just use projector_size? # hidden_size,
num_heads=config.num_heads,
output_size=config.transformer_output_size,
num_layers=config.num_layers,
dropout=config.dropout,
)
logger.debug(" FCStack")
# determine input size to fully connected layer based on reducer
if config.reduce_output == "concat":
fc_input_size = len(self.embeddable_features) * config.hidden_size
else:
fc_input_size = self.reduce_sequence.output_shape[-1] if len(self.embeddable_features) > 0 else 0
self.fc_stack = FCStack(
fc_input_size + concatenated_unembeddable_encoders_size,
layers=config.fc_layers,
num_layers=config.num_fc_layers,
default_output_size=config.output_size,
default_use_bias=config.use_bias,
default_weights_initializer=config.weights_initializer,
default_bias_initializer=config.bias_initializer,
default_norm=config.norm,
default_norm_params=config.norm_params,
default_activation=config.fc_activation,
default_dropout=config.fc_dropout,
fc_residual=config.fc_residual,
)
self._empty_hidden = torch.empty([1, 0])
self._embeddable_features_indices = torch.arange(0, len(self.embeddable_features))
# Create empty tensor of shape [1, 0] to use as hidden in case there are no category or numeric/binary features.
self.register_buffer("empty_hidden", self._empty_hidden)
self.register_buffer("embeddable_features_indices", self._embeddable_features_indices)
@staticmethod
def get_flatten_size(output_shape: torch.Size) -> torch.Size:
size = torch.prod(torch.Tensor([*output_shape]))
return torch.Size([size.type(torch.int32)])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
def forward(
self,
inputs: dict, # encoder outputs
) -> dict:
unembeddable_encoder_outputs = [inputs[k][ENCODER_OUTPUT] for k in inputs if k in self.unembeddable_features]
embeddable_encoder_outputs = [inputs[k][ENCODER_OUTPUT] for k in inputs if k in self.embeddable_features]
batch_size = (
embeddable_encoder_outputs[0].shape[0]
if len(embeddable_encoder_outputs) > 0
else unembeddable_encoder_outputs[0].shape[0]
)
# ================ Project & Concat embeddables ================
if len(embeddable_encoder_outputs) > 0:
# ============== Flatten =================
embeddable_encoder_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in embeddable_encoder_outputs]
projected = [self.projectors[i](eo) for i, eo in enumerate(embeddable_encoder_outputs)]
hidden = torch.stack(projected) # num_eo, bs, h
hidden = torch.permute(hidden, (1, 0, 2)) # bs, num_eo, h
if self.embed_input_feature_name:
i_f_names_idcs = torch.reshape(
torch.arange(0, len(embeddable_encoder_outputs), device=self.device), [-1, 1]
)
embedded_i_f_names = self.embed_i_f_name_layer(i_f_names_idcs)
embedded_i_f_names = torch.unsqueeze(embedded_i_f_names, dim=0)
embedded_i_f_names = torch.tile(embedded_i_f_names, [batch_size, 1, 1])
if self.embed_input_feature_name == "add":
hidden = hidden + embedded_i_f_names
else:
hidden = torch.cat([hidden, embedded_i_f_names], -1)
# ================ Transformer Layers ================
hidden = self.transformer_stack(hidden)
# ================ Sequence Reduction ================
hidden = self.reduce_sequence(hidden)
else:
# create empty tensor because there are no category features
hidden = torch.empty([batch_size, 0], device=self.device)
# ================ Flatten Skipped ================
if len(unembeddable_encoder_outputs) > 0:
unembeddable_encoder_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in unembeddable_encoder_outputs]
# ================ Concat Skipped ================
if len(unembeddable_encoder_outputs) > 1:
unembeddable_hidden = torch.cat(unembeddable_encoder_outputs, -1)
else:
unembeddable_hidden = list(unembeddable_encoder_outputs)[0]
unembeddable_hidden = self.layer_norm(unembeddable_hidden)
else:
# create empty tensor because there are no number/binary features
unembeddable_hidden = torch.tile(self.empty_hidden, [batch_size, 0])
# ================ Concat Skipped and Others ================
# When reduce_output is None, hidden is 3D [batch, seq, dim] but
# unembeddable_hidden is 2D [batch, dim]. Expand to match.
if hidden.dim() == 3 and unembeddable_hidden.dim() == 2:
unembeddable_hidden = unembeddable_hidden.unsqueeze(1).expand(-1, hidden.size(1), -1)
hidden = torch.cat([hidden, unembeddable_hidden], -1)
# ================ FC Layers ================
hidden = self.fc_stack(hidden)
return_data = {"combiner_output": hidden}
if len(inputs) == 1:
for key, value in [d for d in inputs.values()][0].items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
@register_combiner(ComparatorCombinerConfig)
class ComparatorCombiner(Combiner):
def __init__(
self,
input_features: dict[str, "InputFeature"],
config: ComparatorCombinerConfig = None,
**kwargs,
):
super().__init__(input_features)
self.name = "ComparatorCombiner"
logger.debug(f"Entering {self.name}")
self.entity_1 = config.entity_1
self.entity_2 = config.entity_2
self.required_inputs = set(config.entity_1 + config.entity_2)
self.output_size = config.output_size
self.fc_stack = None
# todo future: this may be redundant, check
fc_layers = config.fc_layers
if fc_layers is None and config.num_fc_layers is not None:
fc_layers = []
for _ in range(config.num_fc_layers):
fc_layers.append({"output_size": config.output_size})
if fc_layers is not None:
logger.debug("Setting up FCStack")
self.e1_fc_stack = FCStack(
self.get_entity_shape(config.entity_1)[-1],
layers=fc_layers,
num_layers=config.num_fc_layers,
default_output_size=config.output_size,
default_use_bias=config.use_bias,
default_weights_initializer=config.weights_initializer,
default_bias_initializer=config.bias_initializer,
default_norm=config.norm,
default_norm_params=config.norm_params,
default_activation=config.activation,
default_dropout=config.dropout,
)
self.e2_fc_stack = FCStack(
self.get_entity_shape(config.entity_2)[-1],
layers=fc_layers,
num_layers=config.num_fc_layers,
default_output_size=config.output_size,
default_use_bias=config.use_bias,
default_weights_initializer=config.weights_initializer,
default_bias_initializer=config.bias_initializer,
default_norm=config.norm,
default_norm_params=config.norm_params,
default_activation=config.activation,
default_dropout=config.dropout,
)
self.last_fc_layer_output_size = fc_layers[-1]["output_size"]
# todo: set initializer and regularization
self.register_buffer(
"bilinear_weights",
torch.randn([self.last_fc_layer_output_size, self.last_fc_layer_output_size], dtype=torch.float32),
)
def get_entity_shape(self, entity: list) -> torch.Size:
sizes = [torch.prod(torch.Tensor([*self.handle.input_features.get(k).output_shape])) for k in entity]
return torch.Size([torch.sum(torch.Tensor(sizes)).type(torch.int32)])
@property
def output_shape(self) -> torch.Size:
return torch.Size([2 * self.last_fc_layer_output_size + 2])
def forward(
self,
inputs: dict, # encoder outputs
) -> dict[str, torch.Tensor]: # combiner outputs
if inputs.keys() != self.required_inputs:
raise ValueError(f"Missing inputs {self.required_inputs - set(inputs.keys())}")
############
# Entity 1 #
############
e1_enc_outputs = [inputs[k][ENCODER_OUTPUT] for k in self.entity_1]
# ================ Flatten ================
batch_size = e1_enc_outputs[0].shape[0]
e1_enc_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in e1_enc_outputs]
# ================ Concat ================
if len(e1_enc_outputs) > 1:
e1_hidden = torch.cat(e1_enc_outputs, 1)
else:
e1_hidden = list(e1_enc_outputs)[0]
# ================ Fully Connected ================
e1_hidden = self.e1_fc_stack(e1_hidden) # [bs, output_size]
############
# Entity 2 #
############
e2_enc_outputs = [inputs[k][ENCODER_OUTPUT] for k in self.entity_2]
# ================ Flatten ================
batch_size = e2_enc_outputs[0].shape[0]
e2_enc_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in e2_enc_outputs]
# ================ Concat ================
if len(e2_enc_outputs) > 1:
e2_hidden = torch.cat(e2_enc_outputs, 1)
else:
e2_hidden = list(e2_enc_outputs)[0]
# ================ Fully Connected ================
e2_hidden = self.e2_fc_stack(e2_hidden) # [bs, output_size]
###########
# Compare #
###########
if e1_hidden.shape != e2_hidden.shape:
raise ValueError(
f"Mismatching shapes among dimensions! "
f"entity1 shape: {e1_hidden.shape} "
f"entity2 shape: {e2_hidden.shape}"
)
element_wise_mul = e1_hidden * e2_hidden # [bs, output_size]
dot_product = torch.sum(element_wise_mul, 1, keepdim=True) # [bs, 1]
abs_diff = torch.abs(e1_hidden - e2_hidden) # [bs, output_size]
bilinear_prod = torch.sum(
torch.mm(e1_hidden, self.bilinear_weights) * e2_hidden, dim=1, keepdim=True
) # [bs, 1]
logger.debug(
"preparing combiner output by concatenating these tensors: "
f"dot_product: {dot_product.shape}, element_wise_mul: {element_wise_mul.shape}"
f", abs_diff: {abs_diff.shape}, bilinear_prod: {bilinear_prod.shape}"
)
hidden = torch.cat([dot_product, element_wise_mul, abs_diff, bilinear_prod], 1) # [bs, 2 * output_size + 2]
return {"combiner_output": hidden}
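# Illustrative sketch (not part of Ludwig): the comparison step above, written in plain
# Python for a single pair of vectors so the layout of the combiner output is easier to see.
# The helper name `compare_vectors` and the list-of-lists `weights` argument are
# hypothetical; the real combiner works on batched torch tensors and a registered buffer.

```python
def compare_vectors(e1, e2, weights):
    """Return [dot, e1*e2 ..., |e1-e2| ..., bilinear] for two equal-length vectors."""
    if len(e1) != len(e2):
        raise ValueError(f"Mismatching lengths: {len(e1)} vs {len(e2)}")
    element_wise_mul = [a * b for a, b in zip(e1, e2)]
    dot_product = sum(element_wise_mul)
    abs_diff = [abs(a - b) for a, b in zip(e1, e2)]
    # Bilinear product: (e1 @ W) . e2, with W given as a square list of rows.
    projected = [sum(e1[i] * weights[i][j] for i in range(len(e1))) for j in range(len(e2))]
    bilinear_prod = sum(p * b for p, b in zip(projected, e2))
    return [dot_product] + element_wise_mul + abs_diff + [bilinear_prod]


identity = [[1.0, 0.0], [0.0, 1.0]]
features = compare_vectors([1.0, 2.0], [3.0, -1.0], identity)
# The result has 2 * n + 2 entries for vectors of length n.
```

# With an identity weight matrix the bilinear term equals the dot product, which makes
# the sketch easy to check by hand.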
@register_combiner(ProjectAggregateCombinerConfig)
class ProjectAggregateCombiner(Combiner):
def __init__(
self, input_features: dict[str, "InputFeature"] = None, config: ProjectAggregateCombinerConfig = None, **kwargs
):
super().__init__(input_features)
self.name = "ProjectAggregateCombiner"
logger.debug(f" {self.name}")
logger.debug(" Projectors")
self.projectors = ModuleList(
# regardless of rank-2 or rank-3 input, torch.prod() calculates size
# after flattening the encoder output tensor
[
Linear(
torch.prod(torch.Tensor([*input_features.get(inp).output_shape])).type(torch.int32),
config.projection_size,
)
for inp in input_features
]
)
self.fc_stack = None
# todo future: this may be redundant, check
fc_layers = config.fc_layers
if fc_layers is None and config.num_fc_layers is not None:
fc_layers = []
for _ in range(config.num_fc_layers):
fc_layers.append({"output_size": config.output_size})
self.fc_layers = fc_layers
if self.fc_layers is not None:
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=config.projection_size,
layers=config.fc_layers,
num_layers=config.num_fc_layers,
default_output_size=config.output_size,
default_use_bias=config.use_bias,
default_weights_initializer=config.weights_initializer,
default_bias_initializer=config.bias_initializer,
default_norm=config.norm,
default_norm_params=config.norm_params,
default_activation=config.activation,
default_dropout=config.dropout,
residual=config.residual,
)
if input_features and len(input_features) == 1 and self.fc_layers is None:
self.supports_masking = True
def forward(self, inputs: dict) -> dict: # encoder outputs
encoder_outputs = [inputs[k][ENCODER_OUTPUT] for k in inputs]
# ================ Flatten ================
batch_size = encoder_outputs[0].shape[0]
encoder_outputs = [torch.reshape(eo, [batch_size, -1]) for eo in encoder_outputs]
# ================ Project ================
projected = [self.projectors[i](eo) for i, eo in enumerate(encoder_outputs)]
hidden = torch.stack(projected)
hidden = torch.permute(hidden, (1, 0, 2)) # shape [bs, num_eo, h]
# ================ Aggregate ================
hidden = torch.mean(hidden, dim=1)
# ================ Fully Connected ================
if self.fc_stack is not None:
hidden = self.fc_stack(hidden)
return_data = {"combiner_output": hidden}
if len(inputs) == 1:
# Workaround for including additional tensors from output of input encoders for
# potential use in decoders, e.g. LSTM state for seq2seq.
# TODO(Justin): Think about how to make this communication work for multi-sequence
# features. Other combiners.
for key, value in [d for d in inputs.values()][0].items():
if key != ENCODER_OUTPUT:
return_data[key] = value
return return_data
================================================
FILE: ludwig/config_sampling/__init__.py
================================================
================================================
FILE: ludwig/config_sampling/explore_schema.py
================================================
import copy
import random
from collections import deque, namedtuple
from typing import Any, Deque
import pandas as pd
from ludwig.config_sampling.parameter_sampling import handle_property_type, ParameterBaseTypes
from ludwig.constants import SEQUENCE, TEXT, TIMESERIES
from ludwig.data.dataset_synthesizer import build_synthetic_dataset_df
from ludwig.schema.model_types.base import ModelConfig
from ludwig.types import ModelConfigDict
from ludwig.utils.misc_utils import merge_dict
# number of examples to generate for synthetic dataset
NUM_SYNTHETIC_EXAMPLES = 10
ConfigOption = namedtuple("ConfigOption", ["config_option", "fully_explored"])
def explore_properties(
jsonschema_properties: dict[str, Any],
parent_parameter_path: str,
dq: Deque[ConfigOption],
allow_list: list[str] | None = None,
) -> Deque[tuple[dict, bool]]:
"""Recursively explores the `properties` part of any subsection of the schema.
Args:
jsonschema_properties: any properties section of the schema.
parent_parameter_path: period-delimited list of parent dictionary keys up to the given jsonschema_properties
(e.g. defaults.number.preprocessing)
dq: deque that stores tuples of (config_options, fully_explored).
config_options is a dictionary mapping each parameter name to the list of values to explore.
fully_explored is a boolean value indicating whether all subsections of the properties dictionary have been
explored.
allow_list: list of top-level keys of the properties dictionary to explore. If empty, all keys are explored.
Returns:
A deque of (dict, bool) tuples.
- The first element of the tuple contains a dictionary of config options, which maps from a ludwig
config parameter to a list of the values to be explored for that parameter. Here's an example:
trainer.batch_size: ["auto", 2, 43]
trainer.learning_rate: ["auto", 0.1, 0.00002, 0.32424]
...
- The second element of the tuple is whether we've explored this "config path"
fully. This is important to track when recursing into nested structures.
"""
# processed_dq will contain complete config options with all the parameters in the properties dictionary
# dq will contain config options that are still being completed.
processed_dq = deque()
while dq and not dq[0].fully_explored:
for parameter_name_or_section, jsonschema_property in jsonschema_properties.items():
if allow_list and parameter_name_or_section not in allow_list:
continue
parameter_path = (
f"{parent_parameter_path}.{parameter_name_or_section}"
if parent_parameter_path
else parameter_name_or_section
)
config_options, _ = dq.popleft()
if "properties" in jsonschema_property and "allOf" in jsonschema_property:
for child_item in jsonschema_property["allOf"]:
expanded_config_options_dq = explore_from_all_of(
config_options=copy.deepcopy(config_options), item=child_item, key_so_far=parameter_path
)
# add returned child config options to the deque to be processed.
dq.extend(expanded_config_options_dq)
elif "properties" in jsonschema_property and "allOf" not in jsonschema_property:
# This is the case where we don't have a list of properties, just a properties
# dictionary nested inside another.
child_properties = jsonschema_property["properties"]
# a new deque holding the current config options, used to seed the recursive call
raw_entry = deque([ConfigOption(copy.deepcopy(config_options), False)])
child_config_options_dq = explore_properties(child_properties, parameter_path, raw_entry)
merged_config_options_dq = merge_dq(config_options, child_config_options_dq)
# add returned config options to the deque to be processed.
dq.extend(merged_config_options_dq)
else:
# this is the base case.
parameter_samples = get_samples(jsonschema_property)
if parameter_samples:
config_options[parameter_path] = parameter_samples
# add config_options back to queue. fully_explored = False because we still didn't finish
# exploring all the keys in the properties dictionary.
dq.appendleft(ConfigOption(config_options, False))
# at this point, we finished exploring all keys of the properties dictionary. Add all config options
# to the processed queue.
while dq:
config_options, _ = dq.popleft()
processed_dq.append(ConfigOption(config_options, True))
return processed_dq
def get_samples(jsonschema_property: dict[str, Any]) -> list[ParameterBaseTypes]:
"""Get possible values for a leaf property (no sub-properties).
Args:
jsonschema_property: leaf property in the schema. Has no sub-properties.
"""
if "oneOf" in jsonschema_property:
temp = []
for elem in jsonschema_property["oneOf"]:
temp += get_potential_values(elem)
return temp
else:
return get_potential_values(jsonschema_property)
def merge_dq(config_options: dict[str, Any], child_config_options_dq: Deque[ConfigOption]) -> Deque[ConfigOption]:
"""Merge config_options with the child_config_options in the dq."""
dq = deque()
while child_config_options_dq:
child_config_options, visited = child_config_options_dq.popleft()
cfg = merge_dict(child_config_options, config_options)
dq.append(ConfigOption(cfg, visited))
return dq
def explore_from_all_of(config_options: dict[str, Any], item: dict[str, Any], key_so_far: str) -> Deque[ConfigOption]:
"""Takes a child of `allOf` and calls `explore_properties` on it."""
for parameter_name_or_section in item["if"]["properties"]:
config_options[key_so_far + "." + parameter_name_or_section] = item["if"]["properties"][
parameter_name_or_section
]["const"]
jsonschema_properties = item["then"]["properties"]
raw_entry = deque([ConfigOption(copy.deepcopy(config_options), False)])
return explore_properties(jsonschema_properties, parent_parameter_path=key_so_far, dq=raw_entry)
def get_potential_values(item: dict[str, Any]) -> list[ParameterBaseTypes | list[ParameterBaseTypes]]:
"""Returns a list of values to explore for a config parameter.
Args:
item: config parameter-specific dictionary. Considered as a leaf in the schema. Contains type, default, and
parameter metadata, etc.
"""
temp = []
item_type = item.get("type")
if item_type is None:
# No explicit type — try to infer from enum/const/default
if "enum" in item:
return [v for v in item["enum"] if v is not None]
if "const" in item:
return [item["const"]]
if "default" in item:
return [item["default"]]
return []
# Case where `type` is a list of types (e.g. to allow batch size 'auto' and integers)
if isinstance(item_type, list):
for property_type in item_type:
temp += handle_property_type(property_type, item)
else:
temp += handle_property_type(item_type, item)
# Make sure values are unique. Not using set because some values are unhashable.
unique_temp = []
for temp_item in temp:
if temp_item not in unique_temp:
unique_temp.append(temp_item)
return unique_temp
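# Illustrative sketch (not part of Ludwig): the order-preserving de-duplication used above,
# isolated in a hypothetical helper `unique_in_order`. It relies on `==` rather than hashing,
# so list-valued samples are handled, at the cost of quadratic time in the number of values.

```python
def unique_in_order(values):
    """Drop repeated values while keeping first-seen order; works for unhashable items."""
    unique = []
    for value in values:
        if value not in unique:
            unique.append(value)
    return unique


samples = unique_in_order(["auto", 2, [1, 2], 2, [1, 2], None, "auto"])
```

# One caveat of comparing with `==`: Python treats True == 1 and False == 0, so a boolean
# and an equal integer are considered duplicates and only the first one is kept.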
def generate_possible_configs(config_options: dict[str, Any]):
"""Generate configs that together cover every value in config_options.
This function does not take a cross product of all the options for all the config parameters. It selects parameter
values independently from each other, so the number of configs equals the length of the longest list of values.
Args:
config_options: dictionary mapping from ludwig config parameter to all values to be explored.
Here's an example of what it could look like:
trainer.batch_size: ["auto", 2, 43]
trainer.learning_rate: ["auto", 0.1, 0.00002, 0.32424]
...
"""
# The number of configs to generate is the max length of the lists of samples over all parameters.
num_configs = 1
for parameter_name in config_options:
if isinstance(config_options[parameter_name], list):
num_configs = max(num_configs, len(config_options[parameter_name]))
config_options[parameter_name] = deque(config_options[parameter_name])
for _ in range(num_configs):
config = {}
for parameter_name in config_options:
# if parameter is regular parameter with explored values.
if config_options[parameter_name] and not isinstance(config_options[parameter_name], str):
config[parameter_name] = config_options[parameter_name].popleft()
# case for parameters where we don't have choices such as `encoder.type: parallel_cnn` that
# cause the downstream parameters to change.
elif isinstance(config_options[parameter_name], str):
config[parameter_name] = config_options[parameter_name]
yield create_nested_dict(config)
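# Illustrative sketch (not part of Ludwig): how independent selection differs from a cross
# product. The helper `sketch_generate_configs` is hypothetical; it returns flat dicts and
# copies the lists into deques instead of mutating the caller's dictionary.

```python
from collections import deque


def sketch_generate_configs(config_options):
    """Pop one value per parameter per config until the longest list is exhausted."""
    queues = {
        name: deque(values) if isinstance(values, list) else values
        for name, values in config_options.items()
    }
    lengths = [len(q) for q in queues.values() if isinstance(q, deque)]
    num_configs = max(lengths + [1])
    configs = []
    for _ in range(num_configs):
        config = {}
        for name, q in queues.items():
            if isinstance(q, str):
                config[name] = q  # fixed choice, repeated in every config
            elif isinstance(q, deque) and q:
                config[name] = q.popleft()  # each sampled value is used exactly once
        configs.append(config)
    return configs


options = {
    "trainer.batch_size": ["auto", 2, 43],
    "trainer.learning_rate": [0.1, 0.00002],
    "encoder.type": "parallel_cnn",
}
configs = sketch_generate_configs(options)
```

# Three configs are produced rather than the six a cross product would give. Once a shorter
# list runs out, later configs simply omit that parameter.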
def create_nested_dict(flat_dict: dict[str, float | str]) -> ModelConfigDict:
"""Generate a nested dict out of a flat dict whose keys are delimited by a delimiter character.
Args:
flat_dict: potential generated baseline config. Here's an example of what it could look like:
trainer.batch_size: 324
trainer.learning_rate: 0.0635
The expected output would be
trainer:
batch_size: 324
learning_rate: 0.0635
"""
def to_nested_format(parameter_name: str, value: str | int | float, delimiter: str = ".") -> dict[str, Any]:
# https://stackoverflow.com/a/40401961
split_parameter_name = parameter_name.split(delimiter)
for parameter_name_or_section in reversed(split_parameter_name):
value = {parameter_name_or_section: value}
return value
config = {}
for parameter_name_or_section in flat_dict:
config = merge_dict(
config, to_nested_format(parameter_name_or_section, copy.deepcopy(flat_dict[parameter_name_or_section]))
)
return config
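# Illustrative sketch (not part of Ludwig): flat-to-nested conversion on a small input.
# `sketch_merge` is a hypothetical recursive merge standing in for
# ludwig.utils.misc_utils.merge_dict, whose exact behavior is assumed here, and
# `sketch_nest` mirrors the reversed-key wrapping used above.

```python
def sketch_merge(base, update):
    """Recursively merge `update` into a copy of `base`."""
    merged = dict(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = sketch_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def sketch_nest(flat, delimiter="."):
    """Turn {'a.b': 1} into {'a': {'b': 1}}, merging keys that share a prefix."""
    nested = {}
    for dotted_key, value in flat.items():
        # Wrap the value from the innermost key outwards.
        for part in reversed(dotted_key.split(delimiter)):
            value = {part: value}
        nested = sketch_merge(nested, value)
    return nested


nested = sketch_nest(
    {"trainer.batch_size": 324, "trainer.learning_rate": 0.0635, "combiner.type": "concat"}
)
```

# Keys that share the `trainer` prefix end up in the same sub-dictionary because each
# single-path dict is merged into the accumulated result rather than assigned over it.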
def combine_configs(
explored: Deque[tuple[dict, bool]], config: ModelConfigDict
) -> list[tuple[ModelConfigDict, pd.DataFrame]]:
"""Merge base config with explored sections.
Args:
explored: deque containing all the config options.
config: base Ludwig config to merge the explored configs with.
"""
dataset = build_synthetic_dataset_df(NUM_SYNTHETIC_EXAMPLES, config)
ret = []
for config_options, _ in explored:
for default_config in generate_possible_configs(config_options=config_options):
merged_config = merge_dict(copy.deepcopy(config), default_config)
try:
ModelConfig.from_dict(merged_config)
ret.append((merged_config, dataset))
except Exception:
pass
return ret
def combine_configs_for_comparator_combiner(
explored: Deque[tuple], config: ModelConfigDict
) -> list[tuple[ModelConfigDict, pd.DataFrame]]:
"""Merge base config with explored sections.
Completes the entity_1 and entity_2 parameters of the comparator combiner.
Args:
explored: deque containing all the config options.
config: base Ludwig config to merge the explored configs with.
"""
dataset = build_synthetic_dataset_df(NUM_SYNTHETIC_EXAMPLES, config)
ret = []
for item in explored:
for default_config in generate_possible_configs(config_options=item[0]):
merged_config = merge_dict(copy.deepcopy(config), default_config)
# create two random lists for entity1 and entity2
entity_names = [feature["name"] for feature in config["input_features"]]
random.shuffle(entity_names)
entity_1_size = random.randint(1, len(entity_names) - 1)
merged_config["combiner"]["entity_1"] = entity_names[:entity_1_size]
merged_config["combiner"]["entity_2"] = entity_names[entity_1_size:]
try:
ModelConfig.from_dict(merged_config)
ret.append((merged_config, dataset))
except Exception:
pass
return ret
def combine_configs_for_sequence_combiner(
explored: Deque[tuple], config: ModelConfigDict
) -> list[tuple[ModelConfigDict, pd.DataFrame]]:
"""Merge base config with explored sections.
Uses the right reduce_output strategy for the sequence and sequence_concat combiners.
Args:
explored: deque containing all the config options.
config: base Ludwig config to merge the explored configs with.
"""
dataset = build_synthetic_dataset_df(NUM_SYNTHETIC_EXAMPLES, config)
ret = []
for item in explored:
for default_config in generate_possible_configs(config_options=item[0]):
merged_config = merge_dict(copy.deepcopy(config), default_config)
for i in range(len(merged_config["input_features"])):
if merged_config["input_features"][i]["type"] in {SEQUENCE, TEXT, TIMESERIES}:
merged_config["input_features"][i]["encoder"] = {"type": "embed", "reduce_output": None}
try:
ModelConfig.from_dict(merged_config)
ret.append((merged_config, dataset))
except Exception:
pass
return ret
================================================
FILE: ludwig/config_sampling/parameter_sampling.py
================================================
import random
from typing import Any, Union
from ludwig.schema.metadata.parameter_metadata import ExpectedImpact
# base types for ludwig config parameters.
ParameterBaseTypes = Union[str, float, int, bool, None]
def handle_property_type(
property_type: str, item: dict[str, Any], expected_impact: ExpectedImpact = ExpectedImpact.HIGH
) -> list[ParameterBaseTypes | list[ParameterBaseTypes]]:
"""Return possible parameter values for a parameter type.
Args:
property_type: type of the parameter (e.g. array, number, etc.)
item: dictionary containing details on the parameter such as default, min and max values.
expected_impact: threshold expected impact that we'd like to include.
"""
parameter_metadata = item.get("parameter_metadata", None)
if not parameter_metadata:
return []
# don't explore internal only parameters.
if parameter_metadata.get("internal_only", True):
return []
# don't explore parameters whose expected impact is below the requested threshold.
if parameter_metadata.get("expected_impact", ExpectedImpact.LOW) < expected_impact:
return []
if property_type == "number":
return explore_number(item)
elif property_type == "integer":
return explore_integer(item)
elif property_type == "string":
return explore_string(item)
elif property_type == "boolean":
return explore_boolean()
elif property_type == "null":
return explore_null()
elif property_type == "array":
return explore_array(item)
else:
return []
def explore_array(item: dict[str, Any]) -> list[list[ParameterBaseTypes]]:
"""Return possible parameter values for the `array` parameter type.
Args:
item: dictionary containing details on the parameter such as default, min and max values.
"""
candidates = []
if "default" in item and item["default"]:
candidates.append(item["default"])
item_choices = []
maxlen = 0
# In the case where the length of the array isn't defined.
if not isinstance(item["items"], list):
return []
for item_of in item["items"]:
choices = handle_property_type(item_of["type"], item_of)
maxlen = max(maxlen, len(choices))
item_choices.append(choices)
# pad to same length
for i in range(len(item_choices)):
item_choices[i] = maxlen * item_choices[i]
item_choices[i] = item_choices[i][:maxlen]
merged = list(zip(*item_choices)) + candidates
return [list(tup) for tup in merged]
def explore_number(item: dict[str, Any]) -> list[ParameterBaseTypes]:
"""Return possible parameter values for the `number` parameter type.
Args:
item: dictionary containing details on the parameter such as default, min and max values.
TODO(Wael): Improve logic.
"""
minimum, maximum = 0, 1
if "default" not in item or item["default"] is None:
candidates = []
else:
candidates = [1, 2, item["default"], 2 * (item["default"] + 1), item["default"] // 2, -1 * item["default"]]
if "minimum" in item:
minimum = item["minimum"]
candidates = [num for num in candidates if num > minimum]
if "maximum" in item:
maximum = item["maximum"]
candidates = [num for num in candidates if num < maximum]
return candidates + [random.random() * 0.99 * maximum]
def explore_integer(item: dict[str, Any]) -> list[ParameterBaseTypes]:
"""Return possible parameter values for the `integer` parameter type.
Args:
item: dictionary containing details on the parameter such as default, min and max values.
TODO(Wael): Improve logic.
"""
minimum, maximum = 0, 10
if "default" not in item or item["default"] is None:
candidates = []
else:
candidates = [item["default"], 2 * (item["default"] + 1), item["default"] // 2, -1 * item["default"]]
if "minimum" in item:
minimum = item["minimum"]
candidates = [num for num in candidates if num >= item["minimum"]]
if "maximum" in item:
maximum = item["maximum"]
candidates = [num for num in candidates if num <= item["maximum"]]
return candidates + [random.randint(minimum, maximum)]
def explore_string(item: dict[str, Any]) -> list[ParameterBaseTypes]:
"""Return possible parameter values for the `string` parameter type.
Args:
item: dictionary containing details on the parameter such as default, min and max values.
"""
if "enum" in item:
return item["enum"]
return [item["default"]]
def explore_boolean() -> list[bool]:
"""Return possible parameter values for the `boolean` parameter type (i.e. [True, False])"""
return [True, False]
def explore_null() -> list[None]:
"""Return possible parameter values for the `null` parameter type (i.e. [None])"""
return [None]
================================================
FILE: ludwig/config_validation/__init__.py
================================================
================================================
FILE: ludwig/config_validation/checks.py
================================================
"""Checks that are not easily covered by marshmallow JSON schema validation like parameter interdependencies."""
from abc import ABC, abstractmethod
from collections.abc import Callable
from re import findall
from typing import TYPE_CHECKING
from transformers import AutoConfig
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUDIO,
BINARY,
IMAGE,
IN_MEMORY,
MIN_QUANTIZATION_BITS_FOR_MERGE_AND_UNLOAD,
MODEL_ECD,
MODEL_LLM,
SEQUENCE,
SET,
TEXT,
TIMESERIES,
VECTOR,
)
from ludwig.error import ConfigValidationError
from ludwig.utils.metric_utils import get_feature_to_metric_names_map_from_feature_collection
from ludwig.utils.misc_utils import merge_dict
if TYPE_CHECKING:
from ludwig.schema.model_config import ModelConfig
# Set of all sequence feature types.
SEQUENCE_OUTPUT_FEATURE_TYPES = {SEQUENCE, TEXT, SET, VECTOR}
class ConfigCheckRegistry:
"""A registry of configuration checks."""
def __init__(self):
self._registry = []
def register(self, check_fn):
self._registry.append(check_fn)
def check_config(self, config: "ModelConfig") -> None: # noqa: F821
for check_fn in self._registry:
check_fn(config)
_CONFIG_CHECK_REGISTRY = ConfigCheckRegistry()
def get_config_check_registry():
"""Returns the config check registry."""
return _CONFIG_CHECK_REGISTRY
@DeveloperAPI
def register_config_check(fn) -> Callable:
"""Registers a config check function."""
_CONFIG_CHECK_REGISTRY.register(fn)
# Return the function so the decorated name stays bound to it instead of None.
return fn
class ConfigCheck(ABC):
"""Checks instances of comprehensive (all parameters and defaults filled in) schema-validated config."""
@staticmethod
@abstractmethod
def check(config: "ModelConfig") -> None: # noqa: F821
"""Checks config for validity."""
raise NotImplementedError
@register_config_check
def check_feature_names_unique(config: "ModelConfig") -> None: # noqa: F821
"""Checks that all feature names are unique."""
input_features = config.input_features
input_feature_names = {input_feature.name for input_feature in input_features}
output_features = config.output_features
output_feature_names = {output_feature.name for output_feature in output_features}
if len(input_feature_names) + len(output_feature_names) != len(input_features) + len(output_features):
raise ConfigValidationError("Feature names must be unique.")
@register_config_check
def check_tied_features_valid(config: "ModelConfig") -> None: # noqa: F821
"""Checks that all tied features are valid."""
input_features = config.input_features
input_feature_names = {input_feature.name for input_feature in input_features}
for input_feature in input_features:
if input_feature.tied and input_feature.tied not in input_feature_names:
raise ConfigValidationError(
f"Feature {input_feature.name} is tied to feature {input_feature.tied}, but the "
f"'{input_feature.tied}' feature does not exist."
)
@register_config_check
def check_training_runway(config: "ModelConfig") -> None: # noqa: F821
"""Checks that checkpoints_per_epoch and steps_per_checkpoint aren't simultaneously defined."""
if config.model_type == MODEL_ECD:
if config.trainer.checkpoints_per_epoch != 0 and config.trainer.steps_per_checkpoint != 0:
raise ConfigValidationError(
"It is invalid to specify both trainer.checkpoints_per_epoch AND "
"trainer.steps_per_checkpoint. Please specify one or the other, or specify neither to "
"checkpoint/eval the model every epoch."
)
@register_config_check
def check_ray_backend_in_memory_preprocessing(config: "ModelConfig") -> None: # noqa: F821
"""Checks that in memory preprocessing is used with Ray backend."""
if config.backend is None:
return
if not hasattr(config.trainer, "preprocessing") or not hasattr(config.trainer.preprocessing, IN_MEMORY):
return
if config.backend.type == "ray" and not config.trainer.preprocessing.in_memory:
raise ConfigValidationError(
"RayBackend does not support lazy loading of data files at train time. "
"Set preprocessing config `in_memory: True`"
)
for input_feature in config.input_features:
if input_feature.type == AUDIO or input_feature.type == IMAGE:
if not input_feature.preprocessing.in_memory and config.backend.type == "ray":
raise ConfigValidationError(
"RayBackend does not support lazy loading of data files at train time. "
f"Set preprocessing config `in_memory: True` for input feature {input_feature.name}"
)
def check_sequence_concat_combiner_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that sequence concat combiner has at least one input feature that's sequential."""
if config.model_type != MODEL_ECD:
return
if config.combiner.type != "sequence_concat":
return
has_sequence_input = False
for input_feature in config.input_features:
if input_feature.type in SEQUENCE_OUTPUT_FEATURE_TYPES:
has_sequence_input = True
break
if not has_sequence_input:
raise ConfigValidationError(
"Sequence concat combiner requires at least one sequential input feature."
)
@register_config_check
def check_comparator_combiner_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that all of the feature names for entity_1 and entity_2 are valid features."""
if config.model_type != MODEL_ECD:
return
if config.combiner.type != "comparator":
return
input_feature_names = [input_feature.name for input_feature in config.input_features]
for feature_name in config.combiner.entity_1:
if feature_name not in input_feature_names:
raise ConfigValidationError(
f"Feature {feature_name} in entity_1 for the comparator combiner is not a valid input feature name."
)
for feature_name in config.combiner.entity_2:
if feature_name not in input_feature_names:
raise ConfigValidationError(
f"Feature {feature_name} in entity_2 for the comparator combiner is not a valid input feature name."
)
if sorted(config.combiner.entity_1 + config.combiner.entity_2) != sorted(input_feature_names):
raise ConfigValidationError("Not all input features are present as entities in the comparator combiner.")
@register_config_check
def check_class_balance_preprocessing(config: "ModelConfig") -> None: # noqa: F821
"""Class balancing is only available for datasets with a single output feature."""
if config.preprocessing.oversample_minority or config.preprocessing.undersample_majority:
if len(config.output_features) != 1:
raise ConfigValidationError("Class balancing is only available for datasets with a single output feature.")
if config.output_features[0].type != BINARY:
raise ConfigValidationError("Class balancing is only supported for binary output features.")
@register_config_check
def check_sampling_exclusivity(config: "ModelConfig") -> None: # noqa: F821
"""Oversample minority and undersample majority are mutually exclusive."""
if config.preprocessing.oversample_minority and config.preprocessing.undersample_majority:
raise ConfigValidationError(
"Oversample minority and undersample majority are mutually exclusive. Specify only one method."
)
@register_config_check
def check_validation_metric_exists(config: "ModelConfig") -> None: # noqa: F821
"""Checks that the specified validation metric exists."""
validation_metric_name = config.trainer.validation_metric
# Get all valid metrics.
feature_to_metric_names_map = get_feature_to_metric_names_map_from_feature_collection(config.output_features)
all_valid_metrics = set()
for metric_names in feature_to_metric_names_map.values():
all_valid_metrics.update(metric_names)
if validation_metric_name not in all_valid_metrics:
raise ConfigValidationError(
f"User-specified trainer.validation_metric '{validation_metric_name}' is not valid. "
f"Available metrics are: {all_valid_metrics}"
)
@register_config_check
def check_splitter(config: "ModelConfig") -> None: # noqa: F821
"""Checks the validity of the splitter configuration."""
from ludwig.data.split import get_splitter
splitter = get_splitter(**config.preprocessing.split.to_dict())
splitter.validate(config)
@register_config_check
def check_hf_tokenizer_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that the HuggingFace tokenizer has a pretrained_model_name_or_path specified."""
for input_feature in config.input_features:
if input_feature.type == TEXT:
if input_feature.preprocessing.tokenizer == "hf_tokenizer":
if input_feature.preprocessing.pretrained_model_name_or_path is None:
raise ConfigValidationError(
"Pretrained model name or path must be specified for HuggingFace tokenizer."
)
@register_config_check
def check_hf_encoder_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that a HuggingFace encoder has a pretrained_model_name_or_path specified."""
for input_feature in config.input_features:
if input_feature.type == TEXT:
if hasattr(input_feature.encoder, "use_pretrained"):
if input_feature.preprocessing.pretrained_model_name_or_path is None:
raise ConfigValidationError(
"Pretrained model name or path must be specified for HuggingFace encoder."
)
@register_config_check
def check_stacked_transformer_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that the transformer encoder type correctly configures `num_heads` and `hidden_size`"""
def is_divisible(hidden_size: int, num_heads: int) -> bool:
"""Checks that hidden_size is divisible by num_heads."""
return hidden_size % num_heads == 0
sequence_types = [SEQUENCE, TEXT, TIMESERIES]
for input_feature in config.input_features:
if_type = input_feature.type
encoder = input_feature.encoder
if (
if_type in sequence_types
and encoder.type == "transformer"
and not is_divisible(encoder.hidden_size, encoder.num_heads)
):
raise ConfigValidationError(
f"Input feature {input_feature.name} transformer encoder requires encoder.hidden_size to be divisible "
f"by encoder.num_heads. Found hidden_size {encoder.hidden_size} and num_heads {encoder.num_heads}."
)
@register_config_check
def check_hyperopt_search_algorithm_dependencies_installed(config: "ModelConfig") -> None: # noqa: F821
"""Check that the hyperopt search algorithm dependencies are installed."""
if config.hyperopt is None:
return
try:
config.hyperopt.search_alg.dependencies_installed()
except ImportError as e:
raise ConfigValidationError(e.msg)
@register_config_check
def check_hyperopt_scheduler_dependencies_installed(config: "ModelConfig") -> None: # noqa: F821
"""Check that the hyperopt scheduler dependencies are installed."""
if config.hyperopt is None:
return
try:
config.hyperopt.executor.scheduler.dependencies_installed()
except ImportError as e:
raise ConfigValidationError(e.msg)
@register_config_check
def check_tagger_decoder_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that the tagger decoder has at least one sequence, text or timeseries input feature where the
encoder's reduce_output will produce a 3D shaped output from the combiner."""
# Check if there is a text or sequence output feature using a tagger decoder
output_feature_with_tagger_decoder = False
for output_feature in config.output_features:
if output_feature.type in {TEXT, SEQUENCE} and output_feature.decoder.type == "tagger":
output_feature_with_tagger_decoder = True
if not output_feature_with_tagger_decoder:
return
# Check that there is at least one sequence, text or timeseries input feature that doesn't reduce the
# output of the encoder.
has_sequence_feature = False
for input_feature in config.input_features:
if input_feature.type in {SEQUENCE, TEXT, TIMESERIES}:
has_sequence_feature = True
if input_feature.encoder.reduce_output is None:
return
if not has_sequence_feature:
raise ConfigValidationError("Tagger decoder requires at least one text, sequence or timeseries input feature.")
else:
raise ConfigValidationError(
"Tagger decoder requires at least one of the text, sequence or timeseries input feature encoders to have "
"`reduce_output` set to `None`."
)
@register_config_check
def check_hyperopt_parameter_dicts(config: "ModelConfig") -> None: # noqa: F821
"""Checks for hyperopt parameter dicts against their config objects."""
if config.hyperopt is None:
return
from ludwig.schema.hyperopt.utils import get_parameter_cls, parameter_config_registry # noqa: F401
for parameter, space in config.hyperopt.parameters.items():
# skip nested hyperopt parameters
if parameter != ".":
parameter_attribute_path = parameter.split(".")
passed = False
for root in [config, config.input_features, config.output_features]:
current = root
for p in parameter_attribute_path:
try:
current = current.__getattribute__(p)
if p == parameter_attribute_path[-1]:
passed = True
except AttributeError:
break
if passed:
break
if not passed:
raise ConfigValidationError(
f"The supplied hyperopt parameter {parameter} is not a valid config field. Check the Ludwig "
"docs for the list of valid parameters."
)
try:
space_cls = get_parameter_cls(space["space"])
space_cls.from_dict(space)
except KeyError:
space_types = ", ".join(parameter_config_registry.keys())
raise ConfigValidationError(
f"Invalid hyperopt parameter space requested for `hyperopt.parameters.{parameter}`. Valid spaces "
f"are {space_types}."
)
@register_config_check
def check_concat_combiner_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that if the concat combiner receives a mixture of sequence and non-sequence features, all sequence
features are configured with reduce_output so that they produce 2D tensors."""
if config.model_type != MODEL_ECD:
return
if config.combiner.type != "concat":
return
has_unreduced_sequence_feature = False
has_non_sequence_feature = False
for input_feature in config.input_features:
if (
input_feature.type in {SEQUENCE, TEXT, TIMESERIES}
and hasattr(input_feature.encoder, "reduce_output")
and input_feature.encoder.reduce_output is None
):
has_unreduced_sequence_feature = True
else:
has_non_sequence_feature = True
if has_unreduced_sequence_feature and has_non_sequence_feature:
raise ConfigValidationError(
"The concat combiner cannot receive a mix of unreduced sequence features (3D) and non-sequence features "
"(2D). Options: 1) Set reduce_output in sequence feature encoders to a value other than None to ensure 2D "
"encoder outputs, 2) Choose a different combiner like `sequence_concat` which can handle a mix of 2D and "
"3D encoder output shapes, or 3) Remove features to ensure that output shapes from all encoders are the "
"same dimension (all 2D or all 3D)."
)
@register_config_check
def check_hyperopt_nested_parameter_dicts(config: "ModelConfig") -> None: # noqa: F821
"""Checks that all nested parameters in a hyperopt config exist."""
if config.hyperopt is None or "." not in config.hyperopt.parameters:
return
from ludwig.schema.hyperopt.utils import get_parameter_cls # noqa: F401
from ludwig.schema.model_types.base import ModelConfig
space = config.hyperopt.parameters["."]
# Build the config that each category's parameter dict would produce so the merged result can be validated.
config_dict = config.to_dict()
del config_dict["hyperopt"]
for category in space["categories"]:
for i, k in enumerate(category.keys()):
try:
config.__getattribute__(k)
except AttributeError:
raise ConfigValidationError(f"Invalid config block {k} in nested hyperopt parameter dict {i}: {space}.")
category_dict = merge_dict(config_dict, category)
try:
ModelConfig.from_dict(category_dict)
except ConfigValidationError as e:
raise ConfigValidationError(f"Invalid config in hyperopt nested parameter config: {category}. {e.message}")
try:
space_cls = get_parameter_cls("choice")
space_cls.from_dict(space)
except KeyError:
raise ConfigValidationError(
f"Nested hyperparameter search spaces must be of type 'choice'. Requested space type: {space['space']}"
)
@register_config_check
def check_llm_exactly_one_input_text_feature(config: "ModelConfig"): # noqa: F821
"""Checks that LLM models have exactly one input feature and that it is a text feature."""
if config.model_type != MODEL_LLM:
return
if len(config.input_features) == 1 and config.input_features[0].type == TEXT:
return
else:
raise ConfigValidationError("LLM requires exactly one text input feature.")
@register_config_check
def check_llm_finetuning_output_feature_config(config: "ModelConfig"): # noqa: F821
"""Checks that the output feature config for LLM finetuning is valid."""
if config.model_type != MODEL_LLM:
return
if config.trainer.type != "finetune":
return
if config.output_features[0].type != TEXT:
raise ConfigValidationError(
"LLM finetuning requires the output feature to be a text feature. If you are trying to use a different "
"output feature type such as category or binary, please change the output feature type to text."
)
@register_config_check
def check_llm_finetuning_trainer_config(config: "ModelConfig"): # noqa: F821
"""Ensures that trainer type is finetune if adapter is not None."""
if config.model_type != MODEL_LLM:
return
if (
config.trainer.type == "none"
and config.adapter is not None
and config.adapter.pretrained_adapter_weights is not None
):
# If performing zero-shot, we must specify pretrained adapter weights
return
if config.adapter is not None and config.trainer.type != "finetune":
raise ConfigValidationError("LLM finetuning requires trainer type to be finetune.")
@register_config_check
def check_llm_finetuning_backend_config(config: "ModelConfig"): # noqa: F821
"""Checks that the LLM finetuning using Ray is configured correctly.
DDP strategy is not supported for LLM finetuning because it leads to OOMs since the model is large and DDP strategy
requires a copy of the model on each GPU.
"""
if config.model_type != MODEL_LLM:
return
# LLM finetuning is only supported by the finetune trainer type
if (
config.trainer.type != "finetune"
and config.adapter is not None
and config.adapter.pretrained_adapter_weights is not None
):
return
# Using local backend, so skip the checks below
if not hasattr(config.backend, "type"):
return
backend = config.backend
if not hasattr(backend.trainer, "strategy") or backend.trainer.strategy != "deepspeed":
raise ConfigValidationError("LLM finetuning with Ray requires the DeepSpeed strategy.")
# Deepspeed requires GPU
if not backend.trainer.use_gpu or backend.trainer.resources_per_worker.GPU < 1:
raise ConfigValidationError("LLM finetuning with DeepSpeed requires GPU.")
@register_config_check
def check_llm_finetuning_adalora_config(config: "ModelConfig"):
"""Checks that the adalora adapter is configured correctly.
We check against PEFT's predefined target module list for ADALORA to see if this target_modules is present there. If
not, AdaloraModel will run into issues downstream.
"""
if config.model_type != MODEL_LLM:
return
if not config.adapter:
return
if config.adapter.type != "adalora":
return
from peft.utils import TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING
model_config = _get_llm_model_config(config.base_model)
if model_config.model_type not in TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING:
raise ConfigValidationError(
f"Adalora adapter is not supported for {model_config.model_type} model. "
f"Supported model types are: {list(TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING.keys())}. "
"If you know the target modules for your model, please specify them in the config through the "
"`target_modules` key."
)
@register_config_check
def check_llm_finetuning_adaption_prompt_parameters(config: "ModelConfig"):
"""Checks that the adaption_prompt adapter is configured correctly.
Adaption prompt is only supported for Llama models.
"""
if config.model_type != MODEL_LLM:
return
if not config.adapter:
return
if config.adapter.type != "adaption_prompt":
return
from peft.tuners.adaption_prompt.config import TRANSFORMERS_MODEL_CONFIG
# Adaption Config is currently only supported for Llama model types
model_config = _get_llm_model_config(config.base_model)
if model_config.model_type not in TRANSFORMERS_MODEL_CONFIG:
raise ConfigValidationError(
f"Adaption prompt adapter is not supported for {model_config.model_type} model. "
f"Supported model types are: {list(TRANSFORMERS_MODEL_CONFIG.keys())}."
)
def _get_llm_model_config(model_name: str) -> AutoConfig:
"""Returns the LLM model config."""
return AutoConfig.from_pretrained(model_name)
# TODO(geoffrey, arnav): uncomment this when we have reconciled the config with the backend kwarg in api.py
# @register_config_check
def check_llm_quantization_backend_incompatibility(config: "ModelConfig") -> None: # noqa: F821
"""Checks that LLM model type with quantization uses the local backend."""
if config.model_type != MODEL_LLM:
return
if config.quantization is None:
return
backend_type = None
if config.backend:
backend_type = config.backend.get("type", None)
# If backend was explicitly set to Ray, then we need to raise an error
if backend_type == "ray":
raise ConfigValidationError(f"LLM with quantization requires the 'local' backend, found: '{backend_type}'")
# If the backend is not explicitly set, then we need to check if a Ray process is running
# If a Ray process is running, then we need to raise an error because the backend will be set to Ray
if config.backend is None:
try:
# May not be installed, so we need to catch the ImportError
import ray
if ray.is_initialized():
raise ConfigValidationError(
"LLM with quantization requires the 'local' backend, but backend will be set "
"to Ray since Ray is already running locally."
)
except ImportError:
pass
@register_config_check
def check_llm_text_encoder_is_not_used_with_ecd(config: "ModelConfig") -> None:
"""Checks that a pretrained text encoder is not used for ECD models with a text output feature."""
if config.model_type != MODEL_ECD:
return
if config.input_features[0].type != TEXT:
return
if config.output_features[0].type != TEXT:
return
if (
hasattr(config.input_features[0].encoder, "pretrained_model_name_or_path")
and config.input_features[0].encoder.pretrained_model_name_or_path
):
raise ConfigValidationError("Please use the `model_type: llm` for text-to-text models.")
@register_config_check
def check_qlora_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that all the necessary settings are in place for QLoRA."""
if config.model_type != MODEL_LLM or config.trainer.type == "none":
return
if config.quantization and (not config.adapter or config.adapter.type != "lora"):
raise ConfigValidationError("Fine-tuning an LLM with quantization requires using the 'lora' adapter")
@register_config_check
def check_qlora_merge_and_unload_compatibility(config: "ModelConfig") -> None: # noqa: F821
"""Checks that model.merge_and_unload() is supported by underlying model.save_pretrained() when merging QLoRA
layers."""
if config.model_type != MODEL_LLM or config.trainer.type == "none":
return
if not (
config.adapter
and config.adapter.type in ["lora", "adalora"]
and config.adapter.postprocessor
and config.adapter.postprocessor.merge_adapter_into_base_model
and config.quantization
):
return
if config.quantization.bits < MIN_QUANTIZATION_BITS_FOR_MERGE_AND_UNLOAD:
raise ConfigValidationError(
f"""This operation will entail merging LoRA layers on a {config.quantization.bits}-bit \
quantized model. Calling "save_pretrained()" on that model is currently unsupported. If you want to merge the LoRA \
adapter weights into the base model, you need to use 8-bit quantization or do non-quantized based training by removing \
the quantization section from your Ludwig configuration."""
)
@register_config_check
def check_prompt_requirements(config: "ModelConfig") -> None: # noqa: F821
"""Checks that prompt's template and task properties are valid, according to the description on the schema."""
if config.model_type != MODEL_LLM:
return
# TODO: `prompt` by default should be set to null, not a default dict:
# # If no prompt is provided, no validation necessary:
# if not config.prompt:
# return
from ludwig.schema.llms.prompt import PromptConfig, RetrievalConfig
if config.prompt == PromptConfig():
return
template = config.prompt.template
task = config.prompt.task
retrieval = config.prompt.retrieval
# If template is NOT provided, then task is required for zero/few shot learning:
if not template and not task:
raise ConfigValidationError("A prompt task is required if no template is provided!")
template_refs = set(findall(r"\{(.*?)\}", template)) if isinstance(template, str) else set()
# If a template IS provided (i.e. we are not doing a built-in zero/few-shot learning), then...
if template:
# If task is also provided, the template must contain it:
if task and "__task__" not in template_refs:
raise ConfigValidationError(
"When providing a task, you must make sure that the task keyword `{__task__}` is "
"present somewhere in the template string!"
)
# If retrieval is also provided, the template must reference it:
# TODO: retrieval by default should be set to null, not a default dict:
if retrieval and retrieval != RetrievalConfig() and "__context__" not in template_refs:
raise ConfigValidationError(
"When providing a retrieval config, you must make sure that the context keyword `{__context__}` is "
"present somewhere in the template string!"
)
# Otherwise, the template should at least contain the sample keyword or some input column:
# TODO: len(template_refs) is a hacky attempt to check that there are references to *something* in the
# string. The proper validation is to check the references against the features in the user's dataset - but we
# do not have access to the dataset in this code path right now.
if not task:
if len(template_refs) == 0 and "__sample__" not in template_refs:
raise ConfigValidationError(
"A template must contain at least one reference to a column or the sample keyword {__sample__} for "
"a JSON-serialized representation of non-output feature columns."
)
# Raise an error if template has a placeholder for the output feature name (column).
output_feature_col = config.output_features[0].column
if output_feature_col in template_refs:
raise ConfigValidationError(
"Prompt template should not have a reference to the output feature. The output feature is "
"automatically added to the end of the prompt template merged with the input at training time."
)
@register_config_check
def check_sample_ratio_and_size_compatible(config: "ModelConfig") -> None:
"""Checks that sample_size is not combined with a sample_ratio below 1.0."""
sample_ratio = config.preprocessing.sample_ratio
sample_size = config.preprocessing.sample_size
if sample_size is not None and sample_ratio < 1.0:
raise ConfigValidationError("sample_size cannot be used when sample_ratio < 1.0")
================================================
FILE: ludwig/config_validation/preprocessing.py
================================================
def check_global_max_sequence_length_fits_prompt_template(metadata, global_preprocessing_parameters):
"""Checks that the prompt template fits within the global max sequence length."""
if (
"global_max_sequence_length" in global_preprocessing_parameters
and global_preprocessing_parameters["global_max_sequence_length"] is not None
):
for feature_name, feature_metadata in metadata.items():
if (
"prompt_template_num_tokens" in feature_metadata
and feature_metadata["prompt_template_num_tokens"]
> global_preprocessing_parameters["global_max_sequence_length"]
):
raise ValueError(
f'The prompt contains ({feature_metadata["prompt_template_num_tokens"]}) tokens, which is more '
f"than the global_max_sequence_length "
f'({global_preprocessing_parameters["global_max_sequence_length"]}), which will remove all unique '
"information. Shorten the prompt, or increase the global max sequence length to > "
f'({feature_metadata["prompt_template_num_tokens"]}) to include the full prompt.'
)
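
To see which inputs trigger the error above, the same comparison can be run on plain dictionaries. This is a sketch only: `features_exceeding_limit` is a hypothetical helper that returns the offending feature names instead of raising, and the metadata values are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def features_exceeding_limit(metadata: Dict[str, Dict[str, Any]], max_len: Optional[int]) -> List[str]:
    """Return the features whose prompt template alone is longer than max_len tokens."""
    if max_len is None:
        # No global limit configured, so nothing can exceed it.
        return []
    return [
        name
        for name, feature_metadata in metadata.items()
        if "prompt_template_num_tokens" in feature_metadata
        and feature_metadata["prompt_template_num_tokens"] > max_len
    ]


metadata = {
    "question": {"prompt_template_num_tokens": 48},
    "label": {},  # features without a prompt template are ignored
}
print(features_exceeding_limit(metadata, max_len=32))
print(features_exceeding_limit(metadata, max_len=64))
```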
================================================
FILE: ludwig/config_validation/validation.py
================================================
from functools import lru_cache
from threading import Lock
import jsonschema.exceptions
from jsonschema import Draft7Validator, validate
from jsonschema.validators import extend
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BASE_MODEL, MODEL_ECD, MODEL_LLM, MODEL_TYPE
from ludwig.error import ConfigValidationError
# TODO(travis): figure out why we need these imports to avoid circular import error
from ludwig.schema.combiners.utils import get_combiner_jsonschema # noqa
from ludwig.schema.features.utils import get_input_feature_jsonschema, get_output_feature_jsonschema # noqa
from ludwig.schema.hyperopt import get_hyperopt_jsonschema # noqa
from ludwig.schema.trainer import get_model_type_jsonschema, get_trainer_jsonschema # noqa
from ludwig.schema.utils import unload_jsonschema_from_marshmallow_class
VALIDATION_LOCK = Lock()
@DeveloperAPI
@lru_cache(maxsize=3)
def get_schema(model_type: str = MODEL_ECD):
# Force populate combiner registry:
import ludwig.combiners.combiners # noqa: F401
from ludwig.schema.model_types.base import model_type_schema_registry
cls = model_type_schema_registry[model_type]
props = unload_jsonschema_from_marshmallow_class(cls)["properties"]
# TODO: Replace with more robust required logic later.
required = ["input_features", "output_features"]
if model_type == MODEL_LLM:
required += [BASE_MODEL]
return {
"type": "object",
"properties": props,
"title": "model_options",
"description": "Settings for Ludwig configuration",
"required": required,
"additionalProperties": True,
}
@lru_cache(maxsize=1)
def get_validator():
# Manually add support for tuples (pending upstream changes: https://github.com/Julian/jsonschema/issues/148):
def custom_is_array(checker, instance):
return isinstance(instance, list) or isinstance(instance, tuple)
# This creates a new class, so cache to prevent a memory leak:
# https://github.com/python-jsonschema/jsonschema/issues/868
type_checker = Draft7Validator.TYPE_CHECKER.redefine("array", custom_is_array)
return extend(Draft7Validator, type_checker=type_checker)
@DeveloperAPI
def check_schema(updated_config):
"""Emulates the pure JSONSchema validation that could be used in an environment without marshmallow.
The incoming config may not be comprehensive, but is assumed to be up to date with the latest ludwig schema.
"""
model_type = updated_config.get(MODEL_TYPE, MODEL_ECD)
error = None
with VALIDATION_LOCK:
try:
validate(instance=updated_config, schema=get_schema(model_type=model_type), cls=get_validator())
except jsonschema.exceptions.ValidationError as e:
# Capture error but don't raise here, otherwise we get the full output from `e`, which contains a dump
# of the entire schema
error = e
if error is not None:
raise ConfigValidationError(f"Failed to validate JSON schema for config. Error: {error.message}") from error
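
The comment in `get_validator` notes that extending a validator creates a new class on every call, which is why the function is wrapped in `lru_cache`. The sketch below illustrates that effect with the standard library only; `type(...)` stands in for `jsonschema.validators.extend`, and both function names and the `array_types` attribute are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_validator_class() -> type:
    # type(...) builds a brand-new class object each time it runs,
    # much like a validator-extension call would.
    return type("TupleAwareValidator", (object,), {"array_types": (list, tuple)})


def get_validator_class_uncached() -> type:
    return type("TupleAwareValidator", (object,), {"array_types": (list, tuple)})


# Cached: every caller shares one class object.
print(get_validator_class() is get_validator_class())
# Uncached: each call leaves another class object behind.
print(get_validator_class_uncached() is get_validator_class_uncached())
```

With the cache in place, repeated validation reuses a single class instead of accumulating one per call.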
================================================
FILE: ludwig/constants.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
INPUT_FEATURES = "input_features"
OUTPUT_FEATURES = "output_features"
INPUT = "input"
OUTPUT = "output"
BINARY = "binary"
CATEGORY = "category"
CATEGORY_DISTRIBUTION = "category_distribution"
INT = "int"
FLOAT = "float"
SPACE = "space"
NUMBER = "number"
SET = "set"
BAG = "bag"
TEXT = "text"
SEQUENCE = "sequence"
TIMESERIES = "timeseries"
IMAGE = "image"
AUDIO = "audio"
DATE = "date"
H3 = "h3"
VECTOR = "vector"
HEIGHT = "height"
WIDTH = "width"
INFER_IMAGE_DIMENSIONS = "infer_image_dimensions"
INFER_IMAGE_MAX_HEIGHT = "infer_image_max_height"
INFER_IMAGE_MAX_WIDTH = "infer_image_max_width"
INFER_IMAGE_SAMPLE_SIZE = "infer_image_sample_size"
INFER_IMAGE_NUM_CLASSES = "infer_image_num_classes"
IMAGE_MAX_CLASSES = 128
NUM_CLASSES = "num_classes"
NUM_CHANNELS = "num_channels"
REQUIRES_EQUAL_DIMENSIONS = "requires_equal_dimensions"
USE_PRETRAINED = "use_pretrained"
TRAINABLE = "trainable"
CLASS_WEIGHTS = "class_weights"
USED_TOKENS = "used_tokens"
LOSS = "loss"
ROC_AUC = "roc_auc"
EVAL_LOSS = "eval_loss"
TRAIN_MEAN_LOSS = "train_mean_loss"
SEQUENCE_SOFTMAX_CROSS_ENTROPY = "sequence_softmax_cross_entropy"
NEXT_TOKEN_SOFTMAX_CROSS_ENTROPY = "next_token_softmax_cross_entropy"
SOFTMAX_CROSS_ENTROPY = "softmax_cross_entropy"
SIGMOID_CROSS_ENTROPY = "sigmoid_cross_entropy"
BINARY_WEIGHTED_CROSS_ENTROPY = "binary_weighted_cross_entropy"
THRESHOLD = "threshold"
VALIDATION_METRIC = "validation_metric"
ACCURACY = "accuracy"
ACCURACY_MICRO = "accuracy_micro"
HITS_AT_K = "hits_at_k"
MEAN_HITS_AT_K = "mean_hits_at_k"
ERROR = "error"
ABSOLUTE_ERROR = "absolute_error"
SQUARED_ERROR = "squared_error"
MEAN_SQUARED_ERROR = "mean_squared_error"
ROOT_MEAN_SQUARED_ERROR = "root_mean_squared_error"
ROOT_MEAN_SQUARED_PERCENTAGE_ERROR = "root_mean_squared_percentage_error"
MEAN_ABSOLUTE_ERROR = "mean_absolute_error"
MEAN_ABSOLUTE_PERCENTAGE_ERROR = "mean_absolute_percentage_error"
HUBER = "huber"
CORN = "corn"
R2 = "r2"
EDIT_DISTANCE = "edit_distance"
PERPLEXITY = "perplexity"
NEXT_TOKEN_PERPLEXITY = "next_token_perplexity"
JACCARD = "jaccard"
PRECISION = "precision"
RECALL = "recall"
SPECIFICITY = "specificity"
PREDICTIONS = "predictions"
RESPONSE = "RESPONSE"
TOP_K = "top_k"
TOP_K_PREDICTIONS = "top_k_predictions"
PROBABILITY = "probability"
PROBABILITIES = "probabilities"
SPLIT_PROBABILITIES = "split_probabilities"
TOKEN_ACCURACY = "token_accuracy"
LAST_ACCURACY = "last_accuracy"
SEQUENCE_ACCURACY = "sequence_accuracy"
LAST_PROBABILITIES = "last_probabilities"
LAST_PREDICTIONS = "last_predictions"
LENGTHS = "lengths"
TIED = "tied"
COMBINED = "combined"
PREPROCESSING = "preprocessing"
FILL_WITH_CONST = "fill_with_const"
FILL_WITH_MODE = "fill_with_mode"
FILL_WITH_MEAN = "fill_with_mean"
FILL_WITH_FALSE = "fill_with_false"
FILL_WITH_TRUE = "fill_with_true"
BFILL = "bfill"
FFILL = "ffill"
DROP_ROW = "drop_row"
MISSING_VALUE_STRATEGY = "missing_value_strategy"
MISSING_VALUE_STRATEGY_OPTIONS = [
FILL_WITH_CONST,
FILL_WITH_MODE,
BFILL,
FFILL,
DROP_ROW,
]
CROP_OR_PAD = "crop_or_pad"
INTERPOLATE = "interpolate"
RESIZE_METHODS = [CROP_OR_PAD, INTERPOLATE]
# Special symbols for text.
STOP_SYMBOL = "<EOS>"
START_SYMBOL = "<SOS>"
PADDING_SYMBOL = "<PAD>"
UNKNOWN_SYMBOL = "<UNK>"
TRAINER = "trainer"
OPTIMIZER = "optimizer"
METRIC = "metric"
PREDICTION = "prediction"
LOGITS = "logits"
HIDDEN = "hidden"
LAST_HIDDEN = "last_hidden"
ENCODER_OUTPUT = "encoder_output"
ENCODER_OUTPUT_STATE = "encoder_output_state"
PROJECTION_INPUT = "projection_input"
LEARNING_RATE_SCHEDULER = "learning_rate_scheduler"
SEMANTIC = "semantic"
RANDOM = "random"
SUM = "sum"
APPEND = "append"
SEQ_SUM = "seq_sum"
AVG_EXP = "avg_exp"
TRAIN = "train"
TRAINING = "training"
VALIDATION = "validation"
TEST = "test"
EVALUATION = "evaluation"
SPLIT = "split"
FORCE_SPLIT = "force_split"
STRATIFY = "stratify"
FULL = "full"
TRAIN_SPLIT = 0
VALIDATION_SPLIT = 1
TEST_SPLIT = 2
MIN_DATASET_SPLIT_ROWS = 3 # The minimum number of rows in a split. Splits smaller than this size are treated as empty.
META = "meta"
HYPEROPT = "hyperopt"
STRATEGY = "strategy"
EXECUTOR = "executor"
MINIMIZE = "minimize"
MAXIMIZE = "maximize"
SAMPLER = "sampler"
NUM_SAMPLES = "num_samples"
SEARCH_ALG = "search_alg"
SCHEDULER = "scheduler"
PARAMETERS = "parameters"
MAX_CONCURRENT_TRIALS = "max_concurrent_trials"
CPU_RESOURCES_PER_TRIAL = "cpu_resources_per_trial"
GPU_RESOURCES_PER_TRIAL = "gpu_resources_per_trial"
GOAL = "goal"
GRID_SEARCH = "grid_search"
NAME = "name"
COLUMN = "column"
TYPE = "type"
ACTIVE = "active"
RAY = "ray"
IN_MEMORY = "in_memory"
PROC_COLUMN = "proc_column"
CHECKSUM = "checksum"
HDF5 = "hdf5"
PARQUET = "parquet"
SRC = "dataset_src"
EARLY_STOP = "early_stop"
EPOCHS = "epochs"
BATCH_SIZE = "batch_size"
EVAL_BATCH_SIZE = "eval_batch_size"
EFFECTIVE_BATCH_SIZE = "effective_batch_size"
MAX_BATCH_SIZE = "max_batch_size"
DEFAULT_BATCH_SIZE = "auto"
FALLBACK_BATCH_SIZE = 128
# The smallest batch size that is supported on Ludwig.
MINIMUM_BATCH_SIZE = 1
# 2^40. Used for `max_batch_size` config param. Not a hard constraint for `batch_size` config param.
MAX_POSSIBLE_BATCH_SIZE = 1099511627776
# min batch size. Used as a floor for batch size tuning.
MIN_POSSIBLE_BATCH_SIZE = 1
# max batch size for dataset is 20% of dataset size
MAX_BATCH_SIZE_DATASET_FRACTION = 0.2
MAX_CPU_BATCH_SIZE = 128
LEARNING_RATE = "learning_rate"
INPUT_SIZE = "input_size"
USE_BIAS = "use_bias"
BIAS = "bias"
DEFAULT_USE_BIAS = "default_use_bias"
DEFAULT_BIAS = "default_bias"
CONV_USE_BIAS = "conv_use_bias"
CONV_BIAS = "conv_bias"
AUTO = "auto"
CONFIG = "config"
CLIP = "clip"
DEPENDENCIES = "dependencies"
REDUCE_INPUT = "reduce_input"
REDUCE_DEPENDENCIES = "reduce_dependencies"
BACKEND = "backend"
COMBINER = "combiner"
ENCODER = "encoder"
DECODER = "decoder"
TRAINABLE = "trainable"
DEFAULTS = "defaults"
DEFAULT = "default"
DEFAULT_VALIDATION_METRIC = "default_validation_metric"
BALANCE_PERCENTAGE_TOLERANCE = 0.03
IMBALANCE_DETECTION_RATIO = 0.05
TABULAR = "tabular"
AUTOML_DEFAULT_TABULAR_MODEL = "tabnet"
AUTOML_DEFAULT_TEXT_ENCODER = "bert"
AUTOML_SMALLER_TEXT_ENCODER = "distilbert"
AUTOML_TEXT_ENCODER_MAX_TOKEN_LEN = 512
AUTOML_SMALLER_TEXT_LENGTH = 128
AUTOML_LARGE_TEXT_DATASET = 100000
AUTOML_MAX_ROWS_PER_CHECKPOINT = 350000
AUTOML_DEFAULT_IMAGE_ENCODER = "stacked_cnn"
HYPEROPT_WARNING = (
"You are running the ludwig train command but there’s a hyperopt section present in your config. "
"It will be ignored. If you want to run hyperopt you should use the following command: ludwig "
"hyperopt\n\n"
)
CONTINUE_PROMPT = "Do you want to continue? "
DEFAULT_AUDIO_TENSOR_LENGTH = 70000
AUDIO_FEATURE_KEYS = [
"type",
"window_length_in_s",
"window_shift_in_s",
"num_fft_points",
"window_type",
"num_filter_bands",
]
BASE_MODEL = "base_model"
MODEL_TYPE = "model_type"
MODEL_ECD = "ecd"
MODEL_LLM = "llm"
DASK_MODULE_NAME = "dask.dataframe"
LUDWIG_VERSION = "ludwig_version"
PREPROCESSOR = "preprocessor"
PREDICTOR = "predictor"
POSTPROCESSOR = "postprocessor"
TARGET_MODULES = "target_modules"
GENERATION = "generation"
PROMPT = "prompt"
ADAPTER = "adapter"
QUANTIZATION = "quantization"
MIN_QUANTIZATION_BITS_FOR_MERGE_AND_UNLOAD = 8
PRETRAINED_ADAPTER_WEIGHTS = "pretrained_adapter_weights"
MERGE_ADAPTER_INTO_BASE_MODEL = "merge_adapter_into_base_model"
PROGRESSBAR = "progressbar"
# CrossEntropyLoss for LLMs
IGNORE_INDEX_TOKEN_ID = -100
S3 = "s3"
CACHE = "cache"
# If `use_torch_profiler=True` in LudwigProfiler, LUDWIG_TAG is prepended to the specified experiment tag
# (LudwigProfiler(tag="...", ..)). This edited tag is passed in to `torch.profiler.record_function` so we can
# retrieve torch ops for the tagged code blocks/functions.
LUDWIG_TAG = "[ludwig]"
# Retry constants
TRIES = 5
DELAY = 1
BACKOFF = 2
JITTER = (0, 1)
# image support constants
IMAGENET1K = "imagenet1k"
AUGMENTATION = "augmentation"
LUDWIG_SCHEMA_VALIDATION_POLICY = "LUDWIG_SCHEMA_VALIDATION_POLICY"
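
The retry constants above (`TRIES`, `DELAY`, `BACKOFF`, `JITTER`) read like the arguments of a conventional exponential-backoff helper. The sketch below assumes that interpretation: `TRIES` attempts, an initial `DELAY` in seconds multiplied by `BACKOFF` after each failure, plus a random amount drawn from the `JITTER` range. This file does not show how Ludwig consumes these values, and `backoff_delays` is a hypothetical helper.

```python
import random
from typing import Iterator, Optional, Tuple

TRIES = 5
DELAY = 1
BACKOFF = 2
JITTER = (0, 1)


def backoff_delays(
    tries: int = TRIES,
    delay: float = DELAY,
    backoff: float = BACKOFF,
    jitter: Tuple[float, float] = JITTER,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield the sleep duration before each retry (tries - 1 sleeps for `tries` attempts)."""
    rng = rng or random.Random()
    for _ in range(tries - 1):
        yield delay
        # Grow the delay geometrically, then add a random offset to spread out retries.
        delay = delay * backoff + rng.uniform(*jitter)


# With jitter disabled the waits are 1, 2, 4 and 8 seconds.
print(list(backoff_delays(jitter=(0, 0))))
```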
================================================
FILE: ludwig/contrib.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Module for handling contributed support."""
import argparse
from ludwig.contribs import contrib_registry, ContribLoader
def create_load_action(contrib_loader: ContribLoader) -> argparse.Action:
class LoadContribAction(argparse.Action):
def __call__(self, parser, namespace, values, option_string):
items = getattr(namespace, self.dest) or []
items.append(contrib_loader.load())
setattr(namespace, self.dest, items)
return LoadContribAction
def add_contrib_callback_args(parser: argparse.ArgumentParser):
for contrib_name, contrib_loader in contrib_registry.items():
parser.add_argument(
f"--{contrib_name}",
dest="callbacks",
nargs=0,
action=create_load_action(contrib_loader),
)
def preload(argv):
for arg in argv:
if arg.startswith("--"):
arg = arg[2:]
if arg in contrib_registry:
contrib_registry[arg].preload()
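
The pattern in `add_contrib_callback_args` is several bare flags sharing one `dest`, with each flag appending a freshly loaded object. It can be tried with the standard library alone. In this sketch `demo_registry` and its string "callbacks" are hypothetical stand-ins for `contrib_registry` and real callback instances.

```python
import argparse
from typing import Callable, Dict

# Hypothetical registry: flag name -> factory returning a callback object.
demo_registry: Dict[str, Callable[[], str]] = {
    "alpha": lambda: "alpha-callback",
    "beta": lambda: "beta-callback",
}


def create_load_action(factory: Callable[[], str]) -> type:
    class LoadAction(argparse.Action):
        def __call__(self, parser, namespace, values, option_string=None):
            # All flags share dest="callbacks", so append instead of overwriting.
            items = getattr(namespace, self.dest) or []
            items.append(factory())
            setattr(namespace, self.dest, items)

    return LoadAction


parser = argparse.ArgumentParser()
for name, factory in demo_registry.items():
    # nargs=0 makes the option a bare flag that still triggers the custom action.
    parser.add_argument(f"--{name}", dest="callbacks", nargs=0, action=create_load_action(factory))

args = parser.parse_args(["--beta", "--alpha"])
print(args.callbacks)
```

Callbacks are collected in the order the flags appear on the command line; when no flag is given, `callbacks` stays `None`.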
================================================
FILE: ludwig/contribs/__init__.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""All contrib classes must implement the `ludwig.callbacks.Callback` interface.
If you don't want to handle the call, either provide an empty method with `pass`, or just don't implement the method.
"""
from abc import ABC, abstractmethod
from ludwig.callbacks import Callback
class ContribLoader(ABC):
@abstractmethod
def load(self) -> Callback:
"""Returns an instance of the callback, whose hooks will be invoked at runtime."""
def preload(self):
"""Will always be called when Ludwig CLI is invoked, preload gives the callback an opportunity to import or
create any shared resources.
Importing required 3rd-party libraries should be done here, e.g. `import wandb`. preload is guaranteed to be called
before any other callback method, and will only be called once per process.
"""
# Contributors, load your class here:
class AimLoader(ContribLoader):
def load(self) -> Callback:
from ludwig.contribs.aim import AimCallback
return AimCallback()
def preload(self):
import aim # noqa
class CometLoader(ContribLoader):
def load(self) -> Callback:
from ludwig.contribs.comet import CometCallback
return CometCallback()
def preload(self):
import comet_ml # noqa
class WandbLoader(ContribLoader):
def load(self) -> Callback:
from ludwig.contribs.wandb import WandbCallback
return WandbCallback()
def preload(self):
import wandb # noqa
class MlflowLoader(ContribLoader):
def load(self) -> Callback:
from ludwig.contribs.mlflow import MlflowCallback
return MlflowCallback()
contrib_registry = {
# Contributors, add your class here:
"comet": CometLoader(),
"wandb": WandbLoader(),
"mlflow": MlflowLoader(),
"aim": AimLoader(),
}
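
Adding a new integration follows the same two steps as the loaders above: subclass `ContribLoader`, then register an instance under the CLI flag name. The sketch below is self-contained, so `Callback` and `ContribLoader` are re-declared as stand-ins; `JsonLogCallback`, `JsonLogLoader` and the `"jsonlog"` key are hypothetical.

```python
from abc import ABC, abstractmethod


class Callback:
    """Stand-in for ludwig.callbacks.Callback (assumption for this sketch)."""


class ContribLoader(ABC):
    @abstractmethod
    def load(self) -> Callback:
        """Return the callback instance whose hooks run at runtime."""

    def preload(self):
        """Optional hook for importing third-party dependencies early."""


class JsonLogCallback(Callback):
    """Hypothetical contrib callback."""


class JsonLogLoader(ContribLoader):
    def load(self) -> Callback:
        # Instantiate lazily so nothing heavy happens until the flag is used.
        return JsonLogCallback()

    def preload(self):
        import json  # noqa: F401  (placeholder for a real third-party import)


contrib_registry = {"jsonlog": JsonLogLoader()}
callback = contrib_registry["jsonlog"].load()
print(type(callback).__name__)
```

Because `load` is abstract, a loader that forgets to implement it cannot be instantiated, while `preload` remains optional.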
================================================
FILE: ludwig/contribs/aim.py
================================================
import json
import logging
from ludwig.api_annotations import PublicAPI
from ludwig.callbacks import Callback
from ludwig.utils.data_utils import NumpyEncoder
from ludwig.utils.package_utils import LazyLoader
aim = LazyLoader("aim", globals(), "aim")
logger = logging.getLogger(__name__)
@PublicAPI
class AimCallback(Callback):
"""Class that defines the methods necessary to hook into process."""
def __init__(self, repo=None):
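self.repo = repo
# Initialize here so hooks that test `self.aim_run` work before on_train_init runs.
self.aim_run = None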
def on_train_init(
self,
base_config,
experiment_directory,
experiment_name,
model_name,
output_directory,
resume_directory,
):
logger.info("aim.on_train_init() called...")
try:
query = f'run.name == "{model_name}"'
if self.repo is None:
aim_repo = aim.Repo.default_repo()
else:
aim_repo = aim.Repo.from_path(self.repo)
runs_generator = aim_repo.query_runs(query)
run = next(runs_generator.iter_runs())
run_hash = run.run.hash
self.aim_run = aim.Run(run_hash=run_hash, repo=self.repo, experiment=experiment_name)
except Exception:
self.aim_run = aim.Run(repo=self.repo, experiment=experiment_name)
self.aim_run.name = model_name
self.aim_run["base_config"] = self.normalize_config(base_config)
params = dict(name=model_name, dir=experiment_directory)
self.aim_run["params"] = params
def aim_track(self, progress_tracker):
logger.info(f"aim.aim_track() called for epoch {progress_tracker.epoch}, step: {progress_tracker.steps}")
if self.aim_run:
for key, value in progress_tracker.log_metrics().items():
if "metrics" in key and "best" not in key:
metrics_dict_name, feature_name, metric_name = key.split(".")
self.aim_run.track(
value,
name=metric_name,
context={metrics_dict_name: feature_name},
epoch=progress_tracker.epoch,
step=progress_tracker.steps,
)
def on_trainer_train_teardown(self, trainer, progress_tracker, save_path, is_coordinator: bool):
pass
def on_train_start(self, model, config, *args, **kwargs):
logger.info("aim.on_train_start() called...")
config = config.copy()
del config["input_features"]
del config["output_features"]
self.aim_run["train_config"] = self.normalize_config(config)
def on_train_end(self, output_directory, *args, **kwargs):
pass
def on_eval_end(self, trainer, progress_tracker, save_path):
optimizer_config = {}
for index, group in enumerate(trainer.optimizer.param_groups):
for key in group:
if "param" not in key:
optimizer_config[f"param_group_{index}_{key}"] = group[key]
self.aim_run["optimizer_config"] = self.normalize_config(optimizer_config)
self.aim_track(progress_tracker)
def on_ludwig_end(self):
self.aim_run.close()
self.aim_run = None
def on_visualize_figure(self, fig):
logger.info("aim.on_visualize_figure() called...")
if self.aim_run:
self.aim_run.track(aim.Figure(fig), name="Figure", context={"type": "Training Figure"})
@staticmethod
def normalize_config(config):
"""Convert to json string and back again to remove numpy types."""
return json.loads(json.dumps(config, cls=NumpyEncoder))
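
`normalize_config` relies on a JSON round trip with a custom encoder to turn non-native values into plain Python types. A minimal sketch of that idea without a NumPy dependency: `DuckEncoder` is a hypothetical stand-in for `NumpyEncoder` (whose implementation is not shown here), and `FakeScalar` mimics a NumPy scalar through its `tolist()` method.

```python
import json


class DuckEncoder(json.JSONEncoder):
    """Hypothetical stand-in for ludwig.utils.data_utils.NumpyEncoder."""

    def default(self, o):
        # NumPy scalars and arrays both expose tolist(); use it when available.
        if hasattr(o, "tolist"):
            return o.tolist()
        if isinstance(o, set):
            return list(o)
        return super().default(o)


def normalize_config(config):
    # Serialize with the custom encoder, then parse back into plain dicts and lists.
    return json.loads(json.dumps(config, cls=DuckEncoder))


class FakeScalar:
    """Mimics a NumPy scalar for the demo without requiring NumPy."""

    def __init__(self, value):
        self.value = value

    def tolist(self):
        return self.value


print(normalize_config({"lr": FakeScalar(0.001), "dims": (32, 64)}))
```

After the round trip the result contains only JSON-native types, so tuples come back as lists.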
================================================
FILE: ludwig/contribs/comet.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import os
from datetime import datetime
from ludwig.api_annotations import PublicAPI
from ludwig.callbacks import Callback
from ludwig.utils.package_utils import LazyLoader
comet_ml = LazyLoader("comet_ml", globals(), "comet_ml")
logger = logging.getLogger(__name__)
@PublicAPI
class CometCallback(Callback):
"""Class that defines the methods necessary to hook into process."""
def __init__(self):
self.cometml_experiment = None
def on_train_init(
self,
base_config,
experiment_directory,
experiment_name,
model_name,
output_directory,
resume_directory,
):
if self.cometml_experiment:
# Comet ML already initialized
return
try:
self.cometml_experiment = comet_ml.Experiment(log_code=False, project_name=experiment_name)
except Exception:
self.cometml_experiment = None
logger.exception("comet_ml.Experiment() had errors. Perhaps you need to define COMET_API_KEY")
raise
self.cometml_experiment.set_name(model_name)
self.cometml_experiment.set_filename("Ludwig API")
config = comet_ml.get_config()
self._save_config(config, directory=experiment_directory)
def on_train_start(self, model, config, config_fp, *args, **kwargs):
if self.cometml_experiment:
# todo v0.4: currently no clear way to set model graph
# see: https://github.com/comet-ml/issue-tracking/issues/296
# if model:
# self.cometml_experiment.set_model_graph(
# str(model._graph.as_graph_def()))
if config:
if config_fp:
base_name = os.path.basename(config_fp)
else:
base_name = "config.yaml"
if "." in base_name:
base_name = base_name.rsplit(".", 1)[0] + ".json"
else:
base_name = base_name + ".json"
self.cometml_experiment.log_asset_data(config, base_name)
def on_train_end(self, output_directory, *args, **kwargs):
if self.cometml_experiment:
self.cometml_experiment.log_asset_folder(output_directory)
def on_eval_end(self, trainer, progress_tracker, save_path):
"""Called from ludwig/models/model.py."""
if self.cometml_experiment:
for key, value in progress_tracker.log_metrics().items():
self.cometml_experiment.log_metric(key, value)
def on_epoch_end(self, trainer, progress_tracker, save_path):
"""Called from ludwig/models/model.py."""
if self.cometml_experiment:
for key, value in progress_tracker.log_metrics().items():
self.cometml_experiment.log_metric(key, value)
def on_visualize_figure(self, fig):
if self.cometml_experiment:
self.cometml_experiment.log_figure(fig)
def on_cmdline(self, cmd, *args):
self.cometml_experiment = None
if cmd in {"train", "experiment"}:
# create a new experiment
try:
self.cometml_experiment = comet_ml.Experiment(log_code=False)
except Exception:
logger.exception("comet_ml.Experiment() had errors. Perhaps you need to define COMET_API_KEY")
return
elif cmd in {"visualize", "predict", "evaluate"}:
# restore from an existing experiment
try:
self.cometml_experiment = comet_ml.ExistingExperiment()
except Exception:
logger.exception("Ignored --comet. No '.comet.config' file")
return
else:
# unhandled command
return
cli = self._make_command_line(cmd, args)
self.cometml_experiment.set_code(cli)
self.cometml_experiment.set_filename("Ludwig CLI")
self._log_html(cli)
config = comet_ml.get_config()
self._save_config(config)
def _save_config(self, config, directory="."):
# save the .comet.config here:
config["comet.experiment_key"] = self.cometml_experiment.id
config.save(directory=directory)
def _log_html(self, text):
# log the text to the html tab:
now = datetime.now()
timestamp = now.strftime("%m/%d/%Y %H:%M:%S")
self.cometml_experiment.log_html(f"<p><b>{timestamp}</b>: {text}</p>")
def _make_command_line(self, cmd, args):
# put the comet flag back in:
arg_str = " ".join(list(args[:2]) + ["--comet"] + list(args[2:]))
return f"ludwig {cmd} {arg_str}"
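
The asset-naming logic in `on_train_start` keeps the config file's stem and forces a `.json` extension. Pulled out as a pure function it is easy to check; `config_asset_name` is a hypothetical helper name used only for this sketch.

```python
import os
from typing import Optional


def config_asset_name(config_fp: Optional[str]) -> str:
    """Mirror the naming logic in on_train_start: keep the stem, force a .json extension."""
    base_name = os.path.basename(config_fp) if config_fp else "config.yaml"
    if "." in base_name:
        # Only the last extension is replaced, so "my.config.yaml" keeps "my.config".
        return base_name.rsplit(".", 1)[0] + ".json"
    return base_name + ".json"


for path in (None, "/tmp/run/model.yaml", "experiments/my.config.yaml", "Makefile"):
    print(path, "->", config_asset_name(path))
```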
================================================
FILE: ludwig/contribs/mlflow/__init__.py
================================================
import logging
import os
import queue
import threading
from ludwig.api_annotations import DeveloperAPI, PublicAPI
from ludwig.callbacks import Callback
from ludwig.constants import TRAINER
from ludwig.globals import MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME, TRAIN_SET_METADATA_FILE_NAME
from ludwig.types import TrainingSetMetadataDict
from ludwig.utils.data_utils import chunk_dict, flatten_dict, save_json, to_json_dict
from ludwig.utils.package_utils import LazyLoader
mlflow = LazyLoader("mlflow", globals(), "mlflow")
logger = logging.getLogger(__name__)
def _get_runs(experiment_id: str):
return mlflow.tracking.client.MlflowClient().search_runs([experiment_id])
@DeveloperAPI
def get_or_create_experiment_id(experiment_name, artifact_uri: str = None):
"""Gets experiment id from mlflow."""
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is not None:
return experiment.experiment_id
return mlflow.create_experiment(name=experiment_name, artifact_location=artifact_uri)
# Included for backwards compatibility, Deprecated.
# TODO(daniel): delete this.
_get_or_create_experiment_id = get_or_create_experiment_id
@PublicAPI
class MlflowCallback(Callback):
def __init__(self, tracking_uri=None, log_artifacts: bool = True):
self.logged_steps = set()
if tracking_uri:
mlflow.set_tracking_uri(tracking_uri)
self.tracking_uri = mlflow.get_tracking_uri()
active_run = mlflow.active_run()
if active_run is not None:
# Use experiment already set in the current environment
self.run = active_run
self.experiment_id = self.run.info.experiment_id
self.experiment_name = mlflow.get_experiment(self.experiment_id).name
self.external_run = True
else:
# Will create an experiment at training time
self.run = None
self.experiment_id = None
self.experiment_name = None
self.external_run = False
self.run_ended = False
self.training_set_metadata = None
self.config = None
self.save_in_background = True
self.save_fn = None
self.save_thread = None
self.log_artifacts = log_artifacts
def get_experiment_id(self, experiment_name):
return get_or_create_experiment_id(experiment_name)
def on_preprocess_end(
self,
training_set: "Dataset", # noqa
validation_set: "Dataset", # noqa
test_set: "Dataset", # noqa
training_set_metadata: TrainingSetMetadataDict,
):
self.training_set_metadata = training_set_metadata
def on_hyperopt_init(self, experiment_name):
self.experiment_id = self.get_experiment_id(experiment_name)
self.experiment_name = experiment_name
def on_hyperopt_trial_start(self, parameters):
# Filter out mlflow params like tracking URI, experiment ID, etc.
params = {k: v for k, v in parameters.items() if k != "mlflow"}
self._log_params({"hparam": params})
# TODO(travis): figure out a good way to support this. The problem with
# saving artifacts in the background with hyperopt is early stopping. If
# the scheduler decides to terminate a process, then currently there's no
# mechanism to detect this and "flush" the queue of pending writes before
# stopping. Should work with Ray Tune team to come up with a solution.
self.save_in_background = False
def on_train_init(self, base_config, experiment_name, output_directory, resume_directory, **kwargs):
# Experiment may already have been set during hyperopt init, in
# which case we don't want to create a new experiment / run, as
# this should be handled by the executor.
if self.experiment_id is None:
mlflow.end_run()
self.experiment_id = self.get_experiment_id(experiment_name)
self.experiment_name = experiment_name
active_run = mlflow.active_run()
if active_run is not None:
# Currently active run started by Ray Tune MLflow mixin or external run
self.run = active_run
else:
run_id = None
if resume_directory is not None:
previous_runs = _get_runs(self.experiment_id)
if len(previous_runs) > 0:
run_id = previous_runs[0].info.run_id
if run_id is not None:
self.run = mlflow.start_run(run_id=run_id)
else:
run_name = os.path.basename(output_directory)
self.run = mlflow.start_run(experiment_id=self.experiment_id, run_name=run_name)
self.log_config(base_config)
def log_config(self, config):
if self.log_artifacts:
mlflow.log_dict(to_json_dict(config), "config.yaml")
def on_train_start(self, config, **kwargs):
self.config = config
self._log_params({TRAINER: config[TRAINER]})
def on_train_end(self, output_directory):
if self.log_artifacts:
_log_artifacts(output_directory)
if self.run is not None and not self.external_run:
# Only end runs managed internally to this callback
mlflow.end_run()
self.run_ended = True
def on_trainer_train_setup(self, trainer, save_path, is_coordinator):
if not is_coordinator:
return
# When running on a remote worker, the model metadata files will only have been
# saved to the driver process, so re-save it here before uploading.
training_set_metadata_path = os.path.join(save_path, TRAIN_SET_METADATA_FILE_NAME)
if not os.path.exists(training_set_metadata_path):
save_json(training_set_metadata_path, self.training_set_metadata)
model_hyperparameters_path = os.path.join(save_path, MODEL_HYPERPARAMETERS_FILE_NAME)
if not os.path.exists(model_hyperparameters_path):
save_json(model_hyperparameters_path, self.config)
if self.save_in_background:
save_queue = queue.Queue()
self.save_fn = lambda args: save_queue.put(args)
self.save_thread = threading.Thread(target=_log_mlflow_loop, args=(save_queue, self.log_artifacts))
self.save_thread.start()
else:
self.save_fn = lambda args: _log_mlflow(*args, self.log_artifacts)
def on_eval_end(self, trainer, progress_tracker, save_path):
if progress_tracker.steps not in self.logged_steps:
self.logged_steps.add(progress_tracker.steps)
# Adds a tuple to the logging queue.
# True is passed to indicate that the background saving loop should continue.
self.save_fn((progress_tracker.log_metrics(), progress_tracker.steps, save_path, True))
def on_trainer_train_teardown(self, trainer, progress_tracker, save_path, is_coordinator):
if is_coordinator:
if progress_tracker.steps not in self.logged_steps:
self.logged_steps.add(progress_tracker.steps)
# Adds a tuple to the logging queue.
# False is passed to indicate that the background saving loop should break.
self.save_fn((progress_tracker.log_metrics(), progress_tracker.steps, save_path, False))
# False ensures that the background saving loop breaks.
# TODO(Justin): This should probably live in on_ludwig_end, once that's implemented.
self.save_fn((None, None, None, False))
# Close the save_thread.
if self.save_thread is not None:
self.save_thread.join()
# if self.save_thread.is_alive():
# logger.warning("MLFlow save thread timed out and did not close properly.")
def on_visualize_figure(self, fig):
# TODO: need to also include a filename for this figure
# mlflow.log_figure(fig)
pass
def prepare_ray_tune(self, train_fn, tune_config, tune_callbacks):
from functools import wraps
from ray.air.integrations.mlflow import setup_mlflow
mlflow_config = {
"experiment_id": self.experiment_id,
"experiment_name": self.experiment_name,
"tracking_uri": mlflow.get_tracking_uri(),
}
@wraps(train_fn)
def wrapper(config, **kwargs):
setup_mlflow(config, **mlflow_config)
return train_fn(config, **kwargs)
return wrapper, {
**tune_config,
}
def _log_params(self, params):
flat_params = flatten_dict(params)
for chunk in chunk_dict(flat_params, chunk_size=100):
mlflow.log_params(chunk)
def __setstate__(self, d):
self.__dict__ = d
if self.tracking_uri:
mlflow.set_tracking_uri(self.tracking_uri)
if self.run and not self.run_ended:
# Run has already been set, but may not be active due to training workers running in a separate
# process, so resume the run
mlflow.end_run()
self.run = mlflow.start_run(run_id=self.run.info.run_id, experiment_id=self.run.info.experiment_id)
def _log_mlflow_loop(q: queue.Queue, log_artifacts: bool = True):
"""The save_fn for the background thread that logs to MLFlow when save_in_background is True."""
should_continue = True
while should_continue:
elem = q.get()
log_metrics, steps, save_path, should_continue = elem
if log_metrics is None:
# Break out of the loop if we're not going to log anything.
break
if "llm_eval_examples" in log_metrics and log_metrics["llm_eval_examples"] is not None:
# mlflow.log_dict(log_metrics["llm_eval_examples"], artifact_file="llm_eval_examples.json")
# Delete the table from the metrics dict so we don't try to log it with the other metrics
del log_metrics["llm_eval_examples"]
mlflow.log_metrics(log_metrics, step=steps)
if not q.empty():
# in other words, don't bother saving the model artifacts
# if we're about to do it again
continue
if log_artifacts:
_log_model(save_path)
def _log_mlflow(log_metrics, steps, save_path, should_continue, log_artifacts: bool = True):
"""The save_fn for the MlflowCallback.
This is used when save_in_background is False.
"""
if log_metrics is not None:
if "llm_eval_examples" in log_metrics and log_metrics["llm_eval_examples"] is not None:
# mlflow.log_dict(log_metrics["llm_eval_examples"], artifact_file="llm_eval_examples.json")
# Delete the table from the metrics dict so we don't try to log it with the other metrics
del log_metrics["llm_eval_examples"]
mlflow.log_metrics(log_metrics, step=steps)
if log_artifacts:
_log_model(save_path)
def _log_artifacts(output_directory):
try:
contents = os.listdir(output_directory)
except FileNotFoundError:
logger.warning(f"_log_artifacts: output_directory does not exist: {output_directory}")
return
for fname in contents:
lpath = os.path.join(output_directory, fname)
if fname == MODEL_FILE_NAME:
_log_model(lpath)
else:
mlflow.log_artifact(lpath)
def _log_model(lpath):
# Lazy import to avoid requiring this package
from ludwig.contribs.mlflow.model import log_saved_model
log_saved_model(lpath)
================================================
FILE: ludwig/contribs/mlflow/model.py
================================================
import logging
import os
import shutil
import tempfile
import mlflow
import yaml
from mlflow import pyfunc
from mlflow.exceptions import MlflowException
from mlflow.models import Model
from mlflow.models.model import MLMODEL_FILE_NAME
from mlflow.models.signature import ModelSignature
from mlflow.models.utils import _save_example, ModelInputExample
from mlflow.tracking._model_registry import DEFAULT_AWAIT_MAX_SLEEP_SECONDS
from mlflow.tracking.artifact_utils import _download_artifact_from_uri
from mlflow.utils.environment import _mlflow_conda_env
from mlflow.utils.model_utils import _get_flavor_configuration
from ludwig.api_annotations import DeveloperAPI
from ludwig.globals import MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME
from ludwig.utils.data_utils import load_json
FLAVOR_NAME = "ludwig"
_logger = logging.getLogger(__name__)
def get_default_conda_env():
"""
:return: The default Conda environment for MLflow Models produced by calls to
:func:`save_model()` and :func:`log_model()`.
"""
import ludwig
# Ludwig is not yet available via the default conda channels, so we install it via pip
return _mlflow_conda_env(
additional_conda_deps=None,
additional_pip_deps=[f"ludwig=={ludwig.__version__}"],
additional_conda_channels=None,
)
def save_model(
ludwig_model,
path,
conda_env=None,
mlflow_model=None,
signature: ModelSignature = None,
input_example: ModelInputExample = None,
**kwargs,
):
"""Save a Ludwig model to a path on the local file system.
:param ludwig_model: Ludwig model (an instance of `ludwig.api.LudwigModel`_) to be saved.
:param path: Local path where the model is to be saved.
:param conda_env: Either a dictionary representation of a Conda environment or the path to a
Conda environment yaml file. If provided, this describes the environment
this model should be run in. At minimum, it should specify the dependencies
contained in :func:`get_default_conda_env()`. If ``None``, the default
:func:`get_default_conda_env()` environment is added to the model.
The following is an *example* dictionary representation of a Conda
environment::
{
'name': 'mlflow-env',
'channels': ['defaults'],
'dependencies': [
'python=3.7.0',
{
'pip': [
'ludwig==0.4.0'
]
}
]
}
:param mlflow_model: :py:mod:`mlflow.models.Model` this flavor is being added to.
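:param signature: (Experimental) :py:class:`ModelSignature <mlflow.models.signature.ModelSignature>`
describes model input and output :py:class:`Schema <mlflow.types.Schema>`.
The model signature can be :py:func:`inferred <mlflow.models.signature.infer_signature>`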
from datasets with valid model input (e.g. the training dataset with target
column omitted) and valid model output (e.g. model predictions generated on
the training dataset), for example:
.. code-block:: python
from mlflow.models.signature import infer_signature
train = df.drop(columns=["target_label"])
predictions = ... # compute model predictions
signature = infer_signature(train, predictions)
:param input_example: (Experimental) Input example provides one or several instances of valid
model input. The example can be used as a hint of what data to feed the
model. The given example will be converted to a Pandas DataFrame and then
serialized to json using the Pandas split-oriented format. Bytes are
base64-encoded.
"""
import ludwig
path = os.path.abspath(path)
if os.path.exists(path):
raise MlflowException(f"Path '{path}' already exists")
model_data_subpath = MODEL_FILE_NAME
model_data_path = os.path.join(path, model_data_subpath)
os.makedirs(path)
if mlflow_model is None:
mlflow_model = Model()
if signature is not None:
mlflow_model.signature = signature
if input_example is not None:
_save_example(mlflow_model, input_example, path)
# Save the Ludwig model
ludwig_model.save(model_data_path)
conda_env_subpath = "conda.yaml"
if conda_env is None:
conda_env = get_default_conda_env()
elif not isinstance(conda_env, dict):
with open(conda_env) as f:
conda_env = yaml.safe_load(f)
with open(os.path.join(path, conda_env_subpath), "w") as f:
yaml.safe_dump(conda_env, stream=f, default_flow_style=False)
pyfunc.add_to_model(
mlflow_model,
loader_module="ludwig.contribs.mlflow.model",
data=model_data_subpath,
env=conda_env_subpath,
)
schema_keys = {"name", "column", "type"}
config = ludwig_model.config
mlflow_model.add_flavor(
FLAVOR_NAME,
ludwig_version=ludwig.__version__,
ludwig_schema={
"input_features": [
{k: v for k, v in feature.items() if k in schema_keys} for feature in config["input_features"]
],
"output_features": [
{k: v for k, v in feature.items() if k in schema_keys} for feature in config["output_features"]
],
},
data=model_data_subpath,
)
mlflow_model.save(os.path.join(path, MLMODEL_FILE_NAME))
def log_model(
ludwig_model,
artifact_path,
conda_env=None,
registered_model_name=None,
signature: ModelSignature = None,
input_example: ModelInputExample = None,
await_registration_for=DEFAULT_AWAIT_MAX_SLEEP_SECONDS,
):
"""Log a Ludwig model as an MLflow artifact for the current run.
Saves the model locally in MLflow format, then logs it as a run artifact using mlflow.log_artifacts(). This ensures
the model appears as a run artifact (compatible with MLflow 3.x where Model.log() uses the model registry instead).
"""
with tempfile.TemporaryDirectory() as tmpdir:
local_path = os.path.join(tmpdir, "model")
save_model(
ludwig_model,
path=local_path,
conda_env=conda_env,
signature=signature,
input_example=input_example,
)
mlflow.log_artifacts(local_path, artifact_path)
if registered_model_name is not None:
run_id = mlflow.active_run().info.run_id
mlflow.register_model(
f"runs:/{run_id}/{artifact_path}",
registered_model_name,
await_registration_for=await_registration_for,
)
def _load_model(path):
from ludwig.api import LudwigModel
return LudwigModel.load(path, backend="local")
def _load_pyfunc(path):
"""Load PyFunc implementation. Called by ``pyfunc.load_pyfunc``.
:param path: Local filesystem path to the MLflow Model with the ``ludwig`` flavor.
"""
return _LudwigModelWrapper(_load_model(path))
def load_model(model_uri):
"""Load a Ludwig model from a local file or a run.
:param model_uri: The location, in URI format, of the MLflow model. For example:
- ``/Users/me/path/to/local/model``
- ``relative/path/to/local/model``
- ``s3://my_bucket/path/to/model``
- ``runs:/<mlflow_run_id>/run-relative/path/to/model``
For more information about supported URI schemes, see
"Referencing Artifacts" in the MLflow tracking documentation.
:return: A Ludwig model (an instance of `ludwig.api.LudwigModel`_).
"""
local_model_path = _download_artifact_from_uri(artifact_uri=model_uri)
flavor_conf = _get_flavor_configuration(model_path=local_model_path, flavor_name=FLAVOR_NAME)
model_data_path = os.path.join(local_model_path, flavor_conf.get("data", "model"))
return _load_model(path=model_data_path)
class _LudwigModelWrapper:
def __init__(self, ludwig_model):
self.ludwig_model = ludwig_model
def predict(self, dataframe):
pred_df, _ = self.ludwig_model.predict(dataframe)
return pred_df
def export_model(model_path, output_path, registered_model_name=None):
if registered_model_name:
if not model_path.startswith("runs:/") or output_path is not None:
# No run specified, so in order to register the model in mlflow, we need
# to create a new run and upload the model as an artifact first
output_path = output_path or MODEL_FILE_NAME
log_model(
_CopyModel(model_path),
artifact_path=output_path,
registered_model_name=registered_model_name,
)
else:
# Registering a model from an artifact of an existing run
mlflow.register_model(
model_path,
registered_model_name,
)
else:
# No model name means we only want to save the model locally
save_model(
_CopyModel(model_path),
path=output_path,
)
@DeveloperAPI
def log_saved_model(lpath):
"""Log a saved Ludwig model directory as a proper MLflow model artifact."""
if os.path.isdir(lpath):
log_model(
_CopyModel(lpath),
artifact_path="model",
)
elif os.path.isfile(lpath):
mlflow.log_artifact(lpath, "model")
class _CopyModel:
"""Get model data without requiring us to read the model weights into memory."""
def __init__(self, lpath):
self.lpath = lpath
def save(self, path):
shutil.copytree(self.lpath, path)
@property
def config(self):
return load_json(os.path.join(self.lpath, MODEL_HYPERPARAMETERS_FILE_NAME))
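# Illustrative sketch (not part of the module): how save_model() appears to reduce each
# feature definition to the keys stored under the ``ludwig`` flavor. The config below is
# hypothetical, and extract_schema() is a name introduced only for this example.

```python
SCHEMA_KEYS = {"name", "column", "type"}


def extract_schema(config):
    """Keep only the schema keys of every input and output feature."""

    def keep(features):
        return [{k: v for k, v in feature.items() if k in SCHEMA_KEYS} for feature in features]

    return {
        "input_features": keep(config["input_features"]),
        "output_features": keep(config["output_features"]),
    }


# Hypothetical config, for illustration only.
example_config = {
    "input_features": [
        {"name": "review", "column": "review", "type": "text", "encoder": {"type": "parallel_cnn"}}
    ],
    "output_features": [{"name": "label", "column": "label", "type": "category"}],
}
example_schema = extract_schema(example_config)
print(example_schema)
```

# Encoder settings and other per-feature options are dropped, so the flavor entry in the
# MLmodel file only describes which columns the model reads and writes.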
================================================
FILE: ludwig/contribs/wandb.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import os
from ludwig.api_annotations import PublicAPI
from ludwig.callbacks import Callback
from ludwig.utils.package_utils import LazyLoader
wandb = LazyLoader("wandb", globals(), "wandb")
logger = logging.getLogger(__name__)
@PublicAPI
class WandbCallback(Callback):
"""Callback that hooks Weights & Biases logging into the Ludwig training process."""
def on_train_init(
self,
base_config,
experiment_directory,
experiment_name,
model_name,
output_directory,
resume_directory,
):
logger.info("wandb.on_train_init() called...")
wandb.init(
project=os.getenv("WANDB_PROJECT", experiment_name),
name=model_name,
sync_tensorboard=True,
dir=output_directory,
)
wandb.save(os.path.join(experiment_directory, "*"))
def on_train_start(self, model, config, *args, **kwargs):
logger.info("wandb.on_train_start() called...")
config = config.copy()
del config["input_features"]
del config["output_features"]
wandb.config.update(config)
def on_eval_end(self, trainer, progress_tracker, save_path):
"""Called from ludwig/models/model.py."""
for key, value in progress_tracker.log_metrics().items():
wandb.log({key: value})
def on_epoch_end(self, trainer, progress_tracker, save_path):
"""Called from ludwig/models/model.py."""
for key, value in progress_tracker.log_metrics().items():
wandb.log({key: value})
def on_visualize_figure(self, fig):
logger.info("wandb.on_visualize_figure() called...")
if wandb.run:
wandb.log({"figure": fig})
def on_train_end(self, output_directory):
wandb.finish()
================================================
FILE: ludwig/data/__init__.py
================================================
================================================
FILE: ludwig/data/batcher/__init__.py
================================================
================================================
FILE: ludwig/data/batcher/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from abc import ABC, abstractmethod
import numpy as np
class Batcher(ABC):
@abstractmethod
def next_batch(self) -> dict[str, np.ndarray]:
raise NotImplementedError()
@abstractmethod
def last_batch(self) -> bool:
raise NotImplementedError()
@abstractmethod
def set_epoch(self, epoch: int, batch_size: int):
raise NotImplementedError()
================================================
FILE: ludwig/data/batcher/bucketed.py
================================================
#! /usr/bin/env python
# Copyright (c) 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import numpy as np
from ludwig.data.batcher.base import Batcher
class BucketedBatcher(Batcher):
def __init__(
self,
dataset,
bucketing_field,
batch_size=128,
buckets=10,
should_shuffle=True,
ignore_last=False,
should_trim=False,
trim_side="right",
):
self.should_shuffle = should_shuffle
self.bucketing_field = bucketing_field
self.should_trim = should_trim
self.trim_side = trim_side
# store our dataset as well
self.dataset = dataset
field = dataset.get_dataset()[bucketing_field]
field_lengths = np.apply_along_axis(lambda x: np.sign(x).sum(), 1, field)
sorted_idcs = np.argsort(field_lengths)
self.buckets_idcs = []
datapoints_per_bucket = len(field) // buckets
for b in range(buckets):
start = datapoints_per_bucket * b
end = datapoints_per_bucket * (b + 1) if b < buckets - 1 else len(sorted_idcs)
self.buckets_idcs.append(sorted_idcs[start:end])
if should_shuffle:
self.shuffle(self.buckets_idcs)
self.ignore_last = ignore_last
self.batch_size = batch_size
self.total_size = min(map(len, dataset.get_dataset().values()))
self.bucket_sizes = np.array([len(bucket) for bucket in self.buckets_idcs])
self.steps_per_epoch = self._compute_steps_per_epoch()
self.indices = np.array([0] * buckets)
self.step = 0
self.epoch = 0
def shuffle(self, buckets_idcs):
for i in range(len(buckets_idcs)):
np.random.shuffle(buckets_idcs[i])
def next_batch(self):
if self.last_batch():
if self.should_shuffle:
self.shuffle(self.buckets_idcs)
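# set_epoch() requires a batch size; keep the current one when rolling over to the next epoch.
self.set_epoch(self.epoch + 1, self.batch_size)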
if self.ignore_last:
idcs_below_size = self.indices + self.batch_size < self.bucket_sizes
else:
idcs_below_size = self.indices < self.bucket_sizes
i = np.random.choice(np.arange(0, len(self.buckets_idcs))[idcs_below_size])
selected_bucket = self.buckets_idcs[i]
selected_idcs = selected_bucket[self.indices[i] : self.indices[i] + self.batch_size]
sub_batch = {}
for key in self.dataset.get_dataset():
if key == self.bucketing_field and self.should_trim:
selected_samples = self.dataset.get(key, selected_idcs)
max_length = np.sign(selected_samples).sum(axis=1).max()
if self.trim_side == "right":
sub_batch[key] = selected_samples[:, :max_length]
elif self.trim_side == "left":
sub_batch[key] = selected_samples[:, -max_length:]
else:
raise ValueError(f"Invalid trim side: {self.trim_side}")
else:
sub_batch[key] = self.dataset.get(key, selected_idcs)
self.indices[i] += self.batch_size
self.step += 1
return sub_batch
def last_batch(self):
return not np.any(self.indices < self.bucket_sizes) or (
self.ignore_last and not np.any(self.indices + self.batch_size < self.bucket_sizes)
)
def set_epoch(self, epoch, batch_size):
self.indices = np.array([0] * len(self.buckets_idcs))
self.step = 0
self.epoch = epoch
self.batch_size = batch_size
self.steps_per_epoch = self._compute_steps_per_epoch()
def _compute_steps_per_epoch(self) -> int:
return int(np.sum(np.ceil(self.bucket_sizes / self.batch_size)).item())
# dynamic_length_encoders = {
# 'rnn',
# 'embed'
# }
#
# todo future: reintroduce the bucketed batcher
# def initialize_batcher(dataset, batch_size=128, bucketing_field=None,
# input_features=None, preprocessing=None,
# should_shuffle=True, ignore_last=False):
# if bucketing_field is not None:
# bucketing_feature = [
# feature for feature in input_features if
# feature[NAME] == bucketing_field
# ]
# if not bucketing_feature:
# raise ValueError(
# 'Bucketing field {} not present in input features'.format(
# bucketing_field
# )
# )
# else:
# bucketing_feature = bucketing_feature[0]
# should_trim = bucketing_feature[
# 'encoder'] in dynamic_length_encoders
# if 'preprocessing' in bucketing_feature:
# trim_side = bucketing_feature['preprocessing']['padding']
# else:
# trim_side = preprocessing[bucketing_feature[TYPE]]['padding']
#
# batcher = BucketedBatcher(
# dataset,
# bucketing_field=bucketing_field,
# batch_size=batch_size,
# buckets=10,
# ignore_last=ignore_last,
# should_shuffle=should_shuffle,
# should_trim=should_trim,
# trim_side=trim_side
# )
# else:
# batcher = Batcher(
# dataset,
# batch_size,
# should_shuffle=should_shuffle,
# ignore_last=ignore_last
# )
# return batcher
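# Illustrative sketch (not part of the module): the bucketing scheme used by BucketedBatcher,
# written in plain Python instead of NumPy. It assumes the per-sample lengths are already
# known; split_into_buckets() and steps_per_epoch() are names introduced for this example.

```python
import math


def split_into_buckets(lengths, buckets):
    """Return `buckets` lists of sample indices, ordered by ascending length."""
    sorted_idcs = sorted(range(len(lengths)), key=lambda i: lengths[i])
    per_bucket = len(lengths) // buckets
    result = []
    for b in range(buckets):
        start = per_bucket * b
        # The last bucket absorbs whatever is left after integer division.
        end = per_bucket * (b + 1) if b < buckets - 1 else len(sorted_idcs)
        result.append(sorted_idcs[start:end])
    return result


def steps_per_epoch(bucket_sizes, batch_size):
    """Each bucket is batched on its own, so partial batches are counted per bucket."""
    return sum(math.ceil(size / batch_size) for size in bucket_sizes)


example_buckets = split_into_buckets([5, 1, 3, 2, 4, 6, 7], buckets=3)
example_steps = steps_per_epoch([len(b) for b in example_buckets], batch_size=2)
print(example_buckets, example_steps)
```

# Because every bucket can end with a partial batch, the step count can exceed
# ceil(total_size / batch_size): here 7 samples with batch size 2 give 4 steps either way,
# but 3 buckets of 3 samples each would give 6 steps rather than 5.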
================================================
FILE: ludwig/data/batcher/iterable.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from ludwig.data.batcher.base import Batcher
class IterableBatcher(Batcher):
def __init__(self, dataset, data, steps_per_epoch, ignore_last=False):
self.dataset = dataset
self.data = data
self.data_it = iter(data)
self.ignore_last = ignore_last
self.steps_per_epoch = steps_per_epoch
self.step = 0
def next_batch(self):
if self.last_batch():
raise StopIteration()
sub_batch = {}
batch = next(self.data_it)
for features_name in self.dataset.features:
sub_batch[features_name] = self.dataset.get(features_name, batch)
self.step += 1
return sub_batch
def last_batch(self):
return self.step >= self.steps_per_epoch or (self.ignore_last and self.step + 1 >= self.steps_per_epoch)
def set_epoch(self, epoch, batch_size):
# TODO ray: implement dynamic batch size
self.step = 0
================================================
FILE: ludwig/data/batcher/random_access.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import math
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.data.batcher.base import Batcher
logger = logging.getLogger(__name__)
@DeveloperAPI
class RandomAccessBatcher(Batcher):
def __init__(self, dataset, sampler, batch_size=128, ignore_last=False, augmentation_pipeline=None):
# store our dataset as well
self.dataset = dataset
self.sampler = sampler
self.sample_it = iter(self.sampler)
self.ignore_last = ignore_last
self.batch_size = batch_size
self.total_size = len(sampler)
self.augmentation_pipeline = augmentation_pipeline
self.steps_per_epoch = self._compute_steps_per_epoch()
self.index = 0
self.step = 0
def next_batch(self):
if self.last_batch():
raise StopIteration()
indices = []
for _ in range(self.batch_size):
try:
indices.append(next(self.sample_it))
self.index += 1
except StopIteration:
break
sub_batch = {feature_name: self.dataset.get(feature_name, indices) for feature_name in self.dataset.features}
if self.augmentation_pipeline:
for feature_name, augmentations in self.augmentation_pipeline.items():
logger.debug(f"RandomAccessBatcher applying augmentation pipeline to batch for feature {feature_name}")
sub_batch[feature_name] = augmentations(torch.tensor(sub_batch[feature_name]))
self.step += 1
return sub_batch
def last_batch(self):
"""Returns whether we've exhausted all batches for this epoch.
If False, then there is at least 1 more batch available with next_batch().
"""
# If our current index in the dataset exceeds the size of the dataset,
# we've finished the epoch and can indicate that this is the last batch
if self.index >= self.total_size:
return True
# This avoids the case where batch size > total size and no steps have been done.
# For example, batch size = 128 but the dataset only has 100 rows.
elif self.ignore_last and self.step:
# index advances by one per sampled row, i.e. by up to batch_size per batch. So, if our current index
# is 1 less than the total dataset size, then the last batch will only have 1 row.
# If this happens, we drop the last batch, unless batch_size is 1.
if self.batch_size > 1 and self.index - self.total_size == -1:
logger.info("Last batch in epoch only has 1 sample and will be dropped.")
return True
return False
def set_epoch(self, epoch, batch_size):
self.batch_size = batch_size
self.steps_per_epoch = self._compute_steps_per_epoch()
self.index = 0
self.step = 0
self.sampler.set_epoch(epoch)
self.sample_it = iter(self.sampler)
def _compute_steps_per_epoch(self):
return int(math.ceil(self.total_size / self.batch_size))
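# Illustrative sketch (not part of the module): a stand-alone simulation of the
# next_batch() / last_batch() bookkeeping in RandomAccessBatcher. It tracks only the index
# and step counters, with no dataset or sampler; simulate_batch_sizes() is a name
# introduced for this example.

```python
def simulate_batch_sizes(total_size, batch_size, ignore_last=False):
    """Return the size of every batch produced in one epoch."""
    index = 0
    step = 0
    sizes = []

    def last_batch():
        if index >= total_size:
            return True
        if ignore_last and step:
            # Drop a trailing batch that would hold exactly one sample.
            if batch_size > 1 and index - total_size == -1:
                return True
        return False

    while not last_batch():
        taken = min(batch_size, total_size - index)
        index += taken
        step += 1
        sizes.append(taken)
    return sizes


print(simulate_batch_sizes(5, 2))
print(simulate_batch_sizes(5, 2, ignore_last=True))
print(simulate_batch_sizes(100, 128, ignore_last=True))
```

# With ignore_last=True only a trailing batch of exactly one sample is dropped; a dataset
# smaller than the batch size still yields its single, short batch.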
================================================
FILE: ludwig/data/batcher/test_batcher.py
================================================
import logging
import pandas as pd
import yaml
from ludwig.api import LudwigModel
from ludwig.data.dataset.pandas import PandasDataset
def test_pandas_size():
df = pd.DataFrame(
{"name": ["joe", "janice", "sara"], "mask": ["green", "black", "pink"], "weapon": ["stick", "gun", "gun"]}
)
config = yaml.safe_load("""
model_type: llm
base_model: HuggingFaceH4/tiny-random-LlamaForCausalLM
input_features:
- name: name
type: text
preprocessing:
max_sequence_length: 256
column: name
output_features:
- name: weapon
type: text
preprocessing:
max_sequence_length: 256
column: weapon
preprocessing:
split:
type: random
probabilities:
- 1
- 0
- 0
""")
model = LudwigModel(config=config, logging_level=logging.INFO)
data = model.preprocess(df, skip_save_processed_input=False)
training_set = data[0]
assert training_set.size == len(df)
# Check if string loading works as well
# data[0].data_hdf5_fp is the string filepath to the cached data from preprocessing
data_from_str = PandasDataset(data[0].data_hdf5_fp, data[0].features, None)
assert data_from_str.size == len(df)
def test_pandas_batcher_use_all_samples():
df = pd.DataFrame(
{"name": ["joe", "janice", "sara"], "mask": ["green", "black", "pink"], "weapon": ["stick", "gun", "gun"]}
)
config = yaml.safe_load("""
model_type: llm
base_model: HuggingFaceH4/tiny-random-LlamaForCausalLM
input_features:
- name: name
type: text
preprocessing:
max_sequence_length: 256
column: name
output_features:
- name: weapon
type: text
preprocessing:
max_sequence_length: 256
column: weapon
preprocessing:
split:
type: random
probabilities:
- 1
- 0
- 0
""")
model = LudwigModel(config=config, logging_level=logging.INFO)
data = model.preprocess(df, skip_save_processed_input=False)
training_set = data[0]
features = training_set.dataset.keys()
batches = []
with training_set.initialize_batcher(batch_size=1) as batcher:
while not batcher.last_batch():
batch = batcher.next_batch()
batches.append(batch)
assert (len(batches)) == training_set.size
# Check to see if all items are used exactly once
for feature in features:
for i in range(len(training_set.dataset[feature])):
# Each of the arrays in the line below should contain the vector representation of a feature of sample i
assert (batches[i][feature].squeeze() == training_set.dataset[feature][i].squeeze()).all()
# Check if string loading works as well
batches = []
# data[0].data_hdf5_fp is the string filepath to the cached data from preprocessing
data_from_str = PandasDataset(data[0].data_hdf5_fp, data[0].features, None)
features = data_from_str.dataset.keys()
with data_from_str.initialize_batcher(batch_size=1) as batcher:
while not batcher.last_batch():
batch = batcher.next_batch()
batches.append(batch)
assert (len(batches)) == data_from_str.size
# Check to see if all items are used exactly once
for feature in features:
for i in range(len(data_from_str.dataset[feature])):
# Each of the arrays in the line below should contain the vector representation of a feature of sample i
assert (batches[i][feature].squeeze() == data_from_str.dataset[feature][i].squeeze()).all()
================================================
FILE: ludwig/data/cache/__init__.py
================================================
================================================
FILE: ludwig/data/cache/manager.py
================================================
import logging
import os
from ludwig.constants import CHECKSUM, META, TEST, TRAINING, VALIDATION
from ludwig.data.cache.types import alphanum, CacheableDataset
from ludwig.data.cache.util import calculate_checksum
from ludwig.data.dataset.base import DatasetManager
from ludwig.utils import data_utils
from ludwig.utils.fs_utils import delete, path_exists
logger = logging.getLogger(__name__)
class DatasetCache:
def __init__(self, config, checksum, cache_map, dataset_manager):
self.config = config
self.checksum = checksum
self.cache_map = cache_map
self.dataset_manager = dataset_manager
def get(self):
training_set_metadata_fp = self.cache_map[META]
if not path_exists(training_set_metadata_fp):
return None
try:
cached_training_set_metadata = data_utils.load_json(training_set_metadata_fp)
except Exception:
logger.exception(f"Failed to load cached training set metadata at {training_set_metadata_fp}")
return None
cached_training_set = self.cache_map[TRAINING] if path_exists(self.cache_map[TRAINING]) else None
if not cached_training_set:
logger.warning(f"Failed to load cached training set at {self.cache_map[TRAINING]}")
cached_validation_set = self.cache_map[VALIDATION] if path_exists(self.cache_map[VALIDATION]) else None
if not cached_validation_set:
logger.warning(f"Failed to load cached validation set at {self.cache_map[VALIDATION]}")
cached_test_set = self.cache_map[TEST] if path_exists(self.cache_map[TEST]) else None
if not cached_test_set:
logger.warning(f"Failed to load cached test set at {self.cache_map[TEST]}")
valid = self.checksum == cached_training_set_metadata.get(CHECKSUM) and cached_training_set is not None
return valid, cached_training_set_metadata, cached_training_set, cached_test_set, cached_validation_set
def put(self, training_set, test_set, validation_set, training_set_metadata):
logger.info(f"Writing preprocessed training set cache to {self.cache_map[TRAINING]}")
training_set = self.dataset_manager.save(
self.cache_map[TRAINING],
training_set,
self.config,
training_set_metadata,
TRAINING,
)
if validation_set is not None:
logger.info(f"Writing preprocessed validation set cache to {self.cache_map[VALIDATION]}")
validation_set = self.dataset_manager.save(
self.cache_map[VALIDATION],
validation_set,
self.config,
training_set_metadata,
VALIDATION,
)
if test_set is not None:
logger.info(f"Writing preprocessed test set cache to {self.cache_map[TEST]}")
test_set = self.dataset_manager.save(
self.cache_map[TEST],
test_set,
self.config,
training_set_metadata,
TEST,
)
logger.info(f"Writing train set metadata to {self.cache_map[META]}")
data_utils.save_json(self.cache_map[META], training_set_metadata)
return training_set, test_set, validation_set, training_set_metadata
def delete(self):
for fname in self.cache_map.values():
if path_exists(fname):
# Parquet entries in the cache_map can be pointers to directories.
delete(fname, recursive=True)
def get_cached_obj_path(self, cached_obj_name: str) -> str:
return self.cache_map.get(cached_obj_name)
class CacheManager:
def __init__(
self,
dataset_manager: DatasetManager,
cache_dir: str | None = None,
):
self._dataset_manager = dataset_manager
self._cache_dir = cache_dir
def get_dataset_cache(
self,
config: dict,
dataset: CacheableDataset | None = None,
training_set: CacheableDataset | None = None,
test_set: CacheableDataset | None = None,
validation_set: CacheableDataset | None = None,
) -> DatasetCache:
if dataset is not None:
key = self.get_cache_key(dataset, config)
cache_map = {
META: self.get_cache_path(dataset, key, META, "json"),
TRAINING: self.get_cache_path(dataset, key, TRAINING),
TEST: self.get_cache_path(dataset, key, TEST),
VALIDATION: self.get_cache_path(dataset, key, VALIDATION),
}
return DatasetCache(config, key, cache_map, self._dataset_manager)
else:
key = self.get_cache_key(training_set, config)
cache_map = {
META: self.get_cache_path(training_set, key, META, "json"),
TRAINING: self.get_cache_path(training_set, key, TRAINING),
TEST: self.get_cache_path(test_set, key, TEST),
VALIDATION: self.get_cache_path(validation_set, key, VALIDATION),
}
return DatasetCache(config, key, cache_map, self._dataset_manager)
def get_cache_key(self, dataset: CacheableDataset, config: dict) -> str:
return calculate_checksum(dataset, config)
def get_cache_path(self, dataset: CacheableDataset | None, key: str, tag: str, ext: str | None = None) -> str:
if self._cache_dir is None and dataset is not None:
# Use the input dataset filename (minus the extension) as the cache path
stem = dataset.get_cache_path()
else:
# To avoid collisions across different directories, we use the unique checksum
# as the cache path
stem = alphanum(key)
ext = ext or self.data_format
cache_fname = f"{stem}.{tag}.{ext}"
return os.path.join(self.get_cache_directory(dataset), cache_fname)
def get_cache_directory(self, dataset: CacheableDataset | None) -> str:
if self._cache_dir is None:
if dataset is None:
return os.getcwd()
return dataset.get_cache_directory()
return self._cache_dir
def can_cache(self, skip_save_processed_input: bool) -> bool:
return self._dataset_manager.can_cache(skip_save_processed_input)
@property
def data_format(self) -> str:
return self._dataset_manager.data_format
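# Illustrative sketch (not part of the module): the "<stem>.<tag>.<ext>" naming used by
# CacheManager.get_cache_path() when the checksum serves as the stem. The tag and extension
# values below ("training", "meta", "parquet") are assumptions chosen for the example; the
# real values come from ludwig.constants and the dataset manager. cache_path() is a
# hypothetical helper.

```python
import os
import re


def alphanum(v):
    """Strip every character that is not a letter, digit, or underscore."""
    return re.sub(r"\W+", "", v)


def cache_path(cache_dir, key, tag, ext):
    """Build the cache file path from a checksum-like key."""
    return os.path.join(cache_dir, f"{alphanum(key)}.{tag}.{ext}")


example_training_path = cache_path("cache", "abc-123==", "training", "parquet")
example_meta_path = cache_path("cache", "abc-123==", "meta", "json")
print(example_training_path)
print(example_meta_path)
```

# All files of one cache entry share the same stem, which is what lets DatasetCache.delete()
# and DatasetCache.get() address them through a single cache_map.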
================================================
FILE: ludwig/data/cache/types.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import re
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Union
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.fs_utils import checksum
from ludwig.utils.types import DataFrame
def alphanum(v):
"""Filters a string to only its word characters (letters, digits, and underscores)."""
return re.sub(r"\W+", "", v)
@DeveloperAPI
class CacheableDataset(ABC):
name: str
checksum: str
@abstractmethod
def get_cache_path(self) -> str:
raise NotImplementedError()
@abstractmethod
def get_cache_directory(self) -> str:
raise NotImplementedError()
@abstractmethod
def unwrap(self) -> str | DataFrame:
raise NotImplementedError()
@DeveloperAPI
@dataclass
class CacheableDataframe(CacheableDataset):
df: DataFrame
name: str
checksum: str
def get_cache_path(self) -> str:
return alphanum(self.name)
def get_cache_directory(self) -> str:
return os.getcwd()
def unwrap(self) -> str | DataFrame:
return self.df
@DeveloperAPI
@dataclass
class CacheablePath(CacheableDataset):
path: str
@property
def name(self) -> str:
return Path(self.path).stem
@property
def checksum(self) -> str:
return checksum(self.path)
def get_cache_path(self) -> str:
return self.name
def get_cache_directory(self) -> str:
return os.path.dirname(self.path)
def unwrap(self) -> str | DataFrame:
return self.path
CacheInput = Union[str, DataFrame, CacheableDataset]
def wrap(dataset: CacheInput | None) -> CacheableDataset | None:
if dataset is None:
return None
if isinstance(dataset, CacheableDataset):
return dataset
if isinstance(dataset, str):
return CacheablePath(path=dataset)
# TODO(travis): could try hashing the in-memory dataset, but this is tricky for Dask
checksum = str(uuid.uuid1())
name = checksum
return CacheableDataframe(df=dataset, name=name, checksum=checksum)
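# Illustrative sketch (not part of the module): how the file stem and the in-memory fallback checksum
# above behave. `alphanum` is copied from this file; `placeholder_checksum` is a hypothetical name
# for the uuid1 fallback used in `wrap`, introduced only for this example.

```python
import re
import uuid


def alphanum(v: str) -> str:
    """Keep only word characters; the pattern \\W+ strips everything except letters, digits, underscore."""
    return re.sub(r"\W+", "", v)


def placeholder_checksum() -> str:
    # Hypothetical helper: mirrors the uuid1-based fallback used for in-memory DataFrames.
    # Each call yields a new value, so two wraps of the same DataFrame never share a cache key.
    return str(uuid.uuid1())


stem = alphanum("my-data set_1.csv")
first, second = placeholder_checksum(), placeholder_checksum()
print(stem)
print(first != second)
```

# Because the fallback checksum is random, an in-memory DataFrame wrapped twice is expected to miss
# the cache, whereas a path-based dataset is keyed by the checksum of the file it points to.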
================================================
FILE: ludwig/data/cache/util.py
================================================
import ludwig
from ludwig.constants import DEFAULTS, INPUT_FEATURES, OUTPUT_FEATURES, PREPROCESSING, PROC_COLUMN, TYPE
from ludwig.data.cache.types import CacheableDataset
from ludwig.types import ModelConfigDict
from ludwig.utils.data_utils import hash_dict
def calculate_checksum(original_dataset: CacheableDataset, config: ModelConfigDict):
"""Calculates a checksum for a dataset and model config.
The checksum is used to determine if the dataset and model config have changed since the last time the model was
trained. If either has changed, a different checksum will be produced which will lead to a cache miss and force
preprocessing to be performed again.
"""
features = config.get(INPUT_FEATURES, []) + config.get(OUTPUT_FEATURES, []) + config.get("features", [])
info = {
"ludwig_version": ludwig.globals.LUDWIG_VERSION,
"dataset_checksum": original_dataset.checksum,
"global_preprocessing": config.get(PREPROCESSING, {}),
"global_defaults": config.get(DEFAULTS, {}),
# PROC_COLUMN contains both the feature name and the feature hash that is computed
# based on each feature's preprocessing parameters and the feature's type.
# Build a sorted list from the set because hash_dict requires every value
# to be an ordered object, so that the same hash is produced each time.
"feature_proc_columns": sorted({feature[PROC_COLUMN] for feature in features}),
"feature_types": [feature[TYPE] for feature in features],
"feature_preprocessing": [feature.get(PREPROCESSING, {}) for feature in features],
}
# LLM-specific params
if "prompt" in config:
info["prompt"] = config["prompt"]
return hash_dict(info, max_length=None).decode("ascii")
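# Illustrative sketch (not part of the module): the idea behind `calculate_checksum`, shown with the
# standard library only. `sketch_cache_key` is a hypothetical name, and SHA-256 over sorted JSON is a
# stand-in chosen for this example; the internals of `hash_dict` are not shown in this file. The
# string keys ("input_features", "type", ...) are assumed values of the imported constants.

```python
import hashlib
import json


def sketch_cache_key(dataset_checksum: str, config: dict, version: str = "0.0.0") -> str:
    """Hash everything that affects preprocessing into one deterministic key."""
    features = config.get("input_features", []) + config.get("output_features", [])
    info = {
        "version": version,
        "dataset_checksum": dataset_checksum,
        "global_preprocessing": config.get("preprocessing", {}),
        "feature_types": [f["type"] for f in features],
        "feature_preprocessing": [f.get("preprocessing", {}) for f in features],
    }
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(info, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


base = {"input_features": [{"name": "text", "type": "text"}], "output_features": []}
changed = {"input_features": [{"name": "text", "type": "text", "preprocessing": {"lowercase": True}}],
           "output_features": []}

print(sketch_cache_key("abc", base) == sketch_cache_key("abc", base))
print(sketch_cache_key("abc", base) == sketch_cache_key("abc", changed))
```

# Any change to the dataset checksum, the version, or a preprocessing parameter yields a different
# key, which is what forces a cache miss and a fresh preprocessing run.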
================================================
FILE: ludwig/data/concatenate_datasets.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import numpy as np
from ludwig.backend import LOCAL_BACKEND
from ludwig.constants import SPLIT
from ludwig.utils.data_utils import read_csv
logger = logging.getLogger(__name__)
def concatenate_csv(train_csv, vali_csv, test_csv, output_csv):
concatenated_df = concatenate_files(train_csv, vali_csv, test_csv, read_csv, LOCAL_BACKEND)
logger.info("Saving concatenated dataset as csv..")
concatenated_df.to_csv(output_csv, encoding="utf-8", index=False)
logger.info("done")
def concatenate_files(train_fname, vali_fname, test_fname, read_fn, backend):
df_lib = backend.df_engine.df_lib
logger.info("Loading training file...")
train_df = read_fn(train_fname, df_lib)
logger.info("done")
logger.info("Loading validation file..")
vali_df = read_fn(vali_fname, df_lib) if vali_fname is not None else None
logger.info("done")
logger.info("Loading test file..")
test_df = read_fn(test_fname, df_lib) if test_fname is not None else None
logger.info("done")
logger.info("Concatenating files..")
concatenated_df = concatenate_df(train_df, vali_df, test_df, backend)
logger.info("done")
return concatenated_df
def concatenate_df(train_df, vali_df, test_df, backend):
train_size = len(train_df)
vali_size = len(vali_df) if vali_df is not None else 0
concatenated_df = backend.df_engine.df_lib.concat(
[df for df in [train_df, vali_df, test_df] if df is not None], ignore_index=True
)
def get_split(idx):
if idx < train_size:
return 0
if idx < train_size + vali_size:
return 1
return 2
concatenated_df[SPLIT] = concatenated_df.index.to_series().map(get_split).astype(np.int8)
return concatenated_df
def concatenate_splits(train_df, vali_df, test_df, backend):
def to_frame(df, split):
if df is None:
return None
df = df.index.to_frame(name=SPLIT)
df[SPLIT] = split
return df
dfs = [train_df, vali_df, test_df]
dfs = [to_frame(df, split) for split, df in enumerate(dfs)]
return backend.df_engine.df_lib.concat([df for df in dfs if df is not None])
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Concatenate train, validation, and test sets")
parser.add_argument("-train", "--train_csv", help="CSV containing the training set")
parser.add_argument("-vali", "--vali_csv", help="CSV containing the validation set")
parser.add_argument("-test", "--test_csv", help="CSV containing the test set")
parser.add_argument("-o", "--output_csv", help="output csv")
args = parser.parse_args()
concatenate_csv(args.train_csv, args.vali_csv, args.test_csv, args.output_csv)
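# Illustrative sketch (not part of the module): the positional split labelling used by
# `concatenate_df`, written without pandas. `split_labels` is a hypothetical name; it assumes rows
# are ordered train, then validation, then test, as they are after the concat above.

```python
def split_labels(train_size: int, vali_size: int, test_size: int) -> list[int]:
    """Return 0 for training rows, 1 for validation rows, 2 for test rows."""

    def get_split(idx: int) -> int:
        if idx < train_size:
            return 0
        if idx < train_size + vali_size:
            return 1
        return 2

    total = train_size + vali_size + test_size
    return [get_split(i) for i in range(total)]


print(split_labels(3, 2, 1))
print(split_labels(3, 0, 2))
```

# When the validation set is missing its size is 0, so the second condition never matches and every
# row after the training rows is labelled as test.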
================================================
FILE: ludwig/data/dataframe/__init__.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
================================================
FILE: ludwig/data/dataframe/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from abc import ABC, abstractmethod
from ludwig.utils.types import DataFrame
class DataFrameEngine(ABC):
@abstractmethod
def df_like(self, df, proc_cols):
raise NotImplementedError()
@abstractmethod
def parallelize(self, data):
raise NotImplementedError()
@abstractmethod
def persist(self, data):
raise NotImplementedError()
@abstractmethod
def compute(self, data):
raise NotImplementedError()
@abstractmethod
def from_pandas(self, df):
raise NotImplementedError()
@abstractmethod
def map_objects(self, series, map_fn, meta=None):
raise NotImplementedError()
@abstractmethod
def map_partitions(self, series, map_fn, meta=None):
raise NotImplementedError()
@abstractmethod
def map_batches(self, df, map_fn, enable_tensor_extension_casting=True):
raise NotImplementedError()
@abstractmethod
def apply_objects(self, series, map_fn, meta=None):
raise NotImplementedError()
@abstractmethod
def reduce_objects(self, series, reduce_fn):
raise NotImplementedError()
@abstractmethod
def split(self, df, probabilities):
"""Splits the input DataFrame into sections with the given proportions."""
raise NotImplementedError()
@abstractmethod
def to_parquet(self, df, path, index=False):
"""Write the input DataFrame to the path in the Parquet format.
Optionally includes the DataFrame index in the Parquet file.
"""
raise NotImplementedError()
@abstractmethod
def write_predictions(self, df: DataFrame, path: str):
raise NotImplementedError()
@abstractmethod
def read_predictions(self, path: str) -> DataFrame:
raise NotImplementedError()
@abstractmethod
def to_ray_dataset(self, df):
raise NotImplementedError()
@property
@abstractmethod
def array_lib(self):
raise NotImplementedError()
@property
@abstractmethod
def df_lib(self):
raise NotImplementedError()
@property
@abstractmethod
def partitioned(self):
raise NotImplementedError()
@abstractmethod
def set_parallelism(self, parallelism):
raise NotImplementedError()
================================================
FILE: ludwig/data/dataframe/dask.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import collections
import logging
from contextlib import contextmanager
import dask
import dask.array as da
import dask.dataframe as dd
import ray
from dask.diagnostics import ProgressBar
from packaging import version
from pyarrow.fs import FSSpecHandler, PyFileSystem
from ray.data import Dataset, read_parquet
from ludwig.api_annotations import DeveloperAPI
from ludwig.data.dataframe.base import DataFrameEngine
from ludwig.utils.data_utils import get_pa_schema, get_parquet_filename, split_by_slices
from ludwig.utils.dataframe_utils import set_index_name
from ludwig.utils.fs_utils import get_fs_and_path
TMP_COLUMN = "__TMP_COLUMN__"
# This is to be compatible with pyarrow.lib.schema
PandasBlockSchema = collections.namedtuple("PandasBlockSchema", ["names", "types"])
logger = logging.getLogger(__name__)
_ray_230 = version.parse(ray.__version__) >= version.parse("2.3.0")
@DeveloperAPI
def set_scheduler(scheduler):
dask.config.set(scheduler=scheduler)
@DeveloperAPI
def reset_index_across_all_partitions(df):
"""Compute a monotonically increasing index across all partitions.
This differs from dd.reset_index, which computes an independent index for each partition.
Source: https://stackoverflow.com/questions/61395351/how-to-reset-index-on-concatenated-dataframe-in-dask
"""
# Create temporary column of ones
df = df.assign(**{TMP_COLUMN: 1})
# Set the index to the cumulative sum of TMP_COLUMN, which we know to be sorted; this improves efficiency.
df = df.set_index(df[TMP_COLUMN].cumsum() - 1, sorted=True)
# Drop temporary column and ensure the index is not named TMP_COLUMN
df = df.drop(columns=TMP_COLUMN)
df = df.map_partitions(lambda pd_df: set_index_name(pd_df, None))
return df
@DeveloperAPI
class DaskEngine(DataFrameEngine):
def __init__(self, parallelism=None, persist=True, _use_ray=True, **kwargs):
from ray.util.dask import ray_dask_get
self._parallelism = parallelism
self._persist = persist
if _use_ray:
set_scheduler(ray_dask_get)
def set_parallelism(self, parallelism):
self._parallelism = parallelism
def df_like(self, df: dd.DataFrame, proc_cols: dict[str, dd.Series]):
"""Outer joins the given DataFrame with the given processed columns.
NOTE: If any of the processed columns have been repartitioned, the original index is replaced with a
monotonically increasing index, which is used to define the new divisions and align the various partitions.
"""
# Our goal is to preserve the index of the input dataframe but to drop
# all its columns. Because to_frame() creates a column from the index,
# we need to drop it immediately following creation.
dataset = df.index.to_frame(name=TMP_COLUMN).drop(columns=TMP_COLUMN)
repartitioned_cols = {}
for k, v in proc_cols.items():
if v.npartitions == dataset.npartitions:
# Outer join cols with equal partitions.
# Dask aligns by index automatically, so no need to force divisions.
dataset[k] = v
else:
# If partitions have changed (e.g. due to conversion from Ray dataset), we handle separately
repartitioned_cols[k] = v
# Assumes that there is a globally unique index (see preprocessing.build_dataset)
if repartitioned_cols:
if not dataset.known_divisions:
# Sometimes divisions are unknown despite having a usable index; call set_index so the divisions become known
dataset = dataset.assign(**{TMP_COLUMN: dataset.index})
dataset = dataset.set_index(TMP_COLUMN, drop=True)
dataset = dataset.map_partitions(lambda pd_df: set_index_name(pd_df, dataset.index.name))
# Find the divisions of the column with the largest number of partitions
proc_col_with_max_npartitions = max(repartitioned_cols.values(), key=lambda x: x.npartitions)
new_divisions = proc_col_with_max_npartitions.divisions
# Repartition all columns to have the same divisions
dataset = dataset.repartition(divisions=new_divisions)
repartitioned_cols = {k: v.repartition(divisions=new_divisions) for k, v in repartitioned_cols.items()}
# Outer join the remaining columns
for k, v in repartitioned_cols.items():
dataset[k] = v
return dataset
def parallelize(self, data):
if self.parallelism:
return data.repartition(npartitions=self.parallelism)
return data
def persist(self, data):
# No graph optimizations to prevent dropping custom annotations
# https://github.com/dask/dask/issues/7036
return data.persist(optimize_graph=False) if self._persist else data
def concat(self, dfs):
return self.df_lib.concat(dfs)
def compute(self, data):
return data.compute()
def from_pandas(self, df):
parallelism = self._parallelism or 1
return dd.from_pandas(df, npartitions=parallelism)
def map_objects(self, series, map_fn, meta=None):
meta = meta if meta is not None else (series.name, "object")
return series.map(map_fn, meta=meta)
def map_partitions(self, series, map_fn, meta=None):
meta = meta if meta is not None else (series.name, "object")
return series.map_partitions(map_fn, meta=meta)
def map_batches(self, series, map_fn, enable_tensor_extension_casting=True):
"""Map a function over batches of a Dask Series.
Args:
series: Dask Series
map_fn: Function to apply to each batch
enable_tensor_extension_casting: Whether to enable tensor extension casting at the end of the Ray Datasets
map_batches call. This is useful in cases where the output is not supported by the ray Tensor dtype
extension, such as when the output consists of ragged tensors.
"""
import ray.data
with tensor_extension_casting(enable_tensor_extension_casting):
ds = ray.data.from_dask(series)
ds = ds.map_batches(map_fn, batch_format="pandas")
return ds.to_dask()
def apply_objects(self, df, apply_fn, meta=None):
meta = meta if meta is not None else ("result", "object")
return df.apply(apply_fn, axis=1, meta=meta)
def reduce_objects(self, series, reduce_fn):
result = series.reduction(reduce_fn, aggregate=reduce_fn, meta=(series.name, "object")).compute()
# The result type depends on the Dask version and what reduce_fn returns.
# Access the scalar value safely regardless of return type.
if hasattr(result, "iloc"):
return result.iloc[0]
return result
def split(self, df, probabilities):
# Split the DataFrame proportionately along partitions. This is an inexact solution designed
# to speed up the split process, as splitting within partitions would be significantly
# more expensive.
# TODO(travis): revisit in the future to make this more precise
# First ensure that every split receives at least one partition.
# If not, we need to increase the number of partitions to satisfy this constraint.
min_prob = min(probabilities)
min_partitions = int(1 / min_prob)
if df.npartitions < min_partitions:
df = df.repartition(npartitions=min_partitions)
n = df.npartitions
slices = df.partitions
return split_by_slices(slices, n, probabilities)
def remove_empty_partitions(self, df):
# Reference: https://stackoverflow.com/questions/47812785/remove-empty-partitions-in-dask
ll = list(df.map_partitions(len).compute())
if all([ll_i > 0 for ll_i in ll]):
return df
df_delayed = df.to_delayed()
df_delayed_new = list()
empty_partition = None
for ix, n in enumerate(ll):
if n == 0:
empty_partition = df.get_partition(ix)
else:
df_delayed_new.append(df_delayed[ix])
if not df_delayed_new:
# All partitions are empty, return a single empty partition
return empty_partition
df = dd.from_delayed(df_delayed_new, meta=empty_partition)
return df
def to_parquet(self, df, path, index=False):
schema = get_pa_schema(df)
with ProgressBar():
df.to_parquet(
path,
engine="pyarrow",
write_index=index,
schema=schema,
name_function=get_parquet_filename,
)
def write_predictions(self, df: dd.DataFrame, path: str):
ds = self.to_ray_dataset(df)
# We disable tensor extension casting here because we are writing out to Parquet and there is no need
# to cast to the ray Tensor dtype extension before doing so (they will be written out as object dtype as if
# we were writing to parquet using dask).
with tensor_extension_casting(False):
fs, path = get_fs_and_path(path)
ds.write_parquet(path, filesystem=PyFileSystem(FSSpecHandler(fs)))
def read_predictions(self, path: str) -> dd.DataFrame:
fs, path = get_fs_and_path(path)
ds = read_parquet(path, filesystem=PyFileSystem(FSSpecHandler(fs)))
return self.from_ray_dataset(ds)
def to_ray_dataset(self, df) -> Dataset:
from ray.data import from_dask
return from_dask(df)
def from_ray_dataset(self, dataset) -> dd.DataFrame:
# NOTE: When the dataset is an empty MapBatches(BatchInferModel), Ray's native to_dask() raises an IndexError.
try:
return dataset.to_dask()
except IndexError as e:
logger.warning(
f"Encountered an empty Dataset, {dataset.show()} with error {e}. Manually returning an empty dask "
"DataFrame."
)
return dd.DataFrame.from_dict({}, npartitions=1)
def reset_index(self, df):
return reset_index_across_all_partitions(df)
@property
def array_lib(self):
return da
@property
def df_lib(self):
return dd
@property
def parallelism(self):
return self._parallelism
@property
def partitioned(self):
return True
@contextmanager
def tensor_extension_casting(enforced: bool):
"""This context manager is used to enforce or disable tensor extension casting.
Ray Datasets will automatically cast tensor columns to the ray Tensor dtype extension at the end of
map_batches calls and before writing to Parquet. This context manager can be used to disable this behavior
and keep the tensor columns as object dtype. This is useful for writing to Parquet using dask.
Args:
enforced (bool): Whether to enforce tensor extension casting.
"""
from ray.data.context import DatasetContext
ctx = DatasetContext.get_current()
prev_enable_tensor_extension_casting = ctx.enable_tensor_extension_casting
try:
ctx.enable_tensor_extension_casting = enforced
yield
finally:
ctx.enable_tensor_extension_casting = prev_enable_tensor_extension_casting
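# Illustrative sketch (not part of the module): the save/override/restore pattern that
# `tensor_extension_casting` applies to the Ray dataset context, shown on a plain object so it runs
# without Ray. `temporarily_set` and the `SimpleNamespace` stand-in are hypothetical names used only
# for this example.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, attr, value):
    """Set obj.attr to value inside the with-block and restore the old value afterwards."""
    previous = getattr(obj, attr)
    try:
        setattr(obj, attr, value)
        yield
    finally:
        # Runs on normal exit and on exceptions, so the setting never leaks out of the block.
        setattr(obj, attr, previous)


ctx = SimpleNamespace(enable_tensor_extension_casting=True)
with temporarily_set(ctx, "enable_tensor_extension_casting", False):
    inside = ctx.enable_tensor_extension_casting
after = ctx.enable_tensor_extension_casting
print(inside, after)
```

# Restoring in `finally` is the important design choice: a failure inside the block still leaves
# the shared context in its previous state for later callers.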
================================================
FILE: ludwig/data/dataframe/modin.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import modin.pandas as pd
import numpy as np
from ludwig.data.dataframe.base import DataFrameEngine
from ludwig.globals import PREDICTIONS_SHAPES_FILE_NAME
from ludwig.utils.data_utils import get_pa_schema, load_json, save_json, split_by_slices
from ludwig.utils.dataframe_utils import flatten_df, unflatten_df
class ModinEngine(DataFrameEngine):
def __init__(self, **kwargs):
super().__init__()
def df_like(self, df, proc_cols):
# df argument unused for modin, which can instantiate df directly
return pd.DataFrame(proc_cols)
def parallelize(self, data):
return data
def persist(self, data):
return data
def compute(self, data):
return data
def from_pandas(self, df):
return pd.DataFrame(df)
def map_objects(self, series, map_fn, meta=None):
return series.map(map_fn)
def map_batches(self, df, map_fn, enable_tensor_extension_casting=True):
return map_fn(df)
def map_partitions(self, series, map_fn, meta=None):
return map_fn(series)
def apply_objects(self, df, apply_fn, meta=None):
return df.apply(apply_fn, axis=1)
def reduce_objects(self, series, reduce_fn):
return reduce_fn(series)
def split(self, df, probabilities):
return split_by_slices(df.iloc, len(df), probabilities)
def remove_empty_partitions(self, df):
return df
def to_parquet(self, df, path, index=False):
schema = get_pa_schema(df)
df.to_parquet(
path,
engine="pyarrow",
index=index,
schema=schema,
)
def write_predictions(self, df: pd.DataFrame, path: str):
df, column_shapes = flatten_df(df, self)
self.to_parquet(df, path)
save_json(os.path.join(os.path.dirname(path), PREDICTIONS_SHAPES_FILE_NAME), column_shapes)
def read_predictions(self, path: str) -> pd.DataFrame:
pred_df = pd.read_parquet(path)
column_shapes = load_json(os.path.join(os.path.dirname(path), PREDICTIONS_SHAPES_FILE_NAME))
return unflatten_df(pred_df, column_shapes, self)
def to_ray_dataset(self, df):
from ray.data import from_modin
return from_modin(df)
def from_ray_dataset(self, dataset) -> pd.DataFrame:
return dataset.to_modin()
def reset_index(self, df):
return df.reset_index(drop=True)
@property
def array_lib(self):
return np
@property
def df_lib(self):
return pd
@property
def partitioned(self):
return False
def set_parallelism(self, parallelism):
pass
================================================
FILE: ludwig/data/dataframe/pandas.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import numpy as np
import pandas as pd
from ludwig.data.dataframe.base import DataFrameEngine
from ludwig.globals import PREDICTIONS_SHAPES_FILE_NAME
from ludwig.utils.data_utils import load_json, save_json, split_by_slices
from ludwig.utils.dataframe_utils import flatten_df, unflatten_df
class PandasEngine(DataFrameEngine):
def __init__(self, **kwargs):
super().__init__()
def df_like(self, df, proc_cols):
# df argument unused for pandas, which can instantiate df directly
return pd.DataFrame(proc_cols)
def parallelize(self, data):
return data
def persist(self, data):
return data
def compute(self, data):
return data
@staticmethod
def concat(dfs) -> pd.DataFrame:
return pd.concat(dfs)
def from_pandas(self, df):
return df
def map_objects(self, series, map_fn, meta=None):
return series.map(map_fn)
def map_batches(self, df, map_fn, enable_tensor_extension_casting=True):
return map_fn(df)
def map_partitions(self, series, map_fn, meta=None):
return map_fn(series)
def apply_objects(self, df, apply_fn, meta=None):
return df.apply(apply_fn, axis=1)
def reduce_objects(self, series, reduce_fn):
return reduce_fn(series)
def split(self, df, probabilities):
return split_by_slices(df.iloc, len(df), probabilities)
@staticmethod
def remove_empty_partitions(df: pd.DataFrame) -> pd.DataFrame:
return df
def to_parquet(self, df, path, index=False):
df.to_parquet(path, engine="pyarrow", index=index)
def write_predictions(self, df: pd.DataFrame, path: str):
df, column_shapes = flatten_df(df, self)
self.to_parquet(df, path)
save_json(os.path.join(os.path.dirname(path), PREDICTIONS_SHAPES_FILE_NAME), column_shapes)
def read_predictions(self, path: str) -> pd.DataFrame:
pred_df = pd.read_parquet(path)
column_shapes = load_json(os.path.join(os.path.dirname(path), PREDICTIONS_SHAPES_FILE_NAME))
return unflatten_df(pred_df, column_shapes, self)
def to_ray_dataset(self, df):
from ray.data import from_pandas
return from_pandas(df)
@staticmethod
def from_ray_dataset(dataset) -> pd.DataFrame:
return dataset.to_pandas()
@staticmethod
def reset_index(df) -> pd.DataFrame:
return df.reset_index(drop=True)
@property
def array_lib(self):
return np
@property
def df_lib(self):
return pd
@property
def partitioned(self):
return False
def set_parallelism(self, parallelism):
pass
PANDAS = PandasEngine()
================================================
FILE: ludwig/data/dataset/__init__.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
def get_pandas_dataset_manager(**kwargs):
from ludwig.data.dataset.pandas import PandasDatasetManager
return PandasDatasetManager(**kwargs)
def get_ray_dataset_manager(**kwargs):
from ludwig.data.dataset.ray import RayDatasetManager
return RayDatasetManager(**kwargs)
dataset_registry = {
"hdf5": get_pandas_dataset_manager,
"ray": get_ray_dataset_manager,
None: get_pandas_dataset_manager,
}
def create_dataset_manager(backend, cache_format, **kwargs):
return dataset_registry[cache_format](backend=backend, **kwargs)
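# Illustrative sketch (not part of the module): the factory-registry pattern used by
# `create_dataset_manager`, with stand-in classes so it runs without Ludwig installed. All names
# here (`InMemoryManager`, `RemoteManager`, `create_manager`) are hypothetical. The explicit
# ValueError is an addition for this example; the function above would raise a bare KeyError.

```python
class InMemoryManager:
    def __init__(self, backend, **kwargs):
        self.backend = backend


class RemoteManager:
    def __init__(self, backend, **kwargs):
        self.backend = backend


def get_in_memory_manager(**kwargs):
    # In the real module the import happens here, so optional dependencies load only on demand.
    return InMemoryManager(**kwargs)


def get_remote_manager(**kwargs):
    return RemoteManager(**kwargs)


registry = {
    "hdf5": get_in_memory_manager,
    "remote": get_remote_manager,
    None: get_in_memory_manager,  # default when no cache format is given
}


def create_manager(backend, cache_format, **kwargs):
    try:
        factory = registry[cache_format]
    except KeyError:
        raise ValueError(f"Unknown cache format: {cache_format!r}") from None
    return factory(backend=backend, **kwargs)


print(type(create_manager("local", None)).__name__)
print(type(create_manager("cluster", "remote")).__name__)
```

# Mapping names to factory functions rather than classes keeps heavy imports out of module import
# time, which is why the two getters above defer their imports.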
================================================
FILE: ludwig/data/dataset/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from __future__ import annotations
import contextlib
from abc import ABC, abstractmethod
from collections.abc import Iterable
from ludwig.data.batcher.base import Batcher
from ludwig.distributed import DistributedStrategy
from ludwig.features.base_feature import BaseFeature
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.types import DataFrame
class Dataset(ABC):
@abstractmethod
def __len__(self) -> int:
raise NotImplementedError()
@contextlib.contextmanager
@abstractmethod
def initialize_batcher(
self,
batch_size: int = 128,
should_shuffle: bool = True,
random_seed: int = default_random_seed,
ignore_last: bool = False,
distributed: DistributedStrategy = None,
) -> Batcher:
raise NotImplementedError()
@abstractmethod
def to_df(self, features: Iterable[BaseFeature] | None = None) -> DataFrame:
raise NotImplementedError()
@abstractmethod
def to_scalar_df(self, features: Iterable[BaseFeature] | None = None) -> DataFrame:
raise NotImplementedError()
@property
def in_memory_size_bytes(self) -> int:
raise NotImplementedError()
class DatasetManager(ABC):
@abstractmethod
def create(self, dataset, config, training_set_metadata) -> Dataset:
raise NotImplementedError()
@abstractmethod
def save(self, cache_path, dataset, config, training_set_metadata, tag) -> Dataset:
raise NotImplementedError()
@abstractmethod
def can_cache(self, skip_save_processed_input) -> bool:
raise NotImplementedError()
@property
@abstractmethod
def data_format(self) -> str:
raise NotImplementedError()
================================================
FILE: ludwig/data/dataset/pandas.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from __future__ import annotations
import contextlib
from collections.abc import Iterable
from typing import TYPE_CHECKING
import numpy as np
from pandas import DataFrame
from ludwig.constants import PREPROCESSING, TRAINING
from ludwig.data.batcher.base import Batcher
from ludwig.data.batcher.random_access import RandomAccessBatcher
from ludwig.data.dataset.base import Dataset, DatasetManager
from ludwig.data.sampler import DistributedSampler
from ludwig.distributed import DistributedStrategy
from ludwig.features.base_feature import BaseFeature
from ludwig.utils.data_utils import DATA_TRAIN_HDF5_FP, load_hdf5, save_hdf5
from ludwig.utils.dataframe_utils import from_numpy_dataset, to_numpy_dataset, to_scalar_df
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import download_h5
from ludwig.utils.misc_utils import get_proc_features
if TYPE_CHECKING:
from ludwig.backend.base import Backend
class PandasDataset(Dataset):
def __init__(self, dataset, features, data_hdf5_fp):
self.features = features
self.data_hdf5_fp = data_hdf5_fp
if isinstance(dataset, str):
dataset = load_hdf5(dataset)
self.dataset = to_numpy_dataset(dataset)
self.size = len(list(self.dataset.values())[0])
def to_df(self, features: Iterable[BaseFeature] | None = None) -> DataFrame:
"""Convert the dataset to a Pandas DataFrame."""
if features:
return from_numpy_dataset({feature.feature_name: self.dataset[feature.proc_column] for feature in features})
return from_numpy_dataset(self.dataset)
def to_scalar_df(self, features: Iterable[BaseFeature] | None = None) -> DataFrame:
return to_scalar_df(self.to_df(features))
def get(self, proc_column, idx=None):
if idx is None:
idx = range(self.size)
if (
self.data_hdf5_fp is None
or PREPROCESSING not in self.features[proc_column]
or "in_memory" not in self.features[proc_column]["preprocessing"]
):
return self.dataset[proc_column][idx]
if self.features[proc_column][PREPROCESSING]["in_memory"]:
return self.dataset[proc_column][idx]
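# The feature is not held in memory: self.dataset stores row indices into the HDF5 file.
sub_batch = self.dataset[proc_column][idx]
# Row 0: HDF5 indices, row 1: original batch positions, row 2: positions within the sorted read.
indices = np.empty((3, len(sub_batch)), dtype=np.int64)
indices[0, :] = sub_batch
indices[1, :] = np.arange(len(sub_batch))
# Sort by HDF5 index so the file is read with increasing indices.
indices = indices[:, np.argsort(indices[0])]
with download_h5(self.data_hdf5_fp) as h5_file:
im_data = h5_file[proc_column + "_data"][indices[0, :], :, :]
indices[2, :] = np.arange(len(sub_batch))
# Sort back by original position to restore the requested batch order.
indices = indices[:, np.argsort(indices[1])]
return im_data[indices[2, :]]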
def get_dataset(self) -> dict[str, np.ndarray]:
return self.dataset
def __len__(self):
return self.size
@property
def processed_data_fp(self) -> str | None:
return self.data_hdf5_fp
@property
def in_memory_size_bytes(self) -> int:
df = self.to_df()
return df.memory_usage(deep=True).sum() if df is not None else 0
@contextlib.contextmanager
def initialize_batcher(
self,
batch_size: int = 128,
should_shuffle: bool = True,
random_seed: int = default_random_seed,
ignore_last: bool = False,
distributed: DistributedStrategy | None = None,
augmentation_pipeline=None,
) -> Batcher:
sampler = DistributedSampler(
len(self), shuffle=should_shuffle, random_seed=random_seed, distributed=distributed
)
batcher = RandomAccessBatcher(
self,
sampler,
batch_size=batch_size,
ignore_last=ignore_last,
augmentation_pipeline=augmentation_pipeline,
)
yield batcher
class PandasDatasetManager(DatasetManager):
def __init__(self, backend: Backend):
self.backend: Backend = backend
def create(self, dataset, config, training_set_metadata) -> Dataset:
return PandasDataset(dataset, get_proc_features(config), training_set_metadata.get(DATA_TRAIN_HDF5_FP))
def save(self, cache_path, dataset, config, training_set_metadata, tag) -> Dataset:
save_hdf5(cache_path, dataset)
if tag == TRAINING:
training_set_metadata[DATA_TRAIN_HDF5_FP] = cache_path
return dataset
def can_cache(self, skip_save_processed_input) -> bool:
return self.backend.is_coordinator() and not skip_save_processed_input
@property
def data_format(self) -> str:
return "hdf5"
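# Illustrative sketch (not part of ludwig/data/dataset/pandas.py): PandasDataset.get reads
# out-of-memory rows in increasing index order and then restores the caller's order. The helper
# below shows the same sort / read / unsort round trip with plain Python lists; the name
# gather_in_sorted_order and the list used as "storage" are assumptions for illustration only.

```python
def gather_in_sorted_order(storage, requested):
    """Fetch storage[i] for each i in `requested`, reading in increasing index order."""
    # Positions of `requested`, ordered by the index each position asks for.
    order = sorted(range(len(requested)), key=lambda pos: requested[pos])
    sorted_ids = [requested[pos] for pos in order]

    # Stand-in for a backend that must be read with increasing indices.
    fetched = [storage[i] for i in sorted_ids]

    # Put every fetched item back at the position where it was originally requested.
    result = [None] * len(requested)
    for rank, pos in enumerate(order):
        result[pos] = fetched[rank]
    return result


storage = [i * 10 for i in range(10)]
rows = gather_in_sorted_order(storage, [7, 2, 5])
```

# The read itself touches indices 2, 5, 7 in that order, while `rows` comes back in the
# requested order 7, 2, 5.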
================================================
FILE: ludwig/data/dataset/ray.py
================================================
#! /usr/bin/env python
# Copyright (c) 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import contextlib
import math
import queue
import threading
from functools import lru_cache
from typing import Any
import numpy as np
import pandas as pd
from pyarrow.fs import FSSpecHandler, PyFileSystem
from ray.data import Dataset as RayNativeDataset
from ray.data import read_parquet
from ray.data.extensions import TensorArray
from ludwig.backend.base import Backend
from ludwig.constants import BINARY, CATEGORY, NAME, NUMBER, TYPE
from ludwig.data.batcher.base import Batcher
from ludwig.data.dataset.base import Dataset, DatasetManager
from ludwig.utils.data_utils import DATA_TRAIN_HDF5_FP, DATA_TRAIN_PARQUET_FP
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import get_fs_and_path
from ludwig.utils.misc_utils import get_proc_features
from ludwig.utils.types import DataFrame, Series
_SCALAR_TYPES = {BINARY, CATEGORY, NUMBER}
def cast_as_tensor_dtype(series: Series) -> Series:
return TensorArray(series)
def read_remote_parquet(path: str):
fs, path = get_fs_and_path(path)
return read_parquet(path, filesystem=PyFileSystem(FSSpecHandler(fs)))
class RayDataset(Dataset):
"""Wrapper around ray.data.Dataset."""
def __init__(
self,
df: str | DataFrame,
features: dict[str, dict],
training_set_metadata: dict[str, Any],
backend: Backend,
):
self.df_engine = backend.df_engine
self.ds = self.df_engine.to_ray_dataset(df) if not isinstance(df, str) else read_remote_parquet(df)
self.features = features
self.training_set_metadata = training_set_metadata
self.data_hdf5_fp = training_set_metadata.get(DATA_TRAIN_HDF5_FP)
self.data_parquet_fp = training_set_metadata.get(DATA_TRAIN_PARQUET_FP)
def to_ray_dataset(
self,
shuffle: bool = True,
shuffle_seed: int = default_random_seed,
) -> RayNativeDataset:
"""Returns a ray.data.Dataset, optionally shuffled.
In modern Ray (2.5+), datasets use lazy execution by default, so there's no need for explicit windowing or
pipelining.
"""
ds = self.ds
if shuffle:
ds = ds.random_shuffle(seed=shuffle_seed)
return ds
@contextlib.contextmanager
def initialize_batcher(self, batch_size=128, should_shuffle=True, random_seed=0, ignore_last=False, **kwargs):
ds = self.ds
if should_shuffle:
ds = ds.random_shuffle(seed=random_seed)
yield RayDatasetBatcher(
ds,
self.features,
self.training_set_metadata,
batch_size,
self.size,
)
def __len__(self):
return self.ds.count()
@property
def size(self):
return len(self)
@property
def in_memory_size_bytes(self):
return self.ds.size_bytes() if self.ds is not None else 0
def to_df(self, features=None):
return self.df_engine.from_ray_dataset(self.ds)
def to_scalar_df(self, features=None):
from ludwig.utils.dataframe_utils import to_scalar_df
return to_scalar_df(self.to_df(features))
class RayDatasetManager(DatasetManager):
def __init__(self, backend):
self.backend = backend
def create(self, dataset: str | DataFrame, config: dict[str, Any], training_set_metadata: dict[str, Any]):
return RayDataset(dataset, get_proc_features(config), training_set_metadata, self.backend)
def save(
self,
cache_path: str,
dataset: DataFrame,
config: dict[str, Any],
training_set_metadata: dict[str, Any],
tag: str,
):
self.backend.df_engine.to_parquet(dataset, cache_path)
return cache_path
def can_cache(self, skip_save_processed_input):
return not skip_save_processed_input
@property
def data_format(self):
return "parquet"
class RayDatasetShard(Dataset):
"""Wraps a Ray DataIterator (from ray.train.get_dataset_shard) for distributed training."""
def __init__(
self,
dataset_shard,
features: dict[str, dict],
training_set_metadata: dict[str, Any],
):
self.dataset_shard = dataset_shard
self.features = features
self.training_set_metadata = training_set_metadata
@contextlib.contextmanager
def initialize_batcher(self, batch_size=128, should_shuffle=True, random_seed=0, ignore_last=False, **kwargs):
yield RayDatasetShardBatcher(
self.dataset_shard,
self.features,
self.training_set_metadata,
batch_size,
self.size,
)
@lru_cache(1)
def __len__(self):
# TODO(travis): find way to avoid calling this, as it's expensive
# DataIterator has no direct count method, so iterate over the batches and sum their lengths
count = 0
for batch in self.dataset_shard.iter_batches(batch_size=4096, batch_format="pandas"):
count += len(batch)
return count
@property
def size(self):
return len(self)
def to_df(self, features=None):
raise NotImplementedError("RayDatasetShard does not support to_df; use full RayDataset instead.")
def to_scalar_df(self, features=None):
raise NotImplementedError("RayDatasetShard does not support to_scalar_df; use full RayDataset instead.")
class _BaseBatcher(Batcher):
"""Shared batching logic for preparing batches from pandas DataFrames."""
def __init__(
self,
features: dict[str, dict],
training_set_metadata: dict[str, Any],
batch_size: int,
samples_per_epoch: int,
):
self.batch_size = batch_size
self.samples_per_epoch = samples_per_epoch
self.training_set_metadata = training_set_metadata
self.features = features
self.columns = list(features.keys())
self.reshape_map = {
proc_column: training_set_metadata[feature[NAME]].get("reshape")
for proc_column, feature in features.items()
}
self.dataset_batch_iter = None
self._epoch = 0
self._next_batch = None
self._last_batch = False
self._step = 0
def next_batch(self):
if self.last_batch():
raise StopIteration()
batch = self._next_batch
self._fetch_next_batch()
self._step += 1
return batch
def last_batch(self):
return self._last_batch
def set_epoch(self, epoch, batch_size):
self.batch_size = batch_size
if epoch != self._epoch:
self._fetch_next_epoch()
self._epoch = epoch
@property
def step(self):
return self._step
@property
def steps_per_epoch(self):
return math.ceil(self.samples_per_epoch / self.batch_size)
def _fetch_next_batch(self):
if self.dataset_batch_iter is None:
self._last_batch = True
return
self._last_batch = False
try:
self._next_batch = next(self.dataset_batch_iter)
except StopIteration:
self._last_batch = True
def _fetch_next_epoch(self):
raise NotImplementedError
def _to_tensors_fn(self):
columns = self.columns
features = self.features
def to_tensors(df: pd.DataFrame) -> pd.DataFrame:
for c in columns:
# do not convert scalar columns: https://github.com/ray-project/ray/issues/20825
if features[c][TYPE] not in _SCALAR_TYPES:
df[c] = cast_as_tensor_dtype(df[c])
elif features[c][TYPE] == BINARY:
df[c] = df[c].astype(np.bool_)
return df
return to_tensors
def _prepare_batch(self, batch: pd.DataFrame) -> dict[str, np.ndarray]:
res = {}
for c in self.columns:
if self.features[c][TYPE] not in _SCALAR_TYPES:
res[c] = np.stack(batch[c].values)
else:
res[c] = batch[c].to_numpy()
for c in self.columns:
reshape = self.reshape_map.get(c)
if reshape is not None:
res[c] = res[c].reshape((-1, *reshape))
return res
class RayDatasetBatcher(_BaseBatcher):
"""Batcher for a full ray.data.Dataset (used by non-distributed/local Ray training)."""
def __init__(
self,
dataset: RayNativeDataset,
features: dict[str, dict],
training_set_metadata: dict[str, Any],
batch_size: int,
samples_per_epoch: int,
):
self.dataset = dataset
super().__init__(features, training_set_metadata, batch_size, samples_per_epoch)
self._fetch_next_epoch()
def _fetch_next_epoch(self):
"""Create an async reader over the dataset for one epoch."""
self.dataset_batch_iter = self._create_async_reader(self.dataset)
self._step = 0
self._fetch_next_batch()
def _create_async_reader(self, dataset: RayNativeDataset):
q = queue.Queue(maxsize=100)
batch_size = self.batch_size
to_tensors = self._to_tensors_fn()
def producer():
for batch in dataset.map_batches(to_tensors, batch_format="pandas").iter_batches(
prefetch_batches=1, batch_size=batch_size, batch_format="pandas"
):
res = self._prepare_batch(batch)
q.put(res)
q.put(None)
def async_read():
t = threading.Thread(target=producer)
t.start()
while True:
batch = q.get(block=True)
if batch is None:
break
yield batch
t.join()
return async_read()
class RayDatasetShardBatcher(_BaseBatcher):
"""Batcher for a Ray DataIterator shard (used in distributed training workers)."""
def __init__(
self,
data_iterator,
features: dict[str, dict],
training_set_metadata: dict[str, Any],
batch_size: int,
samples_per_epoch: int,
):
self.data_iterator = data_iterator
super().__init__(features, training_set_metadata, batch_size, samples_per_epoch)
self._fetch_next_epoch()
def _fetch_next_epoch(self):
"""Create an async reader from the DataIterator for one epoch."""
self.dataset_batch_iter = self._create_async_reader()
self._step = 0
self._fetch_next_batch()
def _create_async_reader(self):
q = queue.Queue(maxsize=100)
batch_size = self.batch_size
to_tensors = self._to_tensors_fn()
def producer():
for batch in self.data_iterator.iter_batches(
batch_size=batch_size,
batch_format="pandas",
prefetch_batches=1,
):
batch = to_tensors(batch)
res = self._prepare_batch(batch)
q.put(res)
q.put(None)
def async_read():
t = threading.Thread(target=producer)
t.start()
while True:
batch = q.get(block=True)
if batch is None:
break
yield batch
t.join()
return async_read()
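# Illustrative sketch (not part of ludwig/data/dataset/ray.py): both batchers above use the same
# producer/consumer pattern, in which a background thread fills a bounded queue and a None sentinel
# marks the end of the epoch. The version below reproduces that pattern with the standard library
# only; async_read_sketch and its arguments are hypothetical names, and it assumes that no real
# batch is ever None.

```python
import queue
import threading


def async_read_sketch(batches, maxsize=100):
    """Yield items from `batches` while a background thread produces them."""
    q = queue.Queue(maxsize=maxsize)

    def producer():
        for batch in batches:
            q.put(batch)  # blocks when the queue is full, bounding memory use
        q.put(None)  # sentinel: no more batches

    t = threading.Thread(target=producer)
    t.start()
    while True:
        batch = q.get(block=True)
        if batch is None:
            break
        yield batch
    t.join()


consumed = list(async_read_sketch([[1, 2], [3, 4], [5]], maxsize=2))
```

# As in the batchers above, the producer thread only starts once the generator is first advanced,
# and a consumer that stops early leaves the thread blocked on a full queue.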
================================================
FILE: ludwig/data/dataset_synthesizer.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import os
import random
import string
import sys
import uuid
import numpy as np
import pandas as pd
import torch
import torchaudio
import yaml
from packaging import version
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUDIO,
BAG,
BINARY,
CATEGORY,
CATEGORY_DISTRIBUTION,
DATE,
DECODER,
ENCODER,
H3,
IMAGE,
INPUT_FEATURES,
NAME,
NUMBER,
OUTPUT_FEATURES,
PREPROCESSING,
SEQUENCE,
SET,
TEXT,
TIMESERIES,
TYPE,
VECTOR,
)
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.types import ModelConfigDict
from ludwig.utils.data_utils import save_csv
from ludwig.utils.h3_util import components_to_h3
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.print_utils import print_ludwig
logger = logging.getLogger(__name__)
_TORCH_AUDIO_210 = version.parse(torchaudio.__version__) >= version.parse("2.1.0")
letters = string.ascii_letters
DATETIME_FORMATS = {
"%m-%d-%Y": "{m:02d}-{d:02d}-{Y:04d}",
"%m-%d-%Y %H:%M:%S": "{m:02d}-{d:02d}-{Y:04d} {H:02d}:{M:02d}:{S:02d}",
"%m/%d/%Y": "{m:02d}/{d:02d}/{Y:04d}",
"%m/%d/%Y %H:%M:%S": "{m:02d}/{d:02d}/{Y:04d} {H:02d}:{M:02d}:{S:02d}",
"%m-%d-%y": "{m:02d}-{d:02d}-{y:02d}",
"%m-%d-%y %H:%M:%S": "{m:02d}-{d:02d}-{y:02d} {H:02d}:{M:02d}:{S:02d}",
"%m/%d/%y": "{m:02d}/{d:02d}/{y:02d}",
"%m/%d/%y %H:%M:%S": "{m:02d}/{d:02d}/{y:02d} {H:02d}:{M:02d}:{S:02d}",
"%d-%m-%Y": "{d:02d}-{m:02d}-{Y:04d}",
"%d-%m-%Y %H:%M:%S": "{d:02d}-{m:02d}-{Y:04d} {H:02d}:{M:02d}:{S:02d}",
"%d/%m/%Y": "{d:02d}/{m:02d}/{Y:04d}",
"%d/%m/%Y %H:%M:%S": "{d:02d}/{m:02d}/{Y:04d} {H:02d}:{M:02d}:{S:02d}",
"%d-%m-%y": "{d:02d}-{m:02d}-{y:02d}",
"%d-%m-%y %H:%M:%S": "{d:02d}-{m:02d}-{y:02d} {H:02d}:{M:02d}:{S:02d}",
"%d/%m/%y": "{d:02d}/{m:02d}/{y:02d}",
"%d/%m/%y %H:%M:%S": "{d:02d}/{m:02d}/{y:02d} {H:02d}:{M:02d}:{S:02d}",
"%y-%m-%d": "{y:02d}-{m:02d}-{d:02d}",
"%y-%m-%d %H:%M:%S": "{y:02d}-{m:02d}-{d:02d} {H:02d}:{M:02d}:{S:02d}",
"%y/%m/%d": "{y:02d}/{m:02d}/{d:02d}",
"%y/%m/%d %H:%M:%S": "{y:02d}/{m:02d}/{d:02d} {H:02d}:{M:02d}:{S:02d}",
"%Y-%m-%d": "{Y:04d}-{m:02d}-{d:02d}",
"%Y-%m-%d %H:%M:%S": "{Y:04d}-{m:02d}-{d:02d} {H:02d}:{M:02d}:{S:02d}",
"%Y/%m/%d": "{Y:04d}/{m:02d}/{d:02d}",
"%Y/%m/%d %H:%M:%S": "{Y:04d}/{m:02d}/{d:02d} {H:02d}:{M:02d}:{S:02d}",
"%y-%d-%m": "{y:02d}-{d:02d}-{m:02d}",
"%y-%d-%m %H:%M:%S": "{y:02d}-{d:02d}-{m:02d} {H:02d}:{M:02d}:{S:02d}",
"%y/%d/%m": "{y:02d}/{d:02d}/{m:02d}",
"%y/%d/%m %H:%M:%S": "{y:02d}/{d:02d}/{m:02d} {H:02d}:{M:02d}:{S:02d}",
"%Y-%d-%m": "{Y:04d}-{d:02d}-{m:02d}",
"%Y-%d-%m %H:%M:%S": "{Y:04d}-{d:02d}-{m:02d} {H:02d}:{M:02d}:{S:02d}",
"%Y/%d/%m": "{Y:04d}/{d:02d}/{m:02d}",
"%Y/%d/%m %H:%M:%S": "{Y:04d}/{d:02d}/{m:02d} {H:02d}:{M:02d}:{S:02d}",
}
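# Illustrative sketch (not part of the original module): each DATETIME_FORMATS key is a
# strftime-style pattern and each value is a str.format template producing text in that pattern.
# The snippet below copies one pair from the mapping and checks the round trip with
# datetime.strptime; the field values are made up for the example.

```python
from datetime import datetime

# One key and its template, copied from the mapping above.
strptime_format = "%Y-%m-%d %H:%M:%S"
template = "{Y:04d}-{m:02d}-{d:02d} {H:02d}:{M:02d}:{S:02d}"

# Hypothetical field values; keyword arguments the template does not use (here `y`) are ignored.
text = template.format(y=23, Y=2023, m=7, d=4, H=9, M=5, S=1)

# The generated string parses back with the pattern that served as the dictionary key.
parsed = datetime.strptime(text, strptime_format)
```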
def _get_feature_encoder_or_decoder(feature):
"""Returns the nested decoder or encoder dictionary for a feature.
If neither encoder nor decoder is present, creates an empty encoder dict and returns it.
"""
if DECODER in feature:
return feature[DECODER]
elif ENCODER in feature:
return feature[ENCODER]
else:
feature[ENCODER] = {}
return feature[ENCODER]
def generate_string(length):
sequence = []
for _ in range(length):
sequence.append(random.choice(letters))
return "".join(sequence)
def build_vocab(size):
vocab = []
for _ in range(size):
vocab.append(generate_string(random.randint(2, 10)))
return vocab
def return_none(feature):
return None
def assign_vocab(feature):
encoder_or_decoder = _get_feature_encoder_or_decoder(feature)
encoder_or_decoder["idx2str"] = build_vocab(encoder_or_decoder.get("vocab_size", 10))
encoder_or_decoder["vocab_size"] = len(encoder_or_decoder["idx2str"])
def build_feature_parameters(features):
feature_parameters = {}
for feature in features:
feature_builder_function = get_from_registry(feature[TYPE], parameters_builders_registry)
feature_parameters[feature[NAME]] = feature_builder_function(feature)
return feature_parameters
parameters_builders_registry = {
"category": assign_vocab,
"text": assign_vocab,
"number": return_none,
"binary": return_none,
"set": assign_vocab,
"bag": assign_vocab,
"sequence": assign_vocab,
"timeseries": return_none,
"image": return_none,
"audio": return_none,
"date": return_none,
"h3": return_none,
VECTOR: return_none,
CATEGORY_DISTRIBUTION: return_none,
}
@DeveloperAPI
def build_synthetic_dataset_df(dataset_size: int, config: ModelConfigDict) -> pd.DataFrame:
for feature in config[OUTPUT_FEATURES]:
if DECODER not in feature:
feature[DECODER] = {}
features = config[INPUT_FEATURES] + config[OUTPUT_FEATURES]
df = build_synthetic_dataset(dataset_size, features)
data = [next(df) for _ in range(dataset_size + 1)]
return pd.DataFrame(data[1:], columns=data[0])
@DeveloperAPI
def build_synthetic_dataset(dataset_size: int, features: list[dict], outdir: str = "."):
"""Synthesizes a dataset for testing purposes.
:param dataset_size: (int) size of the dataset
:param features: (List[dict]) list of features to generate in YAML format.
Provide a list containing one dictionary for each feature,
each dictionary must include a name, a type
and can include some generation parameters depending on the type
:param outdir: (str) Path to an output directory. Used for saving synthetic image and audio files.
Example content for features:
[
{name: text_1, type: text, vocab_size: 20, max_len: 20},
{name: text_2, type: text, vocab_size: 20, max_len: 20},
{name: category_1, type: category, vocab_size: 10},
{name: category_2, type: category, vocab_size: 15},
{name: number_1, type: number},
{name: number_2, type: number},
{name: binary_1, type: binary},
{name: binary_2, type: binary},
{name: set_1, type: set, vocab_size: 20, max_len: 20},
{name: set_2, type: set, vocab_size: 20, max_len: 20},
{name: bag_1, type: bag, vocab_size: 20, max_len: 10},
{name: bag_2, type: bag, vocab_size: 20, max_len: 10},
{name: sequence_1, type: sequence, vocab_size: 20, max_len: 20},
{name: sequence_2, type: sequence, vocab_size: 20, max_len: 20},
{name: timeseries_1, type: timeseries, max_len: 20},
{name: timeseries_2, type: timeseries, max_len: 20},
{name: date_1, type: date},
{name: date_2, type: date},
{name: h3_1, type: h3},
{name: h3_2, type: h3},
{name: vector_1, type: vector},
{name: vector_2, type: vector},
]
"""
build_feature_parameters(features)
header = []
for feature in features:
header.append(feature[NAME])
yield header
for _ in range(dataset_size):
yield generate_datapoint(features=features, outdir=outdir)
def generate_datapoint(features: list[dict], outdir: str) -> list:
"""Returns a synthetic example containing features specified by the features spec.
`outdir` is only used for generating synthetic image and synthetic audio features. Otherwise, it is unused.
"""
datapoint = []
for feature in features:
if "cycle" in feature and feature["cycle"] is True and feature[TYPE] in cyclers_registry:
cycler_function = cyclers_registry[feature[TYPE]]
feature_value = cycler_function(feature)
else:
generator_function = get_from_registry(feature[TYPE], generators_registry)
feature_value = generator_function(feature=feature, outdir=outdir)
datapoint.append(feature_value)
return datapoint
def generate_category(feature, outdir: str | None = None) -> str:
"""Returns a random category.
`outdir` is unused.
"""
encoder_or_decoder = _get_feature_encoder_or_decoder(feature)
return random.choice(encoder_or_decoder["idx2str"])
def generate_number(feature, outdir: str | None = None) -> float:
"""Returns a random float drawn uniformly between the feature's `min` (default 0) and `max` (default 1).
`outdir` is unused.
"""
return random.uniform(feature.get("min", 0), feature.get("max", 1))
def generate_binary(feature, outdir: str | None = None) -> bool:
"""Returns a random boolean.
`outdir` is unused.
"""
choices = feature.get("bool2str", [False, True])
p = feature["prob"] if "prob" in feature else 0.5
return np.random.choice(choices, p=[1 - p, p])
def generate_sequence(feature, outdir: str | None = None) -> str:
"""Returns a random sequence.
`outdir` is unused.
"""
encoder_or_decoder = _get_feature_encoder_or_decoder(feature)
length = encoder_or_decoder.get("max_len", 10)
if "min_len" in encoder_or_decoder:
length = random.randint(encoder_or_decoder["min_len"], length)
sequence = [random.choice(encoder_or_decoder["idx2str"]) for _ in range(length)]
encoder_or_decoder["vocab_size"] = (
encoder_or_decoder["vocab_size"] + 4
) # For special symbols: START, STOP, PAD, UNK.
return " ".join(sequence)
def generate_set(feature, outdir: str | None = None) -> str:
"""Returns a random set.
`outdir` is unused.
"""
encoder_or_decoder = _get_feature_encoder_or_decoder(feature)
elems = []
for _ in range(random.randint(0, encoder_or_decoder.get("max_len", 3))):
elems.append(random.choice(encoder_or_decoder["idx2str"]))
return " ".join(list(set(elems)))
def generate_bag(feature, outdir: str | None = None) -> str:
"""Returns a random bag.
`outdir` is unused.
"""
encoder_or_decoder = _get_feature_encoder_or_decoder(feature)
elems = []
for _ in range(random.randint(0, encoder_or_decoder.get("max_len", 3))):
elems.append(random.choice(encoder_or_decoder["idx2str"]))
return " ".join(elems)
def generate_text(feature, outdir: str | None = None) -> str:
"""Returns random text.
`outdir` is unused.
"""
encoder_or_decoder = _get_feature_encoder_or_decoder(feature)
length = encoder_or_decoder.get("max_len", 10)
text = []
for _ in range(random.randint(length - int(length * 0.2), length)):
text.append(random.choice(encoder_or_decoder["idx2str"]))
return " ".join(text)
def generate_timeseries(feature, max_len=10, outdir: str | None = None) -> str:
"""Returns a random timeseries.
`outdir` is unused.
"""
encoder = _get_feature_encoder_or_decoder(feature)
series = []
max_len = encoder.get("max_len", max_len)
series_len = random.randint(max_len - 2, max_len) # simulates variable length
for _ in range(series_len):
series.append(str(random.uniform(encoder.get("min", 0), encoder.get("max", 1))))
return " ".join(series)
def generate_audio(feature, outdir: str) -> str:
"""Generates random audio and saves it to the outdir.
Returns the path to the saved audio file.
"""
destination_folder = feature.get("destination_folder", outdir)
if PREPROCESSING in feature:
audio_length = feature[PREPROCESSING].get("audio_file_length_limit_in_s", 2)
else:
audio_length = feature.get("audio_file_length_limit_in_s", 1)
sampling_rate = 16000
num_samples = int(audio_length * sampling_rate)
audio = np.sin(np.arange(num_samples) / 100 * 2 * np.pi) * 2 * (np.random.random(num_samples) - 0.5)
audio_tensor = torch.tensor(np.array([audio])).type(torch.float32)
audio_filename = uuid.uuid4().hex[:10].upper() + ".wav"
if not os.path.exists(destination_folder):
os.makedirs(destination_folder)
audio_dest_path = os.path.join(destination_folder, audio_filename)
try:
if _TORCH_AUDIO_210:
torchaudio.save(audio_dest_path, audio_tensor, sample_rate=sampling_rate, backend="sox")
else:
torchaudio.save(audio_dest_path, audio_tensor, sampling_rate)
except OSError as e:
raise OSError(f"Unable to save audio to disk: {e}") from e
return audio_dest_path
def generate_image(feature, outdir: str, save_as_numpy: bool = False) -> str:
"""Generates a random image and saves it to the outdir.
Returns the path to the saved image file.
"""
save_as_numpy = feature.get("save_as_numpy", save_as_numpy)
try:
from torchvision.io import write_png
except ImportError:
logger.error(
"torchvision is not installed. "
"In order to install all image feature dependencies run "
"pip install ludwig[image]"
)
sys.exit(-1)
# Read num_channels, width, height
destination_folder = feature.get("destination_folder", outdir)
if PREPROCESSING in feature:
height = feature[PREPROCESSING].get("height", 28)
width = feature[PREPROCESSING].get("width", 28)
num_channels = feature[PREPROCESSING].get("num_channels", 1)
else:
encoder = _get_feature_encoder_or_decoder(feature)
height = encoder.get("height", 28)
width = encoder.get("width", 28)
num_channels = encoder.get("num_channels", 1)
if width <= 0 or height <= 0 or num_channels < 1:
raise ValueError("Invalid arguments for generating images")
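# Create a random image in (channels, height, width) layout, as expected by write_png.
# The upper bound of torch.randint is exclusive, so 256 covers the full uint8 range.
img = torch.randint(0, 256, (num_channels, height, width), dtype=torch.uint8)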
# Generate a unique random filename
image_filename = uuid.uuid4().hex[:10].upper() + ".png"
# Save the image to disk either in a specified location/new folder
if not os.path.exists(destination_folder):
os.makedirs(destination_folder)
image_dest_path = os.path.join(destination_folder, image_filename)
try:
# save_image(torch.from_numpy(img.astype("uint8")), image_dest_path)
if save_as_numpy:
with open(image_dest_path, "wb") as f:
np.save(f, img.detach().cpu().numpy())
else:
write_png(img, image_dest_path)
except OSError as e:
raise OSError(f"Unable to save images to disk: {e}") from e
return image_dest_path
def generate_datetime(feature, outdir: str | None = None) -> str:
"""Generates a random date time, picking a format among different types.
If no format is specified, the first one is used.
"""
if "datetime_format" in feature:
datetime_generation_format = DATETIME_FORMATS[feature["datetime_format"]]
elif "preprocessing" in feature and "datetime_format" in feature["preprocessing"]:
datetime_generation_format = DATETIME_FORMATS[feature["preprocessing"]["datetime_format"]]
else:
datetime_generation_format = DATETIME_FORMATS[next(iter(DATETIME_FORMATS))]
y = random.randint(1, 99)
Y = random.randint(1, 9999)
m = random.randint(1, 12)
d = random.randint(1, 28)
H = random.randint(1, 12)
M = random.randint(1, 59)
S = random.randint(1, 59)
return datetime_generation_format.format(y=y, Y=Y, m=m, d=d, H=H, M=M, S=S)
def generate_h3(feature, outdir: str | None = None) -> str:
"""Returns a random h3.
`outdir` is unused.
"""
resolution = random.randint(0, 15) # valid values [0, 15]
h3_components = {
"mode": 1, # we can avoid testing other modes
"edge": 0, # only used in other modes
"resolution": resolution,
"base_cell": random.randint(0, 121), # valid values [0, 121]
# valid values [0, 7]
"cells": [random.randint(0, 7) for _ in range(resolution)],
}
return components_to_h3(h3_components)
def generate_vector(feature, outdir: str | None = None) -> str:
"""Returns a random vector.
`outdir` is unused.
"""
# Space delimited string with floating point numbers
if PREPROCESSING in feature:
vector_size = feature[PREPROCESSING].get("vector_size", 10)
else:
vector_size = feature.get("vector_size", 10)
return " ".join([str(100 * random.random()) for _ in range(vector_size)])
def generate_category_distribution(feature, outdir: str | None = None) -> str:
"""Returns a random category distribution.
`outdir` is unused.
"""
# Space delimited string with floating point numbers that sum to 1
preprocessing = feature.get(PREPROCESSING, {})
vector_size = len(preprocessing.get("vocab", ["a", "b", "c"]))
v = np.random.rand(vector_size)
v = v / v.sum()
return " ".join([str(x) for x in v])
generators_registry = {
BINARY: generate_binary,
NUMBER: generate_number,
CATEGORY: generate_category,
SET: generate_set,
BAG: generate_bag,
SEQUENCE: generate_sequence,
TEXT: generate_text,
TIMESERIES: generate_timeseries,
IMAGE: generate_image,
AUDIO: generate_audio,
H3: generate_h3,
DATE: generate_datetime,
VECTOR: generate_vector,
CATEGORY_DISTRIBUTION: generate_category_distribution,
}
category_cycle = 0
def cycle_category(feature):
global category_cycle
idx2str = feature[DECODER]["idx2str"] if DECODER in feature else feature[ENCODER]["idx2str"]
if category_cycle >= len(idx2str):
category_cycle = 0
category = idx2str[category_cycle]
category_cycle += 1
return category
binary_cycle = False
def cycle_binary(feature):
global binary_cycle
if binary_cycle:
binary_cycle = False
return True
else:
binary_cycle = True
return False
cyclers_registry = {"category": cycle_category, "binary": cycle_binary}
def cli_synthesize_dataset(dataset_size: int, features: list[dict], output_path: str, **kwargs) -> None:
"""Synthesizes a dataset for testing purposes.
:param dataset_size: (int) size of the dataset
:param features: (List[dict]) list of features to generate in YAML format.
Provide a list containing one dictionary for each feature,
each dictionary must include a name, a type
and can include some generation parameters depending on the type
:param output_path: (str) path where to save the output CSV file
Example content for features:
[
{name: text_1, type: text, vocab_size: 20, max_len: 20},
{name: text_2, type: text, vocab_size: 20, max_len: 20},
{name: category_1, type: category, vocab_size: 10},
{name: category_2, type: category, vocab_size: 15},
{name: number_1, type: number},
{name: number_2, type: number},
{name: binary_1, type: binary},
{name: binary_2, type: binary},
{name: set_1, type: set, vocab_size: 20, max_len: 20},
{name: set_2, type: set, vocab_size: 20, max_len: 20},
{name: bag_1, type: bag, vocab_size: 20, max_len: 10},
{name: bag_2, type: bag, vocab_size: 20, max_len: 10},
{name: sequence_1, type: sequence, vocab_size: 20, max_len: 20},
{name: sequence_2, type: sequence, vocab_size: 20, max_len: 20},
{name: timeseries_1, type: timeseries, max_len: 20},
{name: timeseries_2, type: timeseries, max_len: 20},
{name: date_1, type: date},
{name: date_2, type: date},
{name: h3_1, type: h3},
{name: h3_2, type: h3},
{name: vector_1, type: vector},
{name: vector_2, type: vector},
]
"""
if dataset_size is None or features is None or output_path is None:
raise ValueError(
"Missing one or more required parameters: '--dataset_size', '--features' or '--output_path'"
)
dataset = build_synthetic_dataset(dataset_size, features)
save_csv(output_path, dataset)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script generates a synthetic dataset.",
prog="ludwig synthesize_dataset",
usage="%(prog)s [options]",
)
parser.add_argument("-od", "--output_path", type=str, help="output CSV file path")
parser.add_argument("-d", "--dataset_size", help="size of the dataset", type=int, default=100)
parser.add_argument(
"-f",
"--features",
default="[\
{name: text_1, type: text, vocab_size: 20, max_len: 20}, \
{name: text_2, type: text, vocab_size: 20, max_len: 20}, \
{name: category_1, type: category, vocab_size: 10}, \
{name: category_2, type: category, vocab_size: 15}, \
{name: number_1, type: number}, \
{name: number_2, type: number}, \
{name: binary_1, type: binary}, \
{name: binary_2, type: binary}, \
{name: set_1, type: set, vocab_size: 20, max_len: 20}, \
{name: set_2, type: set, vocab_size: 20, max_len: 20}, \
{name: bag_1, type: bag, vocab_size: 20, max_len: 10}, \
{name: bag_2, type: bag, vocab_size: 20, max_len: 10}, \
{name: sequence_1, type: sequence, vocab_size: 20, max_len: 20}, \
{name: sequence_2, type: sequence, vocab_size: 20, max_len: 20}, \
{name: timeseries_1, type: timeseries, max_len: 20}, \
{name: timeseries_2, type: timeseries, max_len: 20}, \
{name: date_1, type: date}, \
{name: date_2, type: date}, \
{name: h3_1, type: h3}, \
{name: h3_2, type: h3}, \
{name: vector_1, type: vector}, \
{name: vector_2, type: vector}, \
]",
type=yaml.safe_load,
help="list of features to generate in YAML format. "
"Provide a list containing one dictionary for each feature, "
"each dictionary must include a name, a type "
"and can include some generation parameters depending on the type",
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("synthesize_dataset", *sys_argv)
# There is no log level parameter yet; this is a placeholder in case one is added later.
# args.logging_level = get_logging_level_registry[args.logging_level]
# logging.getLogger('ludwig').setLevel(
# args.logging_level
# )
# global logger
# logger = logging.getLogger('ludwig.data.dataset_synthesizer')
print_ludwig("Synthesize Dataset", LUDWIG_VERSION)
cli_synthesize_dataset(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
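# Illustrative sketch (not part of Ludwig): the --features option above relies on argparse
# calling the `type` callable on the raw string, including on a string default. The example
# below shows that mechanism with json.loads as a standard-library stand-in for
# yaml.safe_load; the program name and feature values are assumptions made for this example.

```python
import argparse
import json

parser = argparse.ArgumentParser(prog="synthesize_sketch")  # hypothetical program name
parser.add_argument(
    "-f",
    "--features",
    # A string default is passed through `type`, just like a command-line value.
    default='[{"name": "binary_1", "type": "binary"}]',
    type=json.loads,  # stand-in for yaml.safe_load
)

default_args = parser.parse_args([])
custom_args = parser.parse_args(["--features", '[{"name": "number_1", "type": "number"}]'])

print(default_args.features)
print(custom_args.features)
```

# In both cases args.features arrives as a list of dictionaries, which is the shape that
# cli_synthesize_dataset expects.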
================================================
FILE: ludwig/data/negative_sampling.py
================================================
import logging
import time
from typing import Any
import numpy as np
import pandas as pd
import scipy
from ludwig.utils.types import DataFrame
def _negative_sample_user(interaction_row: np.ndarray, neg_pos_ratio: int, extra_samples: int) -> tuple[list[int], int]:
"""Returns a list of negative item indices for given user-item interactions.
If there are not enough negative items, takes all of them and carries the shortfall over as extra samples;
otherwise, samples without replacement.
Params:
interaction_row: user-item interaction row
neg_pos_ratio: number of negative samples per positive sample
extra_samples: number of additional samples to add to the negative sample list
Returns:
Tuple of list of negative item indices and number of extra samples
"""
# Find all items that are not interacted with by the user
neg_items = np.where(interaction_row == 0)[1]
available_samples = len(neg_items)
# Randomly sample negative items
npos = interaction_row.shape[1] - len(neg_items)
samples_required = npos * neg_pos_ratio + extra_samples
should_sample = samples_required <= available_samples
neg_items = np.random.choice(neg_items, samples_required, replace=False) if should_sample else neg_items
return neg_items.tolist(), max(0, samples_required - available_samples)
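# Illustrative sketch (not part of Ludwig): the carry-over rule of _negative_sample_user,
# restated in pure Python on a flat list instead of a NumPy matrix row. The function name
# sample_negatives and the toy rows are assumptions made for this example.

```python
import random


def sample_negatives(interaction_row, neg_pos_ratio, extra_samples, rng=random):
    """Return (negative item indices, number of samples still owed)."""
    # Items the user has not interacted with are the candidate negatives.
    neg_items = [idx for idx, value in enumerate(interaction_row) if value == 0]
    available = len(neg_items)
    npos = len(interaction_row) - available
    required = npos * neg_pos_ratio + extra_samples
    if required <= available:
        # Enough candidates: draw without replacement.
        neg_items = rng.sample(neg_items, required)
    # Otherwise keep every candidate and report the shortfall to the caller.
    return neg_items, max(0, required - available)


# One positive, three candidates: one negative is drawn and nothing is owed.
sampled, carry = sample_negatives([1, 0, 0, 0], neg_pos_ratio=1, extra_samples=0)
# Three positives, one candidate: the single candidate is used and two samples are owed.
all_taken, shortfall = sample_negatives([1, 1, 1, 0], neg_pos_ratio=1, extra_samples=0)

print(sampled, carry)
print(all_taken, shortfall)
```

# The shortfall returned for one user is passed in as extra_samples for the next user, which
# is how negative_sample keeps the overall negative-to-positive ratio close to neg_pos_ratio.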
def negative_sample(
df: DataFrame,
user_id_col: str = "customer_id",
item_id_col: str = "article_id",
label_col: str = "label",
neg_pos_ratio: int = 1,
neg_val: Any = 0,
log_pct: int = 0,
):
"""Negative sampling for implicit feedback datasets.
Params:
df: DataFrame containing user-item interactions
user_id_col: column name for user ids
item_id_col: column name for item ids
label_col: column name for interaction labels (e.g. 1 for positive interaction)
neg_pos_ratio: number of negative samples per positive sample
neg_val: label value for the negative samples
log_pct: log progress every log_pct percent of users. 0 to disable
Returns:
Input DataFrame with negative samples appended
Source: https://petamind.com/fast-uniform-negative-sampling-for-rating-matrix/
"""
# TODO(joppe): support out of memory negative sampling using Dask
if not isinstance(df, pd.DataFrame):
df = df.compute()
# Initialize sparse COOrdinate matrix from users and items in existing interactions
user_id_cat = df[user_id_col].astype("category").cat
user_id_codes = user_id_cat.codes.values
item_id_cat = df[item_id_col].astype("category").cat
item_id_codes = item_id_cat.codes.values
interactions_sparse = scipy.sparse.coo_matrix((df[label_col], (user_id_codes, item_id_codes)))
# Convert to dense user-item matrix so we can iterate
interactions_dense = interactions_sparse.todense()
nrows = interactions_dense.shape[0]
# Clamp to at least 1 so the modulo in the loop below never divides by zero
niter_log = max(1, int(nrows * log_pct / 100))
start_time = time.time()
user_indices, item_indices = [], []
extra_samples = 0
for user_idx, interaction_row in enumerate(interactions_dense):
if log_pct > 0 and user_idx % niter_log == 0:
logging.info(
f"Negative sampling progress: {float(user_idx) * 100 / nrows:0.0f}% "
f"in {time.time() - start_time:0.2f}s"
)
neg_items_for_user, extra_samples = _negative_sample_user(interaction_row, neg_pos_ratio, extra_samples)
# Add to negative user-item pairs
item_indices += neg_items_for_user
user_indices += [user_idx] * len(neg_items_for_user)
negative_samples = pd.DataFrame(
{
# Map back to original user and item ids
user_id_col: user_id_cat.categories[user_indices],
item_id_col: item_id_cat.categories[item_indices],
label_col: [neg_val] * len(item_indices),
}
)
return pd.concat([df[[user_id_col, item_id_col, label_col]], negative_samples])
================================================
FILE: ludwig/data/postprocessing.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
from typing import Any, Optional
import numpy as np
import pandas as pd
import torch
from ludwig.backend import LOCAL_BACKEND
from ludwig.data.utils import convert_to_dict
from ludwig.utils.data_utils import DATAFRAME_FORMATS, DICT_FORMATS
from ludwig.utils.dataframe_utils import to_numpy_dataset
from ludwig.utils.fs_utils import has_remote_protocol, open_file
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.strings_utils import make_safe_filename
from ludwig.utils.types import DataFrame
def postprocess(
predictions,
output_features,
training_set_metadata,
output_directory="",
backend=LOCAL_BACKEND,
skip_save_unprocessed_output=False,
) -> DataFrame:
if not backend.is_coordinator():
# Only save unprocessed output on the coordinator
skip_save_unprocessed_output = True
saved_keys = set()
if not skip_save_unprocessed_output:
_save_as_numpy(predictions, output_directory, saved_keys, backend)
def postprocess_batch(df):
for of_name, output_feature in output_features.items():
df = output_feature.postprocess_predictions(
df,
training_set_metadata[of_name],
)
return df
# We disable tensor extension casting here because this step is the final data processing step and
# we do not expect to return to Ray Datasets after this point. The dtype of the predictions will be
# whatever they would be if we did all postprocessing in Dask.
predictions = backend.df_engine.map_batches(predictions, postprocess_batch, enable_tensor_extension_casting=False)
# Save any new columns but do not save the original columns again
if not skip_save_unprocessed_output:
_save_as_numpy(predictions, output_directory, saved_keys, backend)
return predictions
def _save_as_numpy(predictions, output_directory, saved_keys, backend):
predictions = predictions[[c for c in predictions.columns if c not in saved_keys]]
npy_filename = os.path.join(output_directory, "{}.npy")
numpy_predictions = to_numpy_dataset(predictions, backend)
for k, v in numpy_predictions.items():
k = k.replace("<", "[").replace(">", "]")  # Replace <UNK> and <PAD> with [UNK], [PAD]
if k not in saved_keys:
if has_remote_protocol(output_directory):
with open_file(npy_filename.format(make_safe_filename(k)), mode="wb") as f:
np.save(f, v)
else:
np.save(npy_filename.format(make_safe_filename(k)), v)
saved_keys.add(k)
def convert_dict_to_df(predictions: dict[str, dict[str, list[Any] | torch.Tensor | np.ndarray]]) -> pd.DataFrame:
"""Converts a dictionary of predictions into a pandas DataFrame.
Example format of predictions dictionary:
{
"binary_C82EB": {
"predictions": torch.tensor([True, True, True, False]),
"probabilities": torch.tensor([[0.4777, 0.5223], [0.4482, 0.5518], [0.4380, 0.5620], [0.5059, 0.4941]]),
},
"category_1491D": {
"predictions": ["NkNUG", "NkNUG", "NkNUG", "NkNUG"],
"probabilities": torch.tensor(
[
[0.1058, 0.4366, 0.1939, 0.2637],
[0.0816, 0.4807, 0.1978, 0.2399],
[0.0907, 0.4957, 0.1829, 0.2308],
[0.0728, 0.5015, 0.1900, 0.2357],
]
),
},
"num_7B25F": {"predictions": torch.tensor([2.0436, 2.1158, 2.1222, 2.1964])},
}
"""
output = {}
for of_name, preds_dict in predictions.items():
for key, value in preds_dict.items():
output_key = f"{of_name}_{key}"
if not isinstance(value, list):
value = value.tolist()
output[output_key] = value
return pd.DataFrame.from_dict(output)
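# Illustrative sketch (not part of Ludwig): the flattening step of convert_dict_to_df, shown
# without pandas or torch. array.array stands in for a tensor because it also exposes
# tolist(); the function name flatten_predictions and the feature names are assumptions
# made for this example.

```python
from array import array


def flatten_predictions(predictions):
    """Flatten {feature: {key: values}} into {"feature_key": list_of_values}."""
    output = {}
    for of_name, preds_dict in predictions.items():
        for key, value in preds_dict.items():
            # Anything that is not already a list is converted through tolist().
            if not isinstance(value, list):
                value = value.tolist()
            output[f"{of_name}_{key}"] = value
    return output


columns = flatten_predictions(
    {
        "binary_out": {"predictions": [True, False]},
        "num_out": {"predictions": array("d", [2.0, 2.5])},
    }
)
print(columns)
```

# Each flattened key becomes one DataFrame column when the dictionary is handed to
# pd.DataFrame.from_dict, as in convert_dict_to_df.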
def convert_predictions(
predictions, output_features, return_type="dict", backend: Optional["Backend"] = None # noqa: F821
):
convert_fn = get_from_registry(return_type, conversion_registry)
return convert_fn(predictions, output_features, backend)
def convert_to_df(
predictions,
output_features,
backend: Optional["Backend"] = None, # noqa: F821
):
return predictions
conversion_registry = {
**{format: convert_to_dict for format in DICT_FORMATS},
**{format: convert_to_df for format in DATAFRAME_FORMATS},
}
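# Illustrative sketch (not part of Ludwig): conversion_registry maps every format name to a
# converter, so convert_predictions only needs a dictionary lookup. The format names and the
# two converters below are simplified assumptions; Ludwig's real values come from
# DICT_FORMATS, DATAFRAME_FORMATS and convert_to_dict.

```python
def to_dict(predictions):
    """Stand-in converter: build a dict from (key, value) pairs."""
    return dict(predictions)


def passthrough(predictions):
    """Stand-in converter: return the input unchanged, like convert_to_df."""
    return predictions


DICT_LIKE = {"dict", "dictionary"}  # hypothetical format names
FRAME_LIKE = {"df", "dataframe"}  # hypothetical format names

registry = {
    **{fmt: to_dict for fmt in DICT_LIKE},
    **{fmt: passthrough for fmt in FRAME_LIKE},
}


def convert(predictions, return_type="dict"):
    """Look up the converter for return_type and apply it."""
    if return_type not in registry:
        raise ValueError(f"Unknown return type: {return_type}")
    return registry[return_type](predictions)


pairs = [("a", 1)]
as_dict = convert(pairs, "dict")
unchanged = convert(pairs, "df")
print(as_dict)
print(unchanged)
```

# Adding a new return type under this pattern means adding one registry entry rather than
# another branch in the calling code.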
================================================
FILE: ludwig/data/preprocessing.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import contextlib
import logging
import warnings
from abc import ABC, abstractmethod
import numpy as np
import pandas as pd
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.backend import Backend, LOCAL_BACKEND
from ludwig.config_validation.preprocessing import check_global_max_sequence_length_fits_prompt_template
from ludwig.constants import (
BFILL,
CHECKSUM,
COLUMN,
DEFAULTS,
DROP_ROW,
ENCODER,
FFILL,
FILL_WITH_CONST,
FILL_WITH_FALSE,
FILL_WITH_MEAN,
FILL_WITH_MODE,
FILL_WITH_TRUE,
FULL,
META,
MIN_DATASET_SPLIT_ROWS,
MODEL_ECD,
NAME,
NUMBER,
PREPROCESSING,
PROC_COLUMN,
SPLIT,
SRC,
TEST,
TEXT,
TRAINING,
TYPE,
VALIDATION,
)
from ludwig.data.cache.manager import DatasetCache
from ludwig.data.cache.types import wrap
from ludwig.data.concatenate_datasets import concatenate_df, concatenate_files, concatenate_splits
from ludwig.data.dataset.base import Dataset
from ludwig.data.prompt import format_input_with_prompt, index_column
from ludwig.data.split import get_splitter, split_dataset
from ludwig.data.utils import get_input_and_output_features, set_fixed_split
from ludwig.datasets import load_dataset_uris
from ludwig.features.feature_registries import get_base_type_registry
from ludwig.models.embedder import create_embed_batch_size_evaluator, create_embed_transform_fn
from ludwig.schema.encoders.utils import get_encoder_cls
from ludwig.types import FeatureConfigDict, ModelConfigDict, PreprocessingConfigDict, TrainingSetMetadataDict
from ludwig.utils import data_utils, strings_utils
from ludwig.utils.backward_compatibility import upgrade_metadata
from ludwig.utils.data_utils import (
CACHEABLE_FORMATS,
CSV_FORMATS,
DATA_TEST_PARQUET_FP,
DATA_TRAIN_HDF5_FP,
DATA_TRAIN_PARQUET_FP,
DATA_VALIDATION_PARQUET_FP,
DATAFRAME_FORMATS,
DICT_FORMATS,
EXCEL_FORMATS,
FEATHER_FORMATS,
figure_data_format,
FWF_FORMATS,
get_split_path,
HDF5_FORMATS,
HTML_FORMATS,
JSON_FORMATS,
JSONL_FORMATS,
ORC_FORMATS,
override_in_memory_flag,
PARQUET_FORMATS,
PICKLE_FORMATS,
read_csv,
read_excel,
read_feather,
read_fwf,
read_html,
read_json,
read_jsonl,
read_orc,
read_parquet,
read_pickle,
read_sas,
read_spss,
read_stata,
read_tsv,
sanitize_column_names,
SAS_FORMATS,
SPSS_FORMATS,
STATA_FORMATS,
TSV_FORMATS,
)
from ludwig.utils.dataframe_utils import is_dask_series_or_df
from ludwig.utils.defaults import (
default_prediction_preprocessing_parameters,
default_random_seed,
default_training_preprocessing_parameters,
)
from ludwig.utils.fs_utils import file_lock, path_exists
from ludwig.utils.misc_utils import get_from_registry, merge_dict
from ludwig.utils.types import DataFrame, Series
# Opt-in to future pandas behavior: fillna/ffill/bfill will no longer silently downcast dtypes
pd.set_option("future.no_silent_downcasting", True)
REPARTITIONING_FEATURE_TYPES = {"image", "audio"}
logger = logging.getLogger(__name__)
class DataFormatPreprocessor(ABC):
@staticmethod
@abstractmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
pass
@staticmethod
@abstractmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
pass
@staticmethod
@abstractmethod
def prepare_processed_data(
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
):
pass
class DictPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
num_overrides = override_in_memory_flag(features, True)
if num_overrides > 0:
logger.warning("Using in_memory = False is not supported with dict data format.")
df_engine = backend.df_engine
if dataset is not None:
dataset = df_engine.from_pandas(pd.DataFrame(dataset))
if training_set is not None:
training_set = df_engine.from_pandas(pd.DataFrame(training_set))
if validation_set is not None:
validation_set = df_engine.from_pandas(pd.DataFrame(validation_set))
if test_set is not None:
test_set = df_engine.from_pandas(pd.DataFrame(test_set))
return _preprocess_df_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
training_set_metadata=training_set_metadata,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset, training_set_metadata = build_dataset(
config,
pd.DataFrame(dataset),
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class DataFramePreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
num_overrides = override_in_memory_flag(features, True)
if num_overrides > 0:
logger.warning("Using in_memory = False is not supported with dataframe data format.")
if isinstance(dataset, pd.DataFrame):
dataset = backend.df_engine.from_pandas(dataset)
return _preprocess_df_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
training_set_metadata=training_set_metadata,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
if isinstance(dataset, pd.DataFrame):
dataset = backend.df_engine.from_pandas(dataset)
dataset, training_set_metadata = build_dataset(
config,
dataset,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class CSVPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_csv,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_csv(dataset, df_lib=backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class TSVPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_tsv,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_tsv(dataset, df_lib=backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class JSONPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_json,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_json(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class JSONLPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_jsonl,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_jsonl(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class ExcelPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_excel,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_excel(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class ParquetPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_parquet,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_parquet(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
@staticmethod
def prepare_processed_data(
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
):
test_set = test_set if test_set and path_exists(test_set) else None
if test_set and isinstance(test_set, str) and DATA_TEST_PARQUET_FP not in training_set_metadata:
training_set_metadata[DATA_TEST_PARQUET_FP] = test_set
validation_set = validation_set if validation_set and path_exists(validation_set) else None
if (
validation_set
and isinstance(validation_set, str)
and DATA_VALIDATION_PARQUET_FP not in training_set_metadata
):
training_set_metadata[DATA_VALIDATION_PARQUET_FP] = validation_set
if training_set and isinstance(training_set, str) and DATA_TRAIN_PARQUET_FP not in training_set_metadata:
training_set_metadata[DATA_TRAIN_PARQUET_FP] = training_set
return training_set, test_set, validation_set, training_set_metadata
class PicklePreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_pickle,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_pickle(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class FatherPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_feather,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_feather(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class FWFPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_fwf,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_fwf(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class HTMLPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_html,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_html(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class ORCPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_orc,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_orc(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class SASPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_sas,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_sas(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class SPSSPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_spss,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_spss(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class StataPreprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return _preprocess_file_for_training(
config,
features,
dataset,
training_set,
validation_set,
test_set,
read_fn=read_stata,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
dataset_df = read_stata(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
dataset, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="prediction",
metadata=training_set_metadata,
backend=backend,
callbacks=callbacks,
)
return dataset, training_set_metadata, None
class HDF5Preprocessor(DataFormatPreprocessor):
@staticmethod
def preprocess_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
return HDF5Preprocessor.prepare_processed_data(
features,
dataset,
training_set,
validation_set,
test_set,
training_set_metadata,
skip_save_processed_input,
preprocessing_params,
backend,
random_seed,
)
@staticmethod
def preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
):
hdf5_fp = dataset
dataset = load_hdf5(dataset, preprocessing_params, backend, split_data=False, shuffle_training=False)
return dataset, training_set_metadata, hdf5_fp
@staticmethod
def prepare_processed_data(
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
):
if dataset is None and training_set is None:
raise ValueError("One of `dataset` or `training_set` must not be None")
not_none_set = dataset if dataset is not None else training_set
if not training_set_metadata:
raise ValueError("When providing HDF5 data, training_set_metadata must not be None.")
logger.info("Using full hdf5 and json")
if DATA_TRAIN_HDF5_FP not in training_set_metadata:
logger.warning(
"data_train_hdf5_fp not present in training_set_metadata. "
"Adding it with the current HDF5 file path {}".format(not_none_set)
)
training_set_metadata[DATA_TRAIN_HDF5_FP] = not_none_set
elif training_set_metadata[DATA_TRAIN_HDF5_FP] != not_none_set:
logger.warning(
"data_train_hdf5_fp in training_set_metadata is {}, "
"different from the current HDF5 file path {}. "
"Replacing it".format(training_set_metadata[DATA_TRAIN_HDF5_FP], not_none_set)
)
training_set_metadata[DATA_TRAIN_HDF5_FP] = not_none_set
if dataset is not None:
training_set, test_set, validation_set = load_hdf5(
dataset, preprocessing_params, backend, shuffle_training=True
)
elif training_set is not None:
kwargs = dict(preprocessing_params=preprocessing_params, backend=backend, split_data=False)
training_set = load_hdf5(training_set, shuffle_training=True, **kwargs)
if validation_set is not None:
validation_set = load_hdf5(validation_set, shuffle_training=False, **kwargs)
if test_set is not None:
test_set = load_hdf5(test_set, shuffle_training=False, **kwargs)
return training_set, test_set, validation_set, training_set_metadata
data_format_preprocessor_registry = {
**{fmt: DictPreprocessor for fmt in DICT_FORMATS},
**{fmt: DataFramePreprocessor for fmt in DATAFRAME_FORMATS},
**{fmt: CSVPreprocessor for fmt in CSV_FORMATS},
**{fmt: TSVPreprocessor for fmt in TSV_FORMATS},
**{fmt: JSONPreprocessor for fmt in JSON_FORMATS},
**{fmt: JSONLPreprocessor for fmt in JSONL_FORMATS},
**{fmt: ExcelPreprocessor for fmt in EXCEL_FORMATS},
**{fmt: ParquetPreprocessor for fmt in PARQUET_FORMATS},
**{fmt: PicklePreprocessor for fmt in PICKLE_FORMATS},
**{fmt: FWFPreprocessor for fmt in FWF_FORMATS},
**{fmt: FatherPreprocessor for fmt in FEATHER_FORMATS},
**{fmt: HTMLPreprocessor for fmt in HTML_FORMATS},
**{fmt: ORCPreprocessor for fmt in ORC_FORMATS},
**{fmt: SASPreprocessor for fmt in SAS_FORMATS},
**{fmt: SPSSPreprocessor for fmt in SPSS_FORMATS},
**{fmt: StataPreprocessor for fmt in STATA_FORMATS},
**{fmt: HDF5Preprocessor for fmt in HDF5_FORMATS},
}
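# Illustrative sketch (not part of the module): the registry above expands each group of
# format aliases into {alias: preprocessor class} pairs and merges them into one lookup table.
# The format groups, class names, and `lookup_preprocessor` helper below are hypothetical
# stand-ins chosen for the example; they are not Ludwig's actual constants or API.

```python
# Hypothetical alias groups, standing in for constants such as CSV_FORMATS.
EXAMPLE_DELIMITED_FORMATS = {"csv", "tsv"}
EXAMPLE_PARQUET_FORMATS = {"parquet"}


class ExampleDelimitedPreprocessor:
    """Hypothetical preprocessor for delimited text files."""


class ExampleParquetPreprocessor:
    """Hypothetical preprocessor for Parquet files."""


# Each dict comprehension maps every alias in a group to one class. If two groups
# shared an alias, the group listed later would win.
example_registry = {
    **{fmt: ExampleDelimitedPreprocessor for fmt in EXAMPLE_DELIMITED_FORMATS},
    **{fmt: ExampleParquetPreprocessor for fmt in EXAMPLE_PARQUET_FORMATS},
}


def lookup_preprocessor(data_format):
    """Return the class registered for `data_format`, or raise a descriptive error."""
    try:
        return example_registry[data_format]
    except KeyError:
        raise ValueError(f"Unsupported data format: {data_format!r}") from None
```

# Keeping one entry per alias means callers need a single dictionary lookup, whatever
# spelling of the format name they were given.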
def build_dataset(
config,
dataset_df,
features,
global_preprocessing_parameters,
mode,
metadata=None,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
skip_save_processed_input=False,
callbacks=None,
):
"""Builds a dataset from a dataframe and a list of features.
Args:
config: A dictionary containing the Ludwig model configuration
dataset_df: Pandas or Dask dataframe
features: List of features
global_preprocessing_parameters: Global preprocessing parameters
mode: One of ['training', 'prediction']
metadata: Training set metadata if available
backend: Backend
random_seed: Random seed
skip_save_processed_input: Whether to skip saving the processed input
callbacks: List of callbacks
Returns:
A tuple of (dataset, metadata)
"""
df_engine = backend.df_engine
if df_engine.partitioned:
if any(f["type"] in REPARTITIONING_FEATURE_TYPES for f in features) and dataset_df.npartitions > 1:
# A globally unique index only matters if you know that there will be a repartition downstream for some
# particular feature, i.e. for Image and Audio features on a Ray backend.
# - There is a join operation in `df_like`, and the only way to do the operation is if the partitions across
# all feature columns are aligned.
# - In order to align the partitions, we require a way of matching samples to one another across all
# partitions. Therefore, we must reset_index to create a globally unique index.
# - If the number of partitions is 1, it is *highly likely* the index is globally unique. Auto-assigned
# Dask indices in this case are unique, and we pd.concat train, val, and test sets with ignore_index=True.
# If there will NOT be a repartition downstream, then we can skip this step.
# - In this case, the partitions should remain aligned throughout.
# - Further, while the indices might not be globally unique, they should be unique within each partition.
# - These two properties make it possible to do the join op within each partition without a global index.
logger.warning(
f"Dataset has {dataset_df.npartitions} partitions and feature types that cause repartitioning. "
f"Resetting index to ensure globally unique indices."
)
dataset_df = df_engine.reset_index(dataset_df)
dataset_df = df_engine.parallelize(dataset_df)
# Ensure that column names with non-word characters won't cause problems for downstream operations.
# NOTE: Must be kept consistent with config sanitization in schema/model_types/base.py.
dataset_df = sanitize_column_names(dataset_df)
if mode == "training":
sample_ratio = global_preprocessing_parameters["sample_ratio"]
sample_size = global_preprocessing_parameters["sample_size"]
dataset_df = _get_sampled_dataset_df(dataset_df, df_engine, sample_ratio, sample_size, random_seed)
# If persisting DataFrames in memory is enabled, we want to do this after
# each batch of parallel ops in order to avoid redundant computation
dataset_df = df_engine.persist(dataset_df)
if mode == "training":
default_preprocessing_parameters = default_training_preprocessing_parameters
elif mode == "prediction":
default_preprocessing_parameters = default_prediction_preprocessing_parameters
else:
raise ValueError(f"Invalid mode {mode}")
global_preprocessing_parameters = merge_dict(default_preprocessing_parameters, global_preprocessing_parameters)
split_col = None
if global_preprocessing_parameters["split"]["type"] == "fixed":
if global_preprocessing_parameters["split"]["column"] in dataset_df.columns:
split_col = dataset_df[global_preprocessing_parameters["split"]["column"]]
else:
logger.warning(
f"Specified split column {global_preprocessing_parameters['split']['column']} for fixed "
f"split strategy was not found in dataset." # noqa: E713
)
# update input features with prompt configs during preprocessing (as opposed to during the model forward pass)
# so that we can compute metadata and build the dataset correctly.
logger.debug("handle text features with prompt parameters")
synthesized_dataset_cols = handle_features_with_prompt_config(
config, dataset_df, features, split_col=split_col, backend=backend
)
# Get all the unique preprocessing features to compute
feature_configs = []
feature_hashes = set()
for feature in features:
if feature[PROC_COLUMN] not in feature_hashes:
feature_configs.append(feature)
feature_hashes.add(feature[PROC_COLUMN])
dataset_cols = {}
for feature_config in feature_configs:
col_name = feature_config[COLUMN]
dataset_cols[col_name] = (
synthesized_dataset_cols[col_name] if col_name in synthesized_dataset_cols else dataset_df[col_name]
)
logger.debug("build preprocessing parameters")
feature_name_to_preprocessing_parameters = build_preprocessing_parameters(
dataset_cols, feature_configs, global_preprocessing_parameters, backend, metadata=metadata
)
# Happens after preprocessing parameters are built, so we can use precomputed fill values.
logger.debug("handle missing values")
# In some cases, there can be a (temporary) mismatch between the dtype of the column and the type expected by the
# preprocessing config (e.g., a categorical feature represented as an int-like column). In particular, Dask
# may raise an error even when there are no missing values in the column itself.
#
# Since we immediately cast all columns in accordance with their expected feature types after filling missing
# values, we work around the above issue by temporarily treating all columns as object dtype.
for col_key in dataset_cols:
dataset_cols[col_key] = dataset_cols[col_key].astype(object)
for feature_config in feature_configs:
preprocessing_parameters = feature_name_to_preprocessing_parameters[feature_config[NAME]]
handle_missing_values(dataset_cols, feature_config, preprocessing_parameters, backend)
# Happens after missing values are handled to avoid NaN casting issues.
logger.debug("cast columns")
cast_columns(dataset_cols, feature_configs, backend)
for callback in callbacks or []:
callback.on_build_metadata_start(dataset_df, mode)
logger.debug("build metadata")
metadata: TrainingSetMetadataDict = build_metadata(
config, metadata, feature_name_to_preprocessing_parameters, dataset_cols, feature_configs, backend
)
check_global_max_sequence_length_fits_prompt_template(metadata, global_preprocessing_parameters)
for callback in callbacks or []:
callback.on_build_metadata_end(dataset_df, mode)
for callback in callbacks or []:
callback.on_build_data_start(dataset_df, mode)
logger.debug("build data")
proc_cols = build_data(dataset_cols, feature_configs, metadata, backend, skip_save_processed_input)
for callback in callbacks or []:
callback.on_build_data_end(dataset_df, mode)
# Get any additional columns needed for splitting downstream, otherwise they will not be
# included in the preprocessed output.
split_params = global_preprocessing_parameters.get(SPLIT, {})
if "type" not in split_params and SPLIT in dataset_df:
warnings.warn(
'Detected "split" column in the data, but using default split type '
'"random". Did you mean to set split type to "fixed"?'
)
splitter = get_splitter(**split_params)
for column in splitter.required_columns:
if column not in dataset_df:
warnings.warn(
f"column: '{column}' is required by the dataset splitter with params: {split_params}, but '{column}' "
f"is not present in the `dataset_df` with columns: {dataset_df.columns}. This is acceptable in a "
"serving setting, where dataset splitting is irrelevant. You may see this warning if, for example, the "
"model was trained with a configuration that used a stratified split on the target column, but for "
"live predictions, a value for the target column is not to be provided."
)
continue
proc_cols[column] = dataset_df[column]
# TODO pyarrow: this is needed for caching to work with pyarrow. if removed, the following error is raised:
# "pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values". The data is reshaped when loaded
# by the batcher in the RayDataset class (see _prepare_batch).
if not skip_save_processed_input and backend.cache.data_format == "parquet":
for feature in features:
name = feature[NAME]
proc_column = feature[PROC_COLUMN]
reshape = metadata[name].get("reshape")
if reshape is not None:
proc_cols[proc_column] = backend.df_engine.map_objects(proc_cols[proc_column], lambda x: x.reshape(-1))
# Implements an outer join of proc_cols
dataset = backend.df_engine.df_like(dataset_df, proc_cols)
# At this point, there should be no missing values left in the dataframe, unless
# the DROP_ROW preprocessing option was selected, in which case we need to drop those
# rows.
len_dataset_before_drop_rows = len(dataset)
dataset = dataset.dropna()
len_dataset_after_drop_rows = len(dataset)
if len_dataset_before_drop_rows != len_dataset_after_drop_rows:
logger.warning(
f"Dropped a total of {len_dataset_before_drop_rows - len_dataset_after_drop_rows} rows out of "
f"{len_dataset_before_drop_rows} due to missing values"
)
# NaNs introduced by outer join change dtype of dataset cols (upcast to float64), so we need to cast them back.
col_name_to_dtype = {}
for col_name, col in proc_cols.items():
# if col is a list of list-like objects, we assume the internal dtype of each col[i] remains unchanged.
if isinstance(col, list) and isinstance(col[0], (list, np.ndarray, torch.Tensor)):
continue
dtype = col.dtype
# Skip non-numpy extension dtypes (e.g. TensorDtype from Ray, ArrowDtype from PyArrow)
# as they cannot be used with DataFrame.astype() reliably.
if not isinstance(dtype, np.dtype):
continue
col_name_to_dtype[col_name] = dtype
dataset = dataset.astype(col_name_to_dtype)
# Persist the completed dataset with no NaNs
dataset = backend.df_engine.persist(dataset)
# Remove partitions that are empty after removing NaNs
dataset = backend.df_engine.remove_empty_partitions(dataset)
# Embed features with fixed encoders
dataset = embed_fixed_features(dataset, feature_configs, metadata, backend)
return dataset, metadata
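# Illustrative sketch (not part of the module): build_dataset keeps only the first feature
# config for each distinct processed-column key, so shared columns are preprocessed once.
# The helper name `unique_by_key` and the literal key "proc_column" are assumptions made
# for this example; the real code indexes with the PROC_COLUMN constant.

```python
def unique_by_key(feature_dicts, key="proc_column"):
    """Return the first dict seen for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for feature in feature_dicts:
        if feature[key] not in seen:
            seen.add(feature[key])
            unique.append(feature)
    return unique


# Two features share one processed column, so only the first of them is kept.
example_features = [
    {"name": "text_a", "proc_column": "text_abc"},
    {"name": "text_b", "proc_column": "text_abc"},
    {"name": "num", "proc_column": "num_def"},
]
unique_features = unique_by_key(example_features)
unique_names = [f["name"] for f in unique_features]
```

# A set gives constant-time membership checks while the list keeps the original order.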
def embed_fixed_features(
dataset: DataFrame, feature_configs: list[FeatureConfigDict], metadata: TrainingSetMetadataDict, backend: Backend
) -> DataFrame:
"""Transforms every input feature with cacheable encoder embeddings into its encoded form and updates
metadata."""
# Encode features in bulk at the end
features_to_encode = get_features_with_cacheable_fixed_embeddings(feature_configs, metadata)
if not features_to_encode:
return dataset
logger.info(f"Cache encoder embeddings for features: {[f[NAME] for f in features_to_encode]}")
for feature in features_to_encode:
# Temporarily set to False to ensure proper encoding
metadata[feature[NAME]][PREPROCESSING]["cache_encoder_embeddings"] = False
batch_size = backend.tune_batch_size(create_embed_batch_size_evaluator(features_to_encode, metadata), len(dataset))
transform_fn = create_embed_transform_fn(features_to_encode, metadata)
results = backend.batch_transform(dataset, batch_size, transform_fn, name="Caching encoder embeddings")
for feature in features_to_encode:
# Set metadata so we know to skip encoding the feature
metadata[feature[NAME]][PREPROCESSING]["cache_encoder_embeddings"] = True
return results
def _get_sampled_dataset_df(dataset_df, df_engine, sample_ratio, sample_size, random_seed):
df_len = len(dataset_df)
if sample_ratio < 1.0:
if not df_engine.partitioned and df_len * sample_ratio < 1:
raise ValueError(
f"sample_ratio {sample_ratio} is too small for dataset of length {df_len}. "
f"Please increase sample_ratio or use a larger dataset."
)
logger.debug(f"sample {sample_ratio} of data")
dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)
if sample_size:
if sample_size < df_len:
# Cannot use 'n' parameter when using dask DataFrames -- only 'frac' is supported
sample_ratio = sample_size / df_len
dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)
else:
logger.warning("sample_size is larger than dataset size, ignoring sample_size")
return dataset_df
def get_features_with_cacheable_fixed_embeddings(
feature_configs: list[FeatureConfigDict], metadata: TrainingSetMetadataDict
) -> list[FeatureConfigDict]:
"""Returns list of features with `cache_encoder_embeddings=True` set in the preprocessing config."""
features_to_encode = []
for feature_config in feature_configs:
# deal with encoders that have fixed preprocessing
if ENCODER in feature_config:
encoder_params = feature_config[ENCODER]
if TYPE in encoder_params:
preprocessing = metadata[feature_config[NAME]][PREPROCESSING]
if preprocessing.get("cache_encoder_embeddings"):
# TODO(travis): passing in MODEL_ECD is a hack here that can be removed once we move to using
# the config object everywhere in preprocessing. Then we won't need to do the lookup on the
# encoder schema at all.
encoder_class = get_encoder_cls(MODEL_ECD, feature_config[TYPE], encoder_params[TYPE])
encoder = encoder_class.from_dict(encoder_params)
if not encoder.can_cache_embeddings():
raise ValueError(
f"Set `cache_encoder_embeddings=True` for feature {feature_config[NAME]} with "
f"encoder {encoder_params[TYPE]}, but encoder embeddings are not static."
)
# Convert to Ray Datasets, map batches to encode, then convert back to Dask
features_to_encode.append(feature_config)
return features_to_encode
def cast_columns(dataset_cols, features, backend) -> None:
"""Casts columns based on their feature type."""
for feature in features:
# TODO: figure out whether the cast_column function needs additional parameters.
try:
dataset_cols[feature[COLUMN]] = get_from_registry(feature[TYPE], get_base_type_registry()).cast_column(
dataset_cols[feature[COLUMN]], backend
)
except KeyError as e:
# Chain the original exception so the traceback shows where the lookup failed.
raise KeyError(
f"Feature name {e} specified in the config was not found in dataset with columns: "  # noqa: E713
+ f"{list(dataset_cols.keys())}"
) from e
def merge_preprocessing(
feature_config: FeatureConfigDict, global_preprocessing_parameters: PreprocessingConfigDict
) -> PreprocessingConfigDict:
if PREPROCESSING not in feature_config:
return global_preprocessing_parameters[feature_config[TYPE]]
return merge_dict(global_preprocessing_parameters[feature_config[TYPE]], feature_config[PREPROCESSING])
def build_preprocessing_parameters(
dataset_cols: dict[str, Series],
feature_configs: list[FeatureConfigDict],
global_preprocessing_parameters: PreprocessingConfigDict,
backend: Backend,
metadata: TrainingSetMetadataDict | None = None,
) -> dict[str, PreprocessingConfigDict]:
if metadata is None:
metadata = {}
feature_name_to_preprocessing_parameters = {}
for feature_config in feature_configs:
feature_name = feature_config[NAME]
# if metadata already exists, we can use it to get preprocessing parameters
if feature_name in metadata:
feature_name_to_preprocessing_parameters[feature_name] = metadata[feature_name][PREPROCESSING]
continue
preprocessing_parameters = feature_config[PREPROCESSING]
missing_value_strategy = preprocessing_parameters["missing_value_strategy"]
fill_value = precompute_fill_value(
dataset_cols, feature_config, missing_value_strategy, preprocessing_parameters, backend
)
if fill_value is not None:
preprocessing_parameters.update({"computed_fill_value": fill_value})
# Handle outlier replacement
outlier_strategy = preprocessing_parameters.get("outlier_strategy")
if outlier_strategy is not None:
if outlier_strategy != missing_value_strategy:
outlier_fill_value = precompute_fill_value(
dataset_cols, feature_config, outlier_strategy, preprocessing_parameters, backend
)
else:
# Use fill value from missing_value_strategy to avoid redundant computation
outlier_fill_value = fill_value
if outlier_fill_value is not None:
preprocessing_parameters.update({"computed_outlier_fill_value": outlier_fill_value})
feature_name_to_preprocessing_parameters[feature_name] = preprocessing_parameters
return feature_name_to_preprocessing_parameters
def is_input_feature(feature_config: FeatureConfigDict) -> bool:
"""Utility function to check for the presence of encoder in the feature config to determine if the feature is
an input feature or output feature."""
return ENCODER in feature_config
def build_metadata(
config: ModelConfigDict,
metadata: TrainingSetMetadataDict,
feature_name_to_preprocessing_parameters: dict[str, PreprocessingConfigDict],
dataset_cols: dict[str, Series],
feature_configs: list[FeatureConfigDict],
backend: Backend,
) -> TrainingSetMetadataDict:
for feature_config in feature_configs:
feature_name = feature_config[NAME]
if feature_name in metadata:
continue
preprocessing_parameters = feature_name_to_preprocessing_parameters[feature_name]
column = dataset_cols[feature_config[COLUMN]]
metadata[feature_name] = get_from_registry(feature_config[TYPE], get_base_type_registry()).get_feature_meta(
config, column, preprocessing_parameters, backend, is_input_feature(feature_config)
)
metadata[feature_name][PREPROCESSING] = preprocessing_parameters
return metadata
def build_data(
input_cols: DataFrame,
feature_configs: list[dict],
training_set_metadata: dict,
backend: Backend,
skip_save_processed_input: bool,
) -> dict[str, DataFrame]:
"""Preprocesses the input dataframe columns, handles missing values, and potentially adds metadata to
training_set_metadata.
Args:
input_cols: Input dataframe to be processed.
feature_configs: List of feature configs.
training_set_metadata: Training set metadata. Additional fields may be added.
backend: Backend for data processing.
skip_save_processed_input: (bool) Whether to skip saving the processed input.
Returns:
Dictionary of (feature name) -> (processed data).
"""
proc_cols = {}
for feature_config in feature_configs:
# TODO(travis): instead of using raw dictionary, this should be loaded into a proper PreprocessingConfig
# object, so we don't need to hackily check for the presence of added keys.
preprocessing_parameters = training_set_metadata[feature_config[NAME]][PREPROCESSING]
# Need to run this again here as cast_columns may have introduced new missing values
handle_missing_values(input_cols, feature_config, preprocessing_parameters, backend)
# For features that support it, we perform outlier removal here using metadata computed on the full dataset
handle_outliers(
input_cols, feature_config, preprocessing_parameters, training_set_metadata[feature_config[NAME]], backend
)
get_from_registry(feature_config[TYPE], get_base_type_registry()).add_feature_data(
feature_config,
input_cols,
proc_cols,
training_set_metadata,
preprocessing_parameters,
backend,
skip_save_processed_input,
)
return proc_cols
def balance_data(
dataset_df: DataFrame,
output_features: list[dict],
preprocessing_parameters: dict,
backend: Backend,
random_seed: int,
):
"""Balances the training dataset using either over-sampling or under-sampling.
Args:
dataset_df: Input dataframe to be over-sampled or under-sampled.
output_features: List of feature configs.
preprocessing_parameters: Dictionary of the global preprocessing parameters.
backend: Backend for data processing.
random_seed: Integer to seed the random sampling to ensure determinism.
Returns:
An over-sampled or under-sampled training dataset.
"""
target = output_features[0][PROC_COLUMN]
if backend.df_engine.partitioned:
majority_class = backend.df_engine.compute(dataset_df[target].value_counts()).idxmax()
minority_class = backend.df_engine.compute(dataset_df[target].value_counts()).idxmin()
else:
majority_class = dataset_df[target].value_counts().idxmax()
minority_class = dataset_df[target].value_counts().idxmin()
majority_df = dataset_df[dataset_df[target] == majority_class]
minority_df = dataset_df[dataset_df[target] == minority_class]
if preprocessing_parameters["oversample_minority"]:
sample_fraction = (len(majority_df) * preprocessing_parameters["oversample_minority"]) / len(minority_df)
minority_df = minority_df.sample(frac=sample_fraction, replace=True, random_state=random_seed)
elif preprocessing_parameters["undersample_majority"]:
sample_fraction = int(len(minority_df) / preprocessing_parameters["undersample_majority"]) / len(majority_df)
majority_df = majority_df.sample(frac=sample_fraction, replace=False, random_state=random_seed)
balanced_df = backend.df_engine.concat([minority_df, majority_df])
return balanced_df
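# Illustrative sketch (not part of the module): the arithmetic behind the two sampling
# fractions in balance_data, worked on plain row counts. The helper names are hypothetical.
# `target_ratio` plays the role of the `oversample_minority` / `undersample_majority` values,
# read here as the desired minority-to-majority ratio.

```python
def oversample_fraction(n_majority, n_minority, target_ratio):
    """Fraction of the minority class to draw with replacement."""
    return (n_majority * target_ratio) / n_minority


def undersample_fraction(n_majority, n_minority, target_ratio):
    """Fraction of the majority class to keep, drawn without replacement."""
    return int(n_minority / target_ratio) / n_majority


# 900 majority rows, 100 minority rows, desired minority/majority ratio of 0.5.
over = oversample_fraction(900, 100, 0.5)
under = undersample_fraction(900, 100, 0.5)

# Row counts implied by each fraction.
minority_rows_after_oversampling = round(over * 100)
majority_rows_after_undersampling = round(under * 900)
```

# Over-sampling grows the minority class to 450 rows (450 / 900 = 0.5), while
# under-sampling shrinks the majority class to 200 rows (100 / 200 = 0.5).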
def precompute_fill_value(
dataset_cols, feature, missing_value_strategy: str, preprocessing_parameters: PreprocessingConfigDict, backend
):
"""Precomputes the fill value for a feature.
NOTE: this is called before NaNs are removed from the dataset. Modifications here must handle NaNs gracefully.
NOTE: this is called before columns are cast. Modifications here must handle dtype conversion gracefully.
"""
if missing_value_strategy == FILL_WITH_CONST:
return preprocessing_parameters["fill_value"]
elif missing_value_strategy == FILL_WITH_MODE:
# Dask evaluates lazily, so the index must be computed explicitly.
# Otherwise, Dask returns a lazy index structure instead of a value to use as a fill value.
return (
dataset_cols[feature[COLUMN]].value_counts().index.compute()[0]
if is_dask_series_or_df(dataset_cols[feature[COLUMN]], backend)
else dataset_cols[feature[COLUMN]].value_counts().index[0]
)
elif missing_value_strategy == FILL_WITH_MEAN:
if feature[TYPE] != NUMBER:
raise ValueError(
f"Filling missing values with mean is supported "
f"only for number types, not for type {feature[TYPE]}.",
)
return backend.df_engine.compute(dataset_cols[feature[COLUMN]].astype(float).mean())
elif missing_value_strategy in {FILL_WITH_FALSE, FILL_WITH_TRUE}:
distinct_values = backend.df_engine.compute(
dataset_cols[feature[COLUMN]].drop_duplicates().dropna()
).values.tolist()
if len(distinct_values) > 2:
raise ValueError(
f"Missing value strategy `{missing_value_strategy}` "
f"for column {feature[COLUMN]} expects at most 2 distinct values, "
f"found: {len(distinct_values)} (ex: {distinct_values[:10]})"
)
fill_to_bool_value = {FILL_WITH_FALSE: False, FILL_WITH_TRUE: True}
bool_needed = fill_to_bool_value[missing_value_strategy]
# Determine the False label.
# Distinct values are sorted in reverse to mirror the selection of the default fallback_true_label (in
# binary_feature.get_feature_meta) for binary columns with unconventional boolean values, "human"/"bot".
for v in sorted(distinct_values, reverse=True):
fallback_true_label = (
preprocessing_parameters["fallback_true_label"]
# By default, preprocessing_parameters.fallback_true_label is None.
if preprocessing_parameters["fallback_true_label"]
else "true"
)
if strings_utils.str2bool(v, fallback_true_label) is bool_needed:
return v
raise ValueError(
f"Unable to determine {bool_needed} value for column {feature[COLUMN]} "
f"with distinct values: {distinct_values}."
)
# Otherwise, we cannot precompute the fill value for this dataset
return None
@DeveloperAPI
def handle_missing_values(dataset_cols, feature, preprocessing_parameters: PreprocessingConfigDict, backend):
missing_value_strategy = preprocessing_parameters["missing_value_strategy"]
computed_fill_value = preprocessing_parameters.get("computed_fill_value")
_handle_missing_values(dataset_cols, feature, missing_value_strategy, computed_fill_value, backend)
@DeveloperAPI
def handle_outliers(dataset_cols, feature, preprocessing_parameters: PreprocessingConfigDict, metadata, backend):
outlier_strategy = preprocessing_parameters.get("outlier_strategy")
if outlier_strategy is None:
return
outlier_threshold = preprocessing_parameters["outlier_threshold"]
computed_fill_value = preprocessing_parameters.get("computed_outlier_fill_value")
# Identify all outliers and set them to NA so they can be removed
series = dataset_cols[feature[COLUMN]]
dataset_cols[feature[COLUMN]] = series.mask(
series.sub(metadata["mean"]).div(metadata["std"]).abs().gt(outlier_threshold)
)
_handle_missing_values(dataset_cols, feature, outlier_strategy, computed_fill_value, backend)
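# Illustrative sketch (not part of the module): handle_outliers masks every value whose
# absolute z-score exceeds the threshold, then hands the masked column to the missing-value
# logic. The pure-Python helper below is a hypothetical stand-in for the pandas expression
# `series.mask(series.sub(mean).div(std).abs().gt(threshold))`, with None standing in for NA.
# Using the population standard deviation here is an assumption made for the example; the
# feature metadata may define `std` differently.

```python
import statistics


def mask_outliers(values, mean, std, threshold):
    """Replace values whose absolute z-score exceeds `threshold` with None."""
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [None if abs((v - mean) / std) > threshold else v for v in values]


example_values = [10.0, 11.0, 9.0, 10.0, 100.0]
example_mean = statistics.fmean(example_values)
example_std = statistics.pstdev(example_values)
masked_values = mask_outliers(example_values, example_mean, example_std, threshold=1.5)
```

# Only 100.0 lies more than 1.5 standard deviations from the mean of 28.0, so it is the
# only value masked; the configured outlier strategy would then fill or drop it.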
def _handle_missing_values(
dataset_cols, feature, missing_value_strategy: str, computed_fill_value: float | None, backend
):
if (
missing_value_strategy in {FILL_WITH_CONST, FILL_WITH_MODE, FILL_WITH_MEAN, FILL_WITH_FALSE, FILL_WITH_TRUE}
and computed_fill_value is not None
):
dataset_cols[feature[COLUMN]] = dataset_cols[feature[COLUMN]].fillna(
computed_fill_value,
)
elif missing_value_strategy in {BFILL, FFILL}:
if missing_value_strategy == BFILL:
dataset_cols[feature[COLUMN]] = dataset_cols[feature[COLUMN]].bfill()
else:
dataset_cols[feature[COLUMN]] = dataset_cols[feature[COLUMN]].ffill()
# If the first or last few rows of a dataset are NaN, they remain NaN after ffill or bfill is applied:
# ffill cannot fill leading NaNs and bfill cannot fill trailing NaNs. This causes downstream errors with
# Dask (https://github.com/ludwig-ai/ludwig/issues/2452). To get around this, apply the primary missing
# value strategy (say bfill) first, then follow up with the other strategy (ffill) so all NaNs are filled.
if backend.df_engine.compute(dataset_cols[feature[COLUMN]].isna().sum()) > 0:
if missing_value_strategy == FFILL:
dataset_cols[feature[COLUMN]] = dataset_cols[feature[COLUMN]].bfill()
else:
dataset_cols[feature[COLUMN]] = dataset_cols[feature[COLUMN]].ffill()
elif missing_value_strategy == DROP_ROW:
# Here we only drop from this series, but after preprocessing we'll do a second
# round of dropping NA values from the entire output dataframe, which will
# result in the removal of the rows.
len_before_dropped_rows = len(dataset_cols[feature[COLUMN]])
dataset_cols[feature[COLUMN]] = dataset_cols[feature[COLUMN]].dropna()
len_after_dropped_rows = len(dataset_cols[feature[COLUMN]])
if len_before_dropped_rows != len_after_dropped_rows:
logger.warning(
f"DROP_ROW missing value strategy applied. Dropped {len_before_dropped_rows - len_after_dropped_rows} "
f"samples out of {len_before_dropped_rows} from column {feature[COLUMN]}. The rows containing these "
f"samples will ultimately be dropped from the dataset."
)
else:
raise ValueError(f"Invalid missing value strategy {missing_value_strategy}")
def handle_features_with_prompt_config(
config: ModelConfigDict,
dataset_df: DataFrame,
features: list[FeatureConfigDict],
backend: Backend,
split_col: Series | None = None,
) -> dict[str, Series]:
"""Updates (in-place) dataset columns with prompt configurations containing a non-None task parameter.
Dataset columns that are updated here are enriched to have prompts as specified by the prompt configuration.
Args:
config: Model configuration.
dataset_df (DataFrame): Input dataset.
features (List[FeatureConfigDict]): List of feature configurations.
backend (Backend): Backend used for indexing and prompt formatting.
split_col (Optional[Series], optional): Split column. Defaults to None.
Returns:
Dict[str, Series]: Modified dataset columns.
"""
dataset_cols = {}
input_features, output_features = get_input_and_output_features(features)
for input_feature_config in input_features:
prompt_config = _get_prompt_config(config, input_feature_config)
if prompt_config is None:
continue
input_col_name = input_feature_config[COLUMN]
if prompt_config["retrieval"]["type"] is not None:
# Ensure that the output features are in the dataset columns saved as part of the index
# so that they can be retrieved later at lookup time.
output_feature_col_names = [output_feature_config[COLUMN] for output_feature_config in output_features]
input_and_output_col_names = set([input_col_name] + output_feature_col_names)
input_and_output_cols = {
feature[NAME]: dataset_df[feature[COLUMN]]
for feature in features
if feature[NAME] in input_and_output_col_names
}
retrieval_model, index_name = index_column(
prompt_config["retrieval"],
col_name=input_col_name,
dataset_cols=input_and_output_cols,
backend=backend,
split_col=split_col,
)
k = prompt_config["retrieval"]["k"]
# NOTE: after indexing the input column, we update the index_name in the prompt config IN PLACE.
# This ensures that the preprocessing parameters for this feature have an up-to-date index_name
# when the training set metadata is saved.
prompt_config["retrieval"]["index_name"] = index_name
else:
retrieval_model = None
k = -1
dataset_cols[input_col_name] = format_input_with_prompt(
input_col_name,
dataset_df,
backend,
prompt_config["task"],
retrieval_model=retrieval_model,
k=k,
template=prompt_config["template"],
)
return dataset_cols
def _get_prompt_config(config: ModelConfigDict, input_feature_config: dict) -> dict | None:
if input_feature_config[TYPE] != TEXT:
# Prompt config is only applied to text features
return None
preprocessing = input_feature_config["preprocessing"]
if _has_prompt_section(preprocessing):
return preprocessing["prompt"]
if _has_prompt_section(config):
return config["prompt"]
return None
def _has_prompt_section(config: dict) -> bool:
return "prompt" in config and (config["prompt"]["template"] is not None or config["prompt"]["task"] is not None)
def load_hdf5(hdf5_file_path, preprocessing_params, backend, split_data=True, shuffle_training=False):
# TODO dask: this needs to work with DataFrames
logger.info(f"Loading data from: {hdf5_file_path}")
def shuffle(df):
return df.sample(frac=1).reset_index(drop=True)
dataset = data_utils.load_hdf5(hdf5_file_path)
if not split_data:
if shuffle_training:
dataset = shuffle(dataset)
return dataset
training_set, validation_set, test_set = split_dataset(dataset, preprocessing_params, backend)
if shuffle_training:
training_set = shuffle(training_set)
return training_set, test_set, validation_set
def load_metadata(metadata_file_path: str) -> TrainingSetMetadataDict:
logger.info(f"Loading metadata from: {metadata_file_path}")
training_set_metadata = data_utils.load_json(metadata_file_path)
# TODO(travis): decouple config from training_set_metadata so we don't need to
# upgrade it over time.
training_set_metadata = upgrade_metadata(training_set_metadata)
return training_set_metadata
def drop_extra_cols(features, dfs):
# De-duplicate processed column names while preserving their original order
retain_cols = list(dict.fromkeys(feature[PROC_COLUMN] for feature in features))
return tuple(df[retain_cols] if df is not None else df for df in dfs)
def preprocess_for_training(
config,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
) -> tuple[Dataset, Dataset, Dataset, TrainingSetMetadataDict]:
"""Returns training, val and test datasets with training set metadata."""
# sanity check to make sure some data source is provided
if dataset is None and training_set is None:
raise ValueError("No training data is provided!")
# preload ludwig and HF datasets
dataset, training_set, validation_set, test_set = load_dataset_uris(
dataset, training_set, validation_set, test_set, backend
)
# determine data format if not provided or auto
if not data_format or data_format == "auto":
data_format = figure_data_format(dataset, training_set, validation_set, test_set)
# Wrap dataset into a form we can use to manage within the cache
dataset = wrap(dataset)
training_set = wrap(training_set)
validation_set = wrap(validation_set)
test_set = wrap(test_set)
try:
lock_path = backend.cache.get_cache_directory(dataset)
except (TypeError, ValueError):
lock_path = None
with file_lock(lock_path, lock_file=".lock_preprocessing"):
# if training_set_metadata is a string, assume it's a path to load the json
training_set_metadata = training_set_metadata or {}
if training_set_metadata and isinstance(training_set_metadata, str):
training_set_metadata = load_metadata(training_set_metadata)
# setup
features = config["input_features"] + config["output_features"]
# in case data_format is one of the cacheable formats,
# check if there's a cached hdf5 file with the same name,
# and in case move on with the hdf5 branch.
cached = False
cache = backend.cache.get_dataset_cache(config, dataset, training_set, test_set, validation_set)
# Unwrap dataset into the form used for preprocessing
dataset = dataset.unwrap() if dataset is not None else None
training_set = training_set.unwrap() if training_set is not None else None
validation_set = validation_set.unwrap() if validation_set is not None else None
test_set = test_set.unwrap() if test_set is not None else None
if data_format in CACHEABLE_FORMATS:
with backend.storage.cache.use_credentials():
# cache.get() returns valid indicating if the checksum for the current config
# is equal to that from the cached training set metadata, as well as the paths to the
# cached training set metadata, training set, validation_set, test set
cache_results = cache.get()
if cache_results is not None:
valid, *cache_values = cache_results
if valid:
logger.info(_get_cache_hit_message(cache))
training_set_metadata, training_set, test_set, validation_set = cache_values
config["data_hdf5_fp"] = training_set
data_format = backend.cache.data_format
cached = True
dataset = None
else:
logger.info(
"Found cached dataset and meta.json with the same filename "
"as the dataset, but the checksums don't match. "
"If saving of processed input is not skipped, "
"they will be overwritten."
)
cache.delete()
else:
logger.info(
f"No cached dataset found at {cache.get_cached_obj_path('training')}. "
"Preprocessing the dataset."
)
training_set_metadata[CHECKSUM] = cache.checksum
data_format_processor = get_from_registry(data_format, data_format_preprocessor_registry)
if cached or data_format == "hdf5":
with backend.storage.cache.use_credentials():
# Always interpret hdf5 files as preprocessed, even if missing from the cache
processed = data_format_processor.prepare_processed_data(
features,
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
)
training_set, test_set, validation_set, training_set_metadata = processed
else:
processed = data_format_processor.preprocess_for_training(
config,
features,
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
skip_save_processed_input=skip_save_processed_input,
preprocessing_params=preprocessing_params,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
training_set, test_set, validation_set, training_set_metadata = processed
processed = (training_set, test_set, validation_set, training_set_metadata)
# cache the dataset
if backend.cache.can_cache(skip_save_processed_input):
with backend.storage.cache.use_credentials():
logger.debug("cache processed data")
processed = cache.put(*processed)
# set cached=True to ensure credentials are used correctly below
cached = True
training_set, test_set, validation_set, training_set_metadata = processed
with backend.storage.cache.use_credentials() if cached else contextlib.nullcontext():
logger.debug("create training dataset")
training_dataset = backend.dataset_manager.create(training_set, config, training_set_metadata)
training_set_size = len(training_dataset)
if training_set_size == 0:
raise ValueError("Training data is empty following preprocessing.")
elif training_set_size < MIN_DATASET_SPLIT_ROWS:
raise ValueError(
f"Training dataset has only {training_set_size} rows following preprocessing, need"
f" at least {MIN_DATASET_SPLIT_ROWS} to compute metrics."
)
validation_dataset = None
if validation_set is not None:
logger.debug("create validation dataset")
validation_dataset = backend.dataset_manager.create(validation_set, config, training_set_metadata)
validation_set_size = len(validation_dataset)
if validation_set_size == 0:
logger.warning(
"Validation set empty. If this is unintentional, please check the preprocessing configuration."
)
validation_dataset = None
elif validation_set_size < MIN_DATASET_SPLIT_ROWS:
logger.warning(
f"Validation set too small to compute metrics. Need at least {MIN_DATASET_SPLIT_ROWS} rows, got"
f" {validation_set_size} after preprocessing."
)
test_dataset = None
if test_set is not None:
logger.debug("create test dataset")
test_dataset = backend.dataset_manager.create(test_set, config, training_set_metadata)
test_set_size = len(test_dataset)
if test_set_size == 0:
logger.warning(
"Test set empty. If this is unintentional, please check the preprocessing configuration."
)
test_dataset = None
elif test_set_size < MIN_DATASET_SPLIT_ROWS:
logger.warning(
f"Test set too small to compute metrics. Need at least {MIN_DATASET_SPLIT_ROWS} rows, got"
f" {test_set_size} after preprocessing."
)
return (training_dataset, validation_dataset, test_dataset, training_set_metadata)
def _preprocess_file_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
read_fn=read_csv,
skip_save_processed_input=False,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
"""Method to pre-process csv data.
:param features: list of all features (input + output)
:param dataset: path to the data
:param training_set: training data
:param validation_set: validation data
:param test_set: test data
:param training_set_metadata: train set metadata
:param skip_save_processed_input: if False, the pre-processed data is saved as .hdf5 files in the same location as
the csv files with the same names.
:param preprocessing_params: preprocessing parameters
:param random_seed: random seed
:return: training, test, validation datasets, training metadata
"""
if dataset:
# Use data and ignore _train, _validation and _test.
# Also ignore data and train set metadata needs preprocessing
logger.info("Using full raw dataset, no hdf5 or json file with the same name has been found")
logger.info("Building dataset (it may take a while)")
dataset_df = read_fn(dataset, backend.df_engine.df_lib)
training_set_metadata[SRC] = dataset
data, training_set_metadata = build_dataset(
config,
dataset_df,
features,
preprocessing_params,
mode="training",
metadata=training_set_metadata,
backend=backend,
random_seed=random_seed,
skip_save_processed_input=skip_save_processed_input,
callbacks=callbacks,
)
elif training_set:
# use data_train (including _validation and _test if they are present)
# and ignore data and train set metadata
# needs preprocessing
logger.info("Using training raw csv, no hdf5 or json file with the same name has been found")
logger.info("Building dataset (it may take a while)")
concatenated_df = concatenate_files(training_set, validation_set, test_set, read_fn, backend)
training_set_metadata[SRC] = training_set
# Data is pre-split.
preprocessing_params = set_fixed_split(preprocessing_params)
data, training_set_metadata = build_dataset(
config,
concatenated_df,
features,
preprocessing_params,
mode="training",
metadata=training_set_metadata,
backend=backend,
random_seed=random_seed,
callbacks=callbacks,
)
else:
raise ValueError("either dataset or training_set must not be None")
logger.debug("split train-val-test")
training_data, validation_data, test_data = drop_extra_cols(
features, split_dataset(data, preprocessing_params, backend, random_seed)
)
if dataset and backend.is_coordinator() and not skip_save_processed_input:
logger.debug("writing split file")
splits_df = concatenate_splits(training_data, validation_data, test_data, backend)
split_fp = get_split_path(dataset or training_set)
try:
backend.df_engine.to_parquet(splits_df, split_fp, index=True)
except Exception as e:
logger.warning(
f"Encountered error: '{e}' while writing data to parquet during saving preprocessed data. "
"Skipping saving processed data."
)
logger.info("Building dataset: DONE")
if preprocessing_params["oversample_minority"] or preprocessing_params["undersample_majority"]:
training_data = balance_data(
training_data, config["output_features"], preprocessing_params, backend, random_seed
)
return training_data, test_data, validation_data, training_set_metadata
def _preprocess_df_for_training(
config,
features,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
preprocessing_params=default_training_preprocessing_parameters,
backend=LOCAL_BACKEND,
random_seed=default_random_seed,
callbacks=None,
):
"""Method to pre-process dataframes.
This doesn't have the option to save the processed data as hdf5 as we don't expect users to do this as the data can
be processed in memory
"""
if dataset is not None:
# needs preprocessing
logger.info("Using full dataframe")
elif training_set is not None:
# needs preprocessing
logger.info("Using training dataframe")
dataset = concatenate_df(training_set, validation_set, test_set, backend)
# Data is pre-split.
preprocessing_params = set_fixed_split(preprocessing_params)
logger.info("Building dataset (it may take a while)")
data, training_set_metadata = build_dataset(
config,
dataset,
features,
preprocessing_params,
mode="training",
metadata=training_set_metadata,
random_seed=random_seed,
backend=backend,
callbacks=callbacks,
)
logger.debug("split train-val-test")
training_set, validation_set, test_set = drop_extra_cols(
features, split_dataset(data, preprocessing_params, backend, random_seed)
)
logger.info("Building dataset: DONE")
if preprocessing_params["oversample_minority"] or preprocessing_params["undersample_majority"]:
training_set = balance_data(training_set, config["output_features"], preprocessing_params, backend, random_seed)
return training_set, test_set, validation_set, training_set_metadata
def preprocess_for_prediction(
config,
dataset,
training_set_metadata=None,
data_format=None,
split=FULL,
include_outputs=True,
backend=LOCAL_BACKEND,
callbacks=None,
):
"""Preprocesses the dataset to parse it into a format that is usable by the Ludwig core.
Args:
config: Config dictionary corresponding to Ludwig Model
dataset: Dataset to be processed
training_set_metadata: Train set metadata for the input features
data_format: Format of the data
split: The split of dataset to return
include_outputs: Whether to include outputs
backend: Type of backend to use for preprocessing
callbacks: Any callbacks passed in
Returns:
Processed dataset along with updated training set metadata
"""
# Sanity Check to make sure some data source is provided
if dataset is None:
raise ValueError("No dataset is provided!")
if isinstance(dataset, Dataset):
return dataset, training_set_metadata
# preload ludwig and HF datasets
dataset, _, _, _ = load_dataset_uris(dataset, None, None, None, backend)
# determine data format if not provided or auto
if not data_format or data_format == "auto":
data_format = figure_data_format(dataset)
# manage the in_memory parameter
if data_format not in HDF5_FORMATS:
num_overrides = override_in_memory_flag(config["input_features"], True)
if num_overrides > 0:
logger.warning(f"Using in_memory = False is not supported with {data_format} data format.")
preprocessing_params = {}
config_defaults = config.get(DEFAULTS, {})
for feature_type in config_defaults:
preprocessing_params[feature_type] = config_defaults[feature_type].get(PREPROCESSING, {})
preprocessing_params[SPLIT] = config.get(PREPROCESSING, {}).get(SPLIT, {})
preprocessing_params = merge_dict(default_prediction_preprocessing_parameters, preprocessing_params)
# if training_set_metadata is a string, assume it's a path to load the json
if training_set_metadata and isinstance(training_set_metadata, str):
training_set_metadata = load_metadata(training_set_metadata)
# setup
output_features = []
if include_outputs:
output_features += config["output_features"]
features = config["input_features"] + output_features
# Check the cache for an already preprocessed dataset. This only
# applies to scenarios where the user wishes to predict on a split
# of the full dataset, where we preprocess the whole dataset together
# during training. If the user wishes to predict on the full dataset,
# it is assumed they are predicting on unseen data. This is done
# because the cached data is stored in its split form, and would be
# expensive to recombine, requiring further caching.
cached = False
dataset = wrap(dataset)
cache = backend.cache.get_dataset_cache(config, dataset)
dataset = dataset.unwrap()
training_set = test_set = validation_set = None
if data_format in CACHEABLE_FORMATS and split != FULL:
with backend.storage.cache.use_credentials():
cache_results = cache.get()
if cache_results is not None:
valid, *cache_values = cache_results
if valid:
logger.info(_get_cache_hit_message(cache))
training_set_metadata, training_set, test_set, validation_set = cache_values
config["data_hdf5_fp"] = training_set
data_format = backend.cache.data_format
cached = True
data_format_processor = get_from_registry(data_format, data_format_preprocessor_registry)
if cached:
with backend.storage.cache.use_credentials():
processed = data_format_processor.prepare_processed_data(
features,
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
preprocessing_params=preprocessing_params,
backend=backend,
)
training_set, test_set, validation_set, training_set_metadata = processed
else:
processed = data_format_processor.preprocess_for_prediction(
config, dataset, features, preprocessing_params, training_set_metadata, backend, callbacks
)
dataset, training_set_metadata, new_hdf5_fp = processed
training_set_metadata = training_set_metadata.copy()
if new_hdf5_fp:
training_set_metadata[DATA_TRAIN_HDF5_FP] = new_hdf5_fp
if split != FULL:
logger.debug("split train-val-test")
training_set, validation_set, test_set = drop_extra_cols(
features, split_dataset(dataset, preprocessing_params, backend)
)
if split == TRAINING:
dataset = training_set
elif split == VALIDATION:
dataset = validation_set
elif split == TEST:
dataset = test_set
config = {
**config,
"output_features": output_features,
}
with backend.storage.cache.use_credentials() if cached else contextlib.nullcontext():
dataset = backend.dataset_manager.create(
dataset,
config,
training_set_metadata,
)
return dataset, training_set_metadata
def _get_cache_hit_message(cache: DatasetCache) -> str:
return (
"Found cached dataset and meta.json with the same filename of the dataset.\n"
"Using cached values instead of preprocessing the dataset again.\n"
f"- Cached training set metadata path: {cache.get_cached_obj_path(META)}\n"
f"- Cached training set path: {cache.get_cached_obj_path(TRAINING)}\n"
f"- Cached validation set path: {cache.get_cached_obj_path(VALIDATION)}\n"
f"- Cached test set path: {cache.get_cached_obj_path(TEST)}"
)
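The ffill/bfill fallback described in `_handle_missing_values` above can be illustrated with a minimal sketch. It uses plain Python lists with `None` as the missing marker instead of pandas or Dask Series, and the helper names (`ffill`, `bfill`, `fill_missing`) are hypothetical stand-ins, not part of Ludwig:

```python
def ffill(values):
    """Forward fill: replace each None with the most recent non-None value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def bfill(values):
    """Backward fill: forward fill applied to the reversed sequence."""
    return ffill(values[::-1])[::-1]


def fill_missing(values, strategy):
    """Apply the primary strategy, then the opposite one if gaps remain."""
    primary, secondary = (ffill, bfill) if strategy == "ffill" else (bfill, ffill)
    filled = primary(values)
    # Leading gaps survive ffill and trailing gaps survive bfill,
    # so a second pass in the other direction closes them.
    if any(v is None for v in filled):
        filled = secondary(filled)
    return filled


column = [None, 1.0, None, 3.0, None]
ffilled = fill_missing(column, "ffill")
bfilled = fill_missing(column, "bfill")
```

With a gap at both ends, either strategy alone leaves one missing value; the second pass is what guarantees a fully populated column as long as at least one value is present.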
================================================
FILE: ludwig/data/prompt.py
================================================
import json
import logging
import os
import string
from typing import Any, TYPE_CHECKING
import pandas as pd
if TYPE_CHECKING:
from ludwig.backend.base import Backend
from ludwig.models.retrieval import df_checksum, get_retrieval_model, RetrievalModel
from ludwig.utils.fs_utils import get_default_cache_location, makedirs, path_exists
from ludwig.utils.types import DataFrame, Series
logger = logging.getLogger(__name__)
CONTEXT = "__context__"
SAMPLE = "__sample__"
TASK = "__task__"
DEFAULT_ZERO_SHOT_PROMPT_TEMPLATE = """SAMPLE INPUT: {__sample__}
USER: Complete the following task: {__task__}
ASSISTANT:
"""
DEFAULT_FEW_SHOT_PROMPT_TEMPLATE = """Below is relevant context:
CONTEXT: {__context__}
CONTEXT is comprised of labeled samples whose embeddings were similar to that of the sample input. The labels in
these samples could aid you in your final prediction. Given this and no prior knowledge, follow the instructions
below.
SAMPLE INPUT: {__sample__}
USER: Complete the following task: {__task__}
ASSISTANT:
"""
def index_column(
retrieval_config: dict[str, Any],
col_name: str,
dataset_cols: dict[str, Series],
backend: "Backend",
split_col: Series | None = None,
) -> tuple[RetrievalModel, str]:
"""Indexes a column for sample retrieval via embedding index lookup.
This function indexes a column and saves the index artifact to disk. If an index name is provided as part of the
`retrieval_config`, then the index in the ludwig cache with the corresponding name will be loaded instead of being
built from scratch.
To prevent data leakage, a split column must be provided. This ensures that the retrieval model only ever fetches
samples from the training set.
To ensure that the index is usable even if the original DataFrame is not available, the columns used to build the
index are stored as part of the index.
All operations in this function are performed on pandas objects, which means that you may run out of memory if your
dataset is large.
Args:
retrieval_config (Dict[str, Any]): The retrieval config from the config object.
col_name (str): The name of the column to index.
dataset_cols (Dict[str, Series]): A dictionary mapping column names to their corresponding Series. `col_name`
must be a key in this dictionary. These columns are stored as part of the index to ensure that the index
is usable even if the original DataFrame is not available.
backend (Backend): The backend whose DataFrame engine is used to compute the columns into pandas objects.
split_col (Optional[Series]): A column that indicates whether a sample is part of the training set. A sample
is in the training set if the value in this column is 0.
Returns:
Tuple[RetrievalModel, str]: A tuple containing the retrieval model and the name of the index.
"""
retrieval_model = get_retrieval_model(
retrieval_config["type"],
model_name=retrieval_config["model_name"],
)
index_name = retrieval_config["index_name"]
index_cache_directory = os.path.join(get_default_cache_location(), "index")
if not path_exists(index_cache_directory):
makedirs(index_cache_directory, exist_ok=True)
if index_name is None:
if split_col is None:
raise ValueError("split column must be provided if using retrieval")
split_col = backend.df_engine.compute(split_col).astype(int)
# TODO(geoffrey): add support for Dask DataFrames
df = pd.DataFrame({name: backend.df_engine.compute(col) for name, col in dataset_cols.items()})
df = df[split_col == 0] # Ensures that the index is only built on the training set
# Even if index name is not provided, we still want to check if an index for this df already exists in cache
# If it does, load it and return immediately
index_hash = df_checksum(df)
index_name = f"embedding_index_{index_hash}"
if path_exists(os.path.join(index_cache_directory, index_name)):
logger.info(
f"Index for this DataFrame with name '{index_name}' already exists. "
f"Loading index from '{index_cache_directory}'"
)
retrieval_model.load_index(index_name, cache_directory=index_cache_directory)
return retrieval_model, index_name
# Build index if index name is not provided and index for this df does not already exist in cache
retrieval_model.create_dataset_index(df, backend, columns_to_index=[col_name])
logger.info(f"Saving index to cache directory '{index_cache_directory}' with name '{index_name}'")
retrieval_model.save_index(index_name, cache_directory=index_cache_directory)
else:
logger.info(f"Loading index from cache directory '{index_cache_directory}' with name '{index_name}'")
retrieval_model.load_index(index_name, cache_directory=index_cache_directory)
return retrieval_model, index_name
def format_input_with_prompt(
input_col_name: str,
dataset_df: DataFrame,
backend: "Backend",
task_str: str,
retrieval_model: RetrievalModel | None = None,
k: int = -1,
template: str | None = None,
) -> Series:
"""Returns a new Series with the input column data formatted with the prompt.
A prompt can either be zero-shot or few-shot. A zero-shot prompt is comprised of some (unlabeled) input and a task
to be completed given the input. A few-shot prompt additionally includes some dynamically retrieved context, which
is retrieved using the `retrieval_model.search` function.
A template can be provided to customize the prompt. The template must be a string with the following fields:
- __sample__ or at least one column from the input dataset: The input sample.
- __context__: The context retrieved by `retrieval_model.search`. Only required if `retrieval_model` is provided.
- __task__: The task to be completed given the input. Only required if `task` is set in the prompt config.
Zero-shot example:
Before formatting:
input_col = ["I am happy"]
task_str = "sentiment analysis"
After formatting:
input_col = ["SAMPLE INPUT: I am happy\n\nUSER: Complete the following task: sentiment analysis\n\nASSISTANT:"]
Args:
input_col_name (str): The name of the input column.
dataset_df (DataFrame): The input dataset.
backend (Backend): The backend used for map operations.
task_str (str): The task to be completed given the input.
retrieval_model (Optional[RetrievalModel]): The retrieval model used to retrieve context. If provided, the
prompt will be few-shot. If not provided, the prompt will be zero-shot.
k (int): The number of samples to retrieve. Only required if `retrieval_model` is provided.
template (Optional[str]): The template to use for the prompt. If not provided, the default will be used.
Returns:
Series: A new Series with the input column data formatted with the prompt.
"""
# determine if this is a few-shot or zero-shot prompt
# few-shot prompts require a search function that returns samples from some dataset
is_few_shot = retrieval_model is not None
# if no template is provided, use the default template
if template is None:
if is_few_shot:
template = DEFAULT_FEW_SHOT_PROMPT_TEMPLATE
else:
template = DEFAULT_ZERO_SHOT_PROMPT_TEMPLATE
# ensure that the prompt template has all required fields
template_fields, field_to_dtype = _get_template_fields(template)
try:
_validate_prompt_template(template_fields, task_str, is_few_shot, dataset_df.columns, input_col_name)
except ValueError as e:
raise ValueError(f"template invalid for {'few-shot' if is_few_shot else 'zero-shot'} prompt: {e}") from e
def generate_prompt(df: pd.DataFrame):
if CONTEXT in template_fields:
df[CONTEXT] = retrieval_model.search(df, backend, k=k, return_data=True)
if SAMPLE in template_fields:
# During preprocessing, we're inserting quotes that change the token IDs completely if we
# don't remove the " from the string. For parity with expected user output, we need to get rid of them.
# TODO(Arnav): see if there's a way to only remove them if the entry doesn't have quotes. This currently
# removes all " from the string (even those not added by json.dumps), which is not ideal.
df[SAMPLE] = df[input_col_name].map(lambda entry: json.dumps(entry, indent=2).strip('"'))
if TASK in template_fields:
df[TASK] = task_str
def generate_prompt_for_row(row):
kwargs = {col: field_to_dtype[col](row[col]) for col in template_fields}
return template.format(**kwargs)
return df.apply(generate_prompt_for_row, axis=1)
result = backend.df_engine.map_partitions(dataset_df, generate_prompt, meta=(input_col_name, "object"))
result = backend.df_engine.persist(result) # persist to prevent re-computation
return result
def _validate_prompt_template(
template_fields: set[str], task: str | None, is_few_shot: bool, columns: list[str], input_col_name: str
):
"""Validates that the template contains the necessary fields for the prompt."""
if is_few_shot and CONTEXT not in template_fields:
raise ValueError(f"Prompt template must contain the '{CONTEXT}' field for few-shot learning")
if task is not None and TASK not in template_fields:
raise ValueError(f"Prompt template must contain the '{TASK}' field if a task is provided")
if SAMPLE in template_fields:
if input_col_name not in columns:
raise ValueError(
f"Prompt template contains the '{SAMPLE}' field, "
f"but the input column '{input_col_name}' is not in the dataset"
)
elif not any(col in template_fields for col in columns):
raise ValueError(
f"Prompt template must contain either the '{SAMPLE}' field or one of the columns from the dataset"
)
def _get_template_fields(template: str) -> tuple[set[str], dict[str, type]]:
"""Returns the set of field names in the template and a map from each field to the type it is cast to."""
parsed = [t for t in string.Formatter().parse(template) if t[1] is not None]
field_set = {field for _, field, _, _ in parsed}
dtype_map = {field: _get_dtype(format_spec) for _, field, format_spec, _ in parsed}
return field_set, dtype_map
def _get_dtype(format_spec: str) -> type:
# We need to prepare data in the row for different formatting options.
# If you have a number like 0.1234 in the DF and you want to format it like {number:.2f} it will fail if the
# number is represented as a string in the DF. So we need to cast it to a float before formatting.
if not format_spec:
return str
if "f" in format_spec:
return float
raise ValueError(f"Unsupported template format spec: {format_spec}")
================================================
FILE: ludwig/data/sampler.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import math
import numpy as np
from ludwig.distributed import DistributedStrategy
from ludwig.utils.defaults import default_random_seed
class DistributedSampler:
"""Adapted from `torch.utils.data.distributed.DistributedSampler`."""
def __init__(
self,
dataset_size: int,
shuffle: bool = True,
random_seed: int = default_random_seed,
distributed: DistributedStrategy | None = None,
):
self.dataset_size = dataset_size
self.num_replicas = distributed.size() if distributed else 1
self.rank = distributed.rank() if distributed else 0
self.epoch = 0
self.num_samples = int(math.ceil(self.dataset_size * 1.0 / self.num_replicas))
self.total_size = self.num_samples * self.num_replicas
self.shuffle = shuffle
self.random_seed = random_seed
def __iter__(self):
if self.shuffle:
# deterministically shuffle based on epoch and seed
indices = np.random.RandomState(seed=self.random_seed + self.epoch).permutation(self.dataset_size).tolist()
else:
indices = list(range(self.dataset_size))
# add extra samples to make it evenly divisible
indices += indices[: (self.total_size - len(indices))]
assert len(indices) == self.total_size
# subsample
indices = indices[self.rank : self.total_size : self.num_replicas]
assert len(indices) == self.num_samples
return iter(indices)
def __len__(self):
return self.num_samples
def set_epoch(self, epoch):
"""Sets the epoch for this sampler.
When `shuffle=True`, this ensures all replicas use a different random ordering for each epoch. Otherwise, the
next iteration of this sampler will yield the same ordering.
:param epoch: (int) epoch number
"""
self.epoch = epoch
================================================
FILE: ludwig/data/split.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING
from zlib import crc32
import numpy as np
from sklearn.model_selection import train_test_split
from ludwig.api_annotations import DeveloperAPI
from ludwig.backend.base import Backend
from ludwig.constants import BINARY, CATEGORY, DATE, MIN_DATASET_SPLIT_ROWS, SPLIT
from ludwig.error import ConfigValidationError
from ludwig.schema.split import (
DateTimeSplitConfig,
FixedSplitConfig,
HashSplitConfig,
RandomSplitConfig,
StratifySplitConfig,
)
from ludwig.types import ModelConfigDict, PreprocessingConfigDict
from ludwig.utils.data_utils import hash_dict, split_dataset_ttv
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.registry import Registry
from ludwig.utils.types import DataFrame
if TYPE_CHECKING:
from ludwig.schema.model_config import ModelConfig
split_registry = Registry()
logger = logging.getLogger(__name__)
TMP_SPLIT_COL = "__SPLIT__"
DEFAULT_PROBABILITIES = (0.7, 0.1, 0.2)
class Splitter(ABC):
@abstractmethod
def split(
self, df: DataFrame, backend: Backend, random_seed: int = default_random_seed
) -> tuple[DataFrame, DataFrame, DataFrame]:
pass
def validate(self, config: ModelConfigDict):
pass
def has_split(self, split_index: int) -> bool:
return True
@property
def required_columns(self) -> list[str]:
"""Returns the list of columns that are required for splitting."""
return []
def _make_divisions_ensure_minimum_rows(
divisions: list[int],
n_examples: int,
min_val_rows: int = MIN_DATASET_SPLIT_ROWS,
min_test_rows: int = MIN_DATASET_SPLIT_ROWS,
) -> list[int]:
"""Revises divisions to ensure no dataset split has too few examples."""
result = list(divisions)
# Number of examples in each split. `divisions` may arrive as a tuple or a list, so normalize to a tuple first.
n = [dn - dm for dm, dn in zip((0,) + tuple(divisions), tuple(divisions) + (n_examples,))]
if 0 < n[2] < min_test_rows and n[0] > 0:
# Test set is nonempty but too small, take examples from training set.
shift = min(min_test_rows - n[2], n[0])
result = [d - shift for d in result]
if 0 < n[1] < min_val_rows and n[0] > 0:
# Validation set is nonempty but too small, take examples from training set.
result[0] -= min(min_val_rows - n[1], result[0])
return result
def _split_divisions_with_min_rows(n_rows: int, probabilities: list[float]) -> list[int]:
"""Generates splits for a dataset of n_rows into train, validation, and test sets according to split
probabilities, also ensuring that at least min_val_rows or min_test_rows are present in each nonempty split.
Returns division indices to split on.
"""
d1 = int(np.ceil(probabilities[0] * n_rows))
if probabilities[-1] > 0:
n2 = int(probabilities[1] * n_rows)
d2 = d1 + n2
else:
# If the last probability is 0, then use the entire remaining dataset for validation.
d2 = n_rows
return _make_divisions_ensure_minimum_rows((d1, d2), n_rows)
@split_registry.register("random", default=True)
class RandomSplitter(Splitter):
def __init__(self, probabilities: list[float] = DEFAULT_PROBABILITIES, **kwargs):
self.probabilities = probabilities
def split(
self, df: DataFrame, backend: Backend, random_seed: int = default_random_seed
) -> tuple[DataFrame, DataFrame, DataFrame]:
probabilities = self.probabilities
if not backend.df_engine.partitioned:
divisions = _split_divisions_with_min_rows(len(df), probabilities)
shuffled_df = df.sample(frac=1, random_state=random_seed)
return (
shuffled_df.iloc[: divisions[0]], # Train
shuffled_df.iloc[divisions[0] : divisions[1]], # Validation
shuffled_df.iloc[divisions[1] :], # Test
)
# The above approach is very inefficient for partitioned backends, which can split by partition.
# This does not give exact guarantees on split size but is much more efficient for large datasets.
return df.random_split(self.probabilities, random_state=random_seed)
def has_split(self, split_index: int) -> bool:
return self.probabilities[split_index] > 0
@staticmethod
def get_schema_cls():
return RandomSplitConfig
@split_registry.register("fixed")
class FixedSplitter(Splitter):
def __init__(self, column: str = SPLIT, **kwargs):
self.column = column
def split(
self, df: DataFrame, backend: Backend, random_seed: int = default_random_seed
) -> tuple[DataFrame, DataFrame, DataFrame]:
df[self.column] = df[self.column].astype(np.int8)
dfs = split_dataset_ttv(df, self.column)
train, test, val = tuple(df.drop(columns=self.column) if df is not None else None for df in dfs)
return train, val, test
@property
def required_columns(self) -> list[str]:
return [self.column]
@staticmethod
def get_schema_cls():
return FixedSplitConfig
def stratify_split_dataframe(
df: DataFrame, column: str, probabilities: list[float], backend: Backend, random_seed: int
) -> tuple[DataFrame, DataFrame, DataFrame]:
"""Splits a dataframe into train, validation, and test sets based on the values of a column.
The column must be categorical (including binary). The split is stratified, meaning that the proportion of each
category in each split is the same as in the original dataset.
"""
frac_train, frac_val, frac_test = probabilities
def _safe_stratify(df, column, test_size):
# Get the examples with cardinality of 1
df_cardinalities = df.groupby(column)[column].size()
low_cardinality_elems = df_cardinalities.loc[lambda x: x == 1]
df_low_card = df[df[column].isin(low_cardinality_elems.index)]
df = df[~df[column].isin(low_cardinality_elems.index)]
y = df[[column]]
df_train, df_temp, _, _ = train_test_split(df, y, stratify=y, test_size=test_size, random_state=random_seed)
# concat the examples with cardinality of 1 to the training DF.
if len(df_low_card.index) > 0:
df_train = backend.df_engine.concat([df_train, df_low_card])
return df_train, df_temp
df_train, df_temp = _safe_stratify(df, column, 1.0 - frac_train)
relative_frac_test = frac_test / (frac_val + frac_test)
df_val, df_test = _safe_stratify(df_temp, column, relative_frac_test)
return df_train, df_val, df_test
@split_registry.register("stratify")
class StratifySplitter(Splitter):
def __init__(self, column: str, probabilities: list[float] = DEFAULT_PROBABILITIES, **kwargs):
self.column = column
self.probabilities = probabilities
def split(
self, df: DataFrame, backend: Backend, random_seed: int = default_random_seed
) -> tuple[DataFrame, DataFrame, DataFrame]:
if not backend.df_engine.partitioned:
return stratify_split_dataframe(df, self.column, self.probabilities, backend, random_seed)
# For a partitioned dataset, we can stratify split each partition individually
# to obtain a global stratified split.
def split_partition(partition: DataFrame) -> DataFrame:
"""Splits a single partition into train, val, test.
Returns a single DataFrame with the split column populated. Assumes that the split column is already present
in the partition and has a default value of 0 (train).
"""
partition = partition.copy()
_, val, test = stratify_split_dataframe(partition, self.column, self.probabilities, backend, random_seed)
# Split column defaults to train, so only need to update val and test
partition.loc[val.index, TMP_SPLIT_COL] = 1
partition.loc[test.index, TMP_SPLIT_COL] = 2
return partition
df[TMP_SPLIT_COL] = 0
df = backend.df_engine.map_partitions(df, split_partition, meta=df)
df_train = df[df[TMP_SPLIT_COL] == 0].drop(columns=TMP_SPLIT_COL)
df_val = df[df[TMP_SPLIT_COL] == 1].drop(columns=TMP_SPLIT_COL)
df_test = df[df[TMP_SPLIT_COL] == 2].drop(columns=TMP_SPLIT_COL)
return df_train, df_val, df_test
def validate(self, config: "ModelConfig"): # noqa: F821
features = [f for f in config.input_features] + [f for f in config.output_features]
feature_cols = {f.column for f in features}
if self.column not in feature_cols:
logger.info(
f"Stratify column {self.column} is not among the features. "
f"Cannot establish if it is a binary or category feature."
)
elif [f for f in features if f.column == self.column][0].type not in {BINARY, CATEGORY}:
raise ConfigValidationError(f"Feature for stratify column {self.column} must be binary or category")
def has_split(self, split_index: int) -> bool:
return self.probabilities[split_index] > 0
@property
def required_columns(self) -> list[str]:
return [self.column]
@staticmethod
def get_schema_cls():
return StratifySplitConfig
@split_registry.register("datetime")
class DatetimeSplitter(Splitter):
def __init__(
self,
column: str,
probabilities: list[float] = DEFAULT_PROBABILITIES,
datetime_format: str | None = None,
fill_value: str = "",
**kwargs,
):
self.column = column
self.probabilities = probabilities
self.datetime_format = datetime_format
self.fill_value = fill_value
def split(
self, df: DataFrame, backend: Backend, random_seed: int = default_random_seed
) -> tuple[DataFrame, DataFrame, DataFrame]:
# In case the split column was preprocessed by Ludwig into a list, convert it back to a
# datetime string for the sort and split
def list_to_date_str(x):
if not isinstance(x, list):
if not isinstance(x, str):
# Convert timestamps, etc. to strings and return them so they can be cast directly to epoch time
return str(x)
if len(x) != 9:
# Strings not in the expected format, so assume it's a formatted datetime and return
return x
return f"{x[0]}-{x[1]}-{x[2]} {x[5]}:{x[6]}:{x[7]}"
df[TMP_SPLIT_COL] = backend.df_engine.map_objects(df[self.column], list_to_date_str)
# Convert datetime to int64 to workaround Dask limitation
# https://github.com/dask/dask/issues/9003
df[TMP_SPLIT_COL] = backend.df_engine.df_lib.to_datetime(df[TMP_SPLIT_COL]).values.astype("int64")
# Sort by ascending datetime and drop the temporary column
df = df.sort_values(TMP_SPLIT_COL).drop(columns=TMP_SPLIT_COL)
# Split using different methods based on the underlying df engine.
# For Pandas, split by row index.
# For Dask, split by partition, as splitting by row is very inefficient.
return tuple(backend.df_engine.split(df, self.probabilities))
def validate(self, config: "ModelConfig"): # noqa: F821
features = [f for f in config.input_features] + [f for f in config.output_features]
feature_cols = {f.column for f in features}
if self.column not in feature_cols:
logger.info(
f"Datetime split column {self.column} is not among the features. "
f"Cannot establish if it is a valid datetime."
)
elif [f for f in features if f.column == self.column][0].type not in {DATE}:
raise ConfigValidationError(f"Feature for datetime split column {self.column} must be a datetime")
def has_split(self, split_index: int) -> bool:
return self.probabilities[split_index] > 0
@property
def required_columns(self) -> list[str]:
return [self.column]
@staticmethod
def get_schema_cls():
return DateTimeSplitConfig
@split_registry.register("hash")
class HashSplitter(Splitter):
def __init__(
self,
column: str,
probabilities: list[float] = DEFAULT_PROBABILITIES,
**kwargs,
):
self.column = column
self.probabilities = probabilities
def split(
self, df: DataFrame, backend: Backend, random_seed: int = default_random_seed
) -> tuple[DataFrame, DataFrame, DataFrame]:
# Maximum value of the hash function crc32
max_value = 2**32
thresholds = [v * max_value for v in self.probabilities]
def hash_column(x):
value = hash_dict({"value": x}, max_length=None)
hash_value = crc32(value)
if hash_value < thresholds[0]:
return 0
elif hash_value < (thresholds[0] + thresholds[1]):
return 1
else:
return 2
df[TMP_SPLIT_COL] = backend.df_engine.map_objects(df[self.column], hash_column).astype(np.int8)
dfs = split_dataset_ttv(df, TMP_SPLIT_COL)
train, test, val = tuple(df.drop(columns=TMP_SPLIT_COL) if df is not None else None for df in dfs)
return train, val, test
def has_split(self, split_index: int) -> bool:
return self.probabilities[split_index] > 0
@property
def required_columns(self) -> list[str]:
return [self.column]
@staticmethod
def get_schema_cls():
return HashSplitConfig
@DeveloperAPI
def get_splitter(type: str | None = None, **kwargs) -> Splitter:
splitter_cls = split_registry.get(type)
if splitter_cls is None:
raise ValueError(f"Invalid split type: {type}")
return splitter_cls(**kwargs)
@DeveloperAPI
def split_dataset(
df: DataFrame,
global_preprocessing_parameters: PreprocessingConfigDict,
backend: Backend,
random_seed: int = default_random_seed,
) -> tuple[DataFrame, DataFrame, DataFrame]:
splitter = get_splitter(**global_preprocessing_parameters.get(SPLIT, {}))
datasets: tuple[DataFrame, DataFrame, DataFrame] = splitter.split(df, backend, random_seed)
if len(datasets[0].columns) == 0:
raise ValueError(
"Encountered an empty training set while splitting data. Please double check the preprocessing split "
"configuration."
)
# Remove partitions that are empty after splitting
datasets = [None if dataset is None else backend.df_engine.remove_empty_partitions(dataset) for dataset in datasets]
return datasets
================================================
FILE: ludwig/data/split_dataset.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import random
def split(input_path, output1, output2, split):
with open(input_path) as file:
lines = file.readlines()
random.shuffle(lines)
split_idx = int(len(lines) * split)
with open(output1, "w") as f:
for line in lines[:split_idx]:
line = line if line.endswith("\n") else line + "\n"
f.write(line)
with open(output2, "w") as f:
for line in lines[split_idx:]:
line = line if line.endswith("\n") else line + "\n"
f.write(line)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Split a file based on its lines")
parser.add_argument("-i", "--input", required=True, help="input file name")
parser.add_argument("-o1", "--output1", required=True, help="output 1 file name")
parser.add_argument("-o2", "--output2", required=True, help="output 2 file name")
# The default only takes effect when the argument is optional, so `required=True` is dropped here.
parser.add_argument("-s", "--split", type=float, default=0.8, help="fraction of lines written to output 1")
args = parser.parse_args()
split(args.input, args.output1, args.output2, args.split)
================================================
FILE: ludwig/data/utils.py
================================================
from typing import Optional
import numpy as np
from ludwig.constants import DECODER, ENCODER, SPLIT
from ludwig.types import FeatureConfigDict, PreprocessingConfigDict
from ludwig.utils.dataframe_utils import is_dask_series_or_df
from ludwig.utils.types import DataFrame
def convert_to_dict(
predictions: DataFrame,
output_features: dict[str, FeatureConfigDict],
backend: Optional["Backend"] = None, # noqa: F821
):
"""Convert predictions from DataFrame format to a dictionary."""
output = {}
for of_name, output_feature in output_features.items():
feature_keys = {k for k in predictions.columns if k.startswith(of_name)}
feature_dict = {}
for key in feature_keys:
subgroup = key[len(of_name) + 1 :]
values = predictions[key]
if is_dask_series_or_df(values, backend):
values = values.compute()
try:
values = np.stack(values.to_numpy())
except ValueError:
values = values.to_list()
feature_dict[subgroup] = values
output[of_name] = feature_dict
return output
def set_fixed_split(preprocessing_params: PreprocessingConfigDict) -> PreprocessingConfigDict:
"""Sets the split policy explicitly to a fixed split.
This potentially overrides the split configuration that the user set or what came from schema defaults.
"""
return {
**preprocessing_params,
"split": {
"type": "fixed",
"column": SPLIT,
},
}
def get_input_and_output_features(feature_configs):
"""Returns a tuple (input_features, output_features) where each element is a list of feature configs.
Determines whether a feature is an input or output feature by checking the presence of the encoder or decoder keys.
"""
input_features = []
output_features = []
for feature in feature_configs:
if ENCODER in feature:
input_features.append(feature)
elif DECODER in feature:
output_features.append(feature)
return input_features, output_features
================================================
FILE: ludwig/datasets/README.md
================================================
## Ludwig Datasets API
The Ludwig Dataset Zoo provides datasets that can be directly plugged into a Ludwig model. For each dataset, we've also
included an example Ludwig config which should train reasonably fast on a current-generation laptop.
The simplest way to use a dataset is to import it:
```python
from ludwig.datasets import titanic
# Loads into single dataframe with a 'split' column:
dataset_df = titanic.load()
# Loads into split dataframes:
train_df, test_df, _ = titanic.load(split=True)
```
The `ludwig.datasets` API provides functions to list, describe, and get datasets:
______________________________________________________________________
### list_datasets
Gets a list of the names of available datasets.
**Example:**
```python
dataset_names = ludwig.datasets.list_datasets()
```
______________________________________________________________________
### get_datasets_output_features
If a specific dataset name is passed in, then returns the output features associated with that dataset. Otherwise,
returns an ordered dictionary with dataset names as keys and dictionaries containing the output features for each
dataset as values.
**Example:**
```python
output_features = ludwig.datasets.get_datasets_output_features(dataset="titanic")
```
______________________________________________________________________
### describe_dataset
Gets a human-readable description string for a dataset.
**Example:**
```python
print(ludwig.datasets.describe_dataset("titanic"))
```
______________________________________________________________________
### get_dataset
Gets a dataset module by name.
**Example:**
```python
titanic_dataset = ludwig.datasets.get_dataset("titanic")
```
______________________________________________________________________
### model_configs_for_dataset
Gets a dictionary of model configs for the specified dataset. Keys are the config names, and may
contain the special keys:
- `default` - The default config for the dataset. Should train to decent performance in under 10 minutes on a typical
laptop without a GPU.
- `best` - The best known config for the dataset. Should be replaced when a better config is found. This is a good
opportunity for contributions: if you find a better one, please check it in and open a PR!
**Example:**
```python
configs = ludwig.datasets.model_configs_for_dataset("higgs")
default_higgs_config = configs["default"]
best_higgs_config = configs["best"]
```
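Because `default` and `best` are described as keys the dictionary *may* contain, a defensive lookup can be useful. The
following is a minimal sketch and not part of the Ludwig API: `pick_config` is a hypothetical helper, and the sample
dictionary is a placeholder standing in for the value returned by `model_configs_for_dataset`.

```python
def pick_config(configs: dict, preferred: tuple = ("best", "default")) -> dict:
    """Return the first config stored under one of the preferred keys (hypothetical helper)."""
    for key in preferred:
        if key in configs:
            return configs[key]
    raise KeyError(f"None of {preferred} found; available keys: {sorted(configs)}")


# Placeholder for ludwig.datasets.model_configs_for_dataset("higgs"); the contents are illustrative only.
example_configs = {"default": {"note": "placeholder default config"}}
chosen = pick_config(example_configs)
```

With this ordering, `best` is used when it exists and `default` is the fallback.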
______________________________________________________________________
## Training a model using a built-in dataset and config
This example code trains a model on the Titanic dataset using the default config:
```python
from ludwig.api import LudwigModel
import ludwig.datasets
titanic = ludwig.datasets.get_dataset("titanic")
dataset_df = titanic.load()
titanic_config = titanic.default_model_config
model = LudwigModel(titanic_config)
model.train(dataset_df)
```
Some datasets are hosted on [Kaggle](https://www.kaggle.com) and require a Kaggle account. To use these, you'll need to
[set up Kaggle credentials](https://www.kaggle.com/docs/api) in your environment. If the dataset is part of a Kaggle
competition, you'll need to accept the terms on the competition page.
To check programmatically, datasets have an `.is_kaggle_dataset` property.
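As a rough sketch of such a check, the helper below filters a mapping of dataset loaders by that property. The function
name `kaggle_dataset_names` and the stand-in loader objects are assumptions made for illustration; with Ludwig
installed, the loaders would instead come from `ludwig.datasets.get_dataset(name)`.

```python
from types import SimpleNamespace


def kaggle_dataset_names(loaders: dict) -> list:
    """Return the sorted names of loaders whose `is_kaggle_dataset` property is truthy."""
    return sorted(name for name, loader in loaders.items() if getattr(loader, "is_kaggle_dataset", False))


# Stand-in objects so the sketch runs without Ludwig; the names are placeholders, not real datasets.
fake_loaders = {
    "dataset_a": SimpleNamespace(is_kaggle_dataset=True),
    "dataset_b": SimpleNamespace(is_kaggle_dataset=False),
}
needs_credentials = kaggle_dataset_names(fake_loaders)
```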
## Downloading, Processing, and Exporting
Datasets are first downloaded into `LUDWIG_CACHE`, which may be set as an environment variable and defaults to
`$HOME/.ludwig_cache`.
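The lookup order described above can be sketched as follows. This is an illustration of the documented behavior, not
Ludwig's actual implementation: `resolve_ludwig_cache` is a hypothetical helper, and it assumes the home directory is
what `os.path.expanduser("~")` returns.

```python
import os


def resolve_ludwig_cache(environ=None) -> str:
    """Return the cache directory: LUDWIG_CACHE if set, else ~/.ludwig_cache (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    default_dir = os.path.join(os.path.expanduser("~"), ".ludwig_cache")
    return environ.get("LUDWIG_CACHE", default_dir)


cache_dir = resolve_ludwig_cache()
```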
Datasets are automatically loaded, processed, and re-saved as parquet files. The processed dataset is saved in
`LUDWIG_CACHE`.
If the dataset contains media files including images or audio, media files are saved in subdirectories and referenced by
relative paths from the dataset location. To ensure Ludwig can read these files during training, they should be
accessible from Ludwig's working directory.
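If the working directory differs from the dataset location, one option is to turn the relative media paths into
absolute ones before training. The sketch below assumes the paths are available as a plain list of strings;
`resolve_media_paths` and the example directory are hypothetical and not part of Ludwig.

```python
import os


def resolve_media_paths(relative_paths: list, dataset_dir: str) -> list:
    """Join each relative media path onto the dataset directory and normalize the result."""
    return [os.path.normpath(os.path.join(dataset_dir, path)) for path in relative_paths]


# Example directory and file name are placeholders.
resolved = resolve_media_paths(["profile_images/a.png"], "/data/twitter_bots")
```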
To export the processed dataset, including any media files it depends on, use the `.export` method:
```python
from ludwig.datasets import twitter_bots
# Exports twitter bots dataset and image files to the current working directory.
twitter_bots.export(".")
# The working directory should now contain:
# ./twitter_bots.parquet - The twitter bots dataset
# ./profile_images - Account profile image files
# ./profile_background_images - Account profile background image files
```
================================================
FILE: ludwig/datasets/__init__.py
================================================
import argparse
import importlib
import logging
import os
from collections import OrderedDict
from functools import lru_cache
from io import BytesIO
from typing import Any, Literal
import yaml
from ludwig.api_annotations import DeveloperAPI, PublicAPI
from ludwig.backend.base import Backend
from ludwig.constants import AUDIO, BINARY, CATEGORY, IMAGE, NUMBER, TEST, TEXT, TRAIN, TYPE, VALIDATION
from ludwig.data.cache.types import CacheableDataframe
from ludwig.datasets import configs
from ludwig.datasets.dataset_config import DatasetConfig
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
# PublicAPI
from ludwig.datasets.utils import model_configs_for_dataset # noqa
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import print_ludwig
from ludwig.utils.types import DataFrame
URI_PREFIX = "ludwig://"
HF_PREFIX = "hf://"
SPLITS = [TRAIN, VALIDATION, TEST]
def _load_dataset_config(config_filename: str):
"""Loads a dataset config."""
config_path = os.path.join(os.path.dirname(configs.__file__), config_filename)
with open(config_path) as f:
return DatasetConfig.from_dict(yaml.safe_load(f))
@lru_cache(maxsize=1)
def _get_dataset_configs() -> dict[str, DatasetConfig]:
"""Returns all dataset configs indexed by name."""
import importlib.resources
config_files = [f.name for f in importlib.resources.files(configs).iterdir() if f.name.endswith(".yaml")]
config_objects = [_load_dataset_config(f) for f in config_files]
return {c.name: c for c in config_objects}
def _get_dataset_config(dataset_name) -> DatasetConfig:
"""Get the config for a dataset."""
configs = _get_dataset_configs()
if dataset_name not in configs:
raise AttributeError(f"No config found for dataset {dataset_name}")
return configs[dataset_name]
@PublicAPI
def get_dataset(dataset_name, cache_dir=None) -> DatasetLoader:
"""Gets an instance of the dataset loader for a dataset."""
config = _get_dataset_config(dataset_name)
class_name = config.loader.split(".")[-1]
module_name = "." + ".".join(config.loader.split(".")[:-1])
loader_module = importlib.import_module(module_name, package="ludwig.datasets.loaders")
loader_cls = getattr(loader_module, class_name)
if cache_dir:
return loader_cls(config, cache_dir=cache_dir)
return loader_cls(config)
@DeveloperAPI
def load_dataset_uris(
dataset: str | DataFrame | None,
training_set: str | DataFrame | None,
validation_set: str | DataFrame | None,
test_set: str | DataFrame | None,
backend: Backend,
) -> tuple[
CacheableDataframe | None,
CacheableDataframe | None,
CacheableDataframe | None,
CacheableDataframe | None,
]:
"""Loads and returns any Ludwig dataset URIs as CacheableDataframes.
Returns the input unmodified for any non-Ludwig datasets.
"""
dataset_out, training_set_out, validation_set_out, test_set_out = dataset, training_set, validation_set, test_set
# Check that any of the datasets begin with the `hf://` prefix denoting a Hugging Face dataset URI
# Hugging Face datasets should follow the naming convention `hf://<hf_id>--<hf_subsample>`
if _is_hf(dataset, training_set):
return _load_hf_datasets(dataset, training_set, validation_set, test_set, backend)
# Check that any of the datasets begin with the `ludwig://` prefix denoting a Ludwig dataset URI
if dataset is not None:
if isinstance(dataset, str) and dataset.startswith(URI_PREFIX):
dataset_out = _load_cacheable_dataset(dataset, backend)
return dataset_out, training_set_out, validation_set_out, test_set_out
if training_set is not None:
train_df = test_df = val_df = None
training_set_checksum = None
if isinstance(training_set, str) and training_set.startswith(URI_PREFIX):
# For the training set, we only want to use the TRAINING split of the dataset
dataset_name = training_set[len(URI_PREFIX) :]
loader = get_dataset(dataset_name)
train_df, test_df, val_df = loader.load(split=True)
training_set_checksum = str(loader.get_mtime())
train_df = backend.df_engine.from_pandas(train_df)
training_set_out = CacheableDataframe(df=train_df, name=training_set, checksum=training_set_checksum)
if isinstance(validation_set, str) and validation_set.startswith(URI_PREFIX):
if validation_set == training_set:
# Reuse the loaded DF from the training split
val_df = backend.df_engine.from_pandas(val_df)
validation_set_out = CacheableDataframe(df=val_df, name=validation_set, checksum=training_set_checksum)
else:
validation_set_out = _load_cacheable_dataset(validation_set, backend)
if isinstance(test_set, str) and test_set.startswith(URI_PREFIX):
if test_set == training_set:
# Reuse the loaded DF from the training split
test_df = backend.df_engine.from_pandas(test_df)
test_set_out = CacheableDataframe(df=test_df, name=test_set, checksum=training_set_checksum)
else:
test_set_out = _load_cacheable_dataset(test_set, backend)
return dataset_out, training_set_out, validation_set_out, test_set_out
def _is_hf(dataset, training_set):
dataset_is_hf = dataset is not None and isinstance(dataset, str) and dataset.startswith(HF_PREFIX)
training_set_is_hf = (
training_set is not None and isinstance(training_set, str) and training_set.startswith(HF_PREFIX)
)
return dataset_is_hf or training_set_is_hf
def _load_hf_datasets(
dataset: str | DataFrame | None,
training_set: str | DataFrame | None,
validation_set: str | DataFrame | None,
test_set: str | DataFrame | None,
backend: Backend,
) -> tuple[
CacheableDataframe | None,
CacheableDataframe | None,
CacheableDataframe | None,
CacheableDataframe | None,
]:
"""Loads and returns any Hugging Face datasets as CacheableDataframes.
Returns the input unmodified for any non-HF datasets.
"""
dataset_out = dataset
training_set_out = training_set
validation_set_out = validation_set
test_set_out = test_set
# Check that any of the datasets begin with the `hf://` prefix denoting a Hugging Face dataset URI
# Hugging Face datasets should follow the naming convention `hf://<hf_id>--<hf_subsample>`
if dataset is not None:
if isinstance(dataset, str) and dataset.startswith(HF_PREFIX):
dataset_out = _load_cacheable_hf_dataset(dataset, backend)
return dataset_out, training_set_out, validation_set_out, test_set_out
# Because of the conditional logic (_is_hf) in load_dataset_uris, if the above block is not triggered, then
# training_set must be a string that starts with HF_PREFIX
train_df = test_df = val_df = None
loader = get_dataset("hugging_face")
hf_id, hf_subsample = _get_hf_dataset_and_subsample(training_set)
train_df, val_df, test_df = loader.load(hf_id, hf_subsample, split=True) # Call hugging_face loader
train_df = backend.df_engine.from_pandas(train_df)
training_set_out = CacheableDataframe(df=train_df, name=training_set, checksum=None)
if isinstance(validation_set, str) and validation_set.startswith(HF_PREFIX):
if validation_set == training_set:
# Reuse the loaded DF from the training split
val_df = backend.df_engine.from_pandas(val_df)
validation_set_out = CacheableDataframe(df=val_df, name=validation_set, checksum=None)
else: # This handles an edge case -- NOT EXPECTED USER BEHAVIOR
logging.warning(
"A Hugging Face validation set has been passed in that is different from the training set. "
"This is not recommended."
)
validation_set_out = _load_cacheable_hf_dataset(validation_set, backend, split_set=VALIDATION)
if isinstance(test_set, str) and test_set.startswith(HF_PREFIX):
if test_set == training_set:
# Reuse the loaded DF from the training split
test_df = backend.df_engine.from_pandas(test_df)
test_set_out = CacheableDataframe(df=test_df, name=test_set, checksum=None)
else: # This handles an edge case -- NOT EXPECTED USER BEHAVIOR
logging.warning(
"A Hugging Face test set has been passed in that is different from the training set. "
"This is not recommended."
)
test_set_out = _load_cacheable_hf_dataset(test_set, backend, split_set=TEST)
return dataset_out, training_set_out, validation_set_out, test_set_out
def _load_cacheable_hf_dataset(
dataset: str, backend: Backend, split_set: Literal["train", "validation", "test"] | None = None
) -> CacheableDataframe:
loader = get_dataset("hugging_face")
hf_id, hf_subsample = _get_hf_dataset_and_subsample(dataset)
if split_set:
train_df, validation_df, test_df = loader.load(hf_id, hf_subsample, split=True)
df = [train_df, validation_df, test_df][
SPLITS.index(split_set)
] # split_set should be one of TRAIN, VALIDATION, or TEST
else:
df = loader.load(hf_id, hf_subsample, split=False)
df = backend.df_engine.from_pandas(df)
return CacheableDataframe(df=df, name=dataset, checksum=None)
def _load_cacheable_dataset(dataset: str, backend: Backend) -> CacheableDataframe:
dataset_name = dataset[len(URI_PREFIX) :]
loader = get_dataset(dataset_name)
df = loader.load(split=False)
df = backend.df_engine.from_pandas(df)
return CacheableDataframe(df=df, name=dataset, checksum=str(loader.get_mtime()))
@PublicAPI
def list_datasets() -> list[str]:
"""Returns a list of the names of all available datasets."""
return sorted(_get_dataset_configs().keys())
@PublicAPI
def get_datasets_output_features(
dataset: str = None, include_competitions: bool = True, include_data_modalities: bool = False
) -> dict:
"""Returns a dictionary with the output features for each dataset. Optionally, you can pass a dataset name
which will then cause the function to return a dictionary with the output features for that dataset.
Because Hugging Face Datasets are loaded dynamically through a shared connector, they don't have fixed output
features. As such, we exclude Hugging Face datasets here.
:param dataset: (str) name of the dataset
:param include_competitions: (bool) whether to include the output features from kaggle competition datasets
:param include_data_modalities: (bool) whether to include the data modalities associated with the prediction task
:return: (dict) dictionary with the output features for each dataset or a dictionary with the output features for
the specified dataset
"""
ordered_configs = OrderedDict(sorted(_get_dataset_configs().items()))
competition_datasets = []
hugging_face_datasets = []
for name, config in ordered_configs.items():
if not include_competitions and config.kaggle_competition:
competition_datasets.append(name)
continue
if config.name == "hugging_face":
# There is no output_features attribute for hugging_face datasets
hugging_face_datasets.append(name)
continue
ordered_configs[name] = {"name": config.name, "output_features": config.output_features}
if include_data_modalities:
column_types = {column[TYPE] for column in config.columns}
data_modalities = set()
if NUMBER in column_types or CATEGORY in column_types or BINARY in column_types:
data_modalities.add("Tabular")
if TEXT in column_types:
data_modalities.add("Text")
if IMAGE in column_types:
data_modalities.add("Image")
if AUDIO in column_types:
data_modalities.add("Audio")
ordered_configs[name]["data_modalities"] = data_modalities
if dataset:
return ordered_configs[dataset]
if not include_competitions:
for competition in competition_datasets:
del ordered_configs[competition]
del ordered_configs["hugging_face"]
return ordered_configs
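The modality tagging in `get_datasets_output_features` can be illustrated in isolation. The following is a minimal sketch, not Ludwig's implementation: it assumes the feature-type constants (`NUMBER`, `CATEGORY`, `BINARY`, `TEXT`, `IMAGE`, `AUDIO`) equal the lowercase strings used in the YAML `type:` fields, and `infer_data_modalities` is a hypothetical helper name.

```python
def infer_data_modalities(column_types: set) -> set:
    """Map a set of column type names to coarse data modalities."""
    modalities = set()
    # Any numeric, categorical, or binary column marks the dataset as tabular.
    if column_types & {"number", "category", "binary"}:
        modalities.add("Tabular")
    # The remaining modalities map one-to-one from a single column type.
    for type_name, modality in (("text", "Text"), ("image", "Image"), ("audio", "Audio")):
        if type_name in column_types:
            modalities.add(modality)
    return modalities


print(infer_data_modalities({"text", "category"}))
```

Column types outside these groups (for example `set` or `date`) contribute no modality in this sketch, which matches the branches shown above.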
@PublicAPI
def describe_dataset(dataset_name: str) -> str:
"""Returns the description of the dataset."""
return _get_dataset_configs()[dataset_name].description
@PublicAPI
def download_dataset(dataset_name: str, output_dir: str = "."):
"""Downloads the dataset to the specified directory."""
output_dir = os.path.expanduser(os.path.normpath(output_dir))
dataset = get_dataset(dataset_name)
dataset.export(output_dir)
@DeveloperAPI
def get_buffer(dataset_name: str, kaggle_username: str = None, kaggle_key: str = None) -> BytesIO:
"""Returns a byte buffer for the specified dataset."""
try:
if dataset_name.startswith(HF_PREFIX):
hf_id, hf_subsample = _get_hf_dataset_and_subsample(dataset_name)
dataset = get_dataset("hugging_face").load(hf_id, hf_subsample)
else:
dataset = get_dataset(dataset_name).load(kaggle_username=kaggle_username, kaggle_key=kaggle_key)
buffer = BytesIO(dataset.to_parquet())
return buffer
except Exception as e:
logging.error(f"Failed to upload dataset {dataset_name}: {e}")
def _get_hf_dataset_and_subsample(dataset_name: str) -> tuple[str, str | None]:
"""Returns the Hugging Face ID and subsample name from the dataset name.
The dataset name should follow the format "{HF_PREFIX}{hf_id}--{hf_subsample}"
Examples (Dataset Name --> HF ID; HF subsample):
"hf://wikisql" --> "wikisql"; None
"hf://ColumbiaNLP/FLUTE" --> "ColumbiaNLP/FLUTE"; None
"hf://mstz/adult--income" --> "mstz/adult"; "income"
"""
dataset_name = dataset_name[len(HF_PREFIX) :]
dataset_name = dataset_name.split("--")
if len(dataset_name) == 1:
return dataset_name[0], None
return dataset_name[0], dataset_name[1]
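The naming convention handled by `_get_hf_dataset_and_subsample` can be tried out with a standalone sketch. It assumes `HF_PREFIX` is `"hf://"` (inferred from the docstring examples); `split_hf_dataset_name` is a hypothetical name, and unlike the function above it validates the prefix and splits on the first `--` only.

```python
from typing import Optional

HF_PREFIX = "hf://"  # assumed value, inferred from the docstring examples


def split_hf_dataset_name(dataset_name: str) -> tuple[str, Optional[str]]:
    """Split "hf://<hf_id>--<hf_subsample>" into (hf_id, hf_subsample)."""
    if not dataset_name.startswith(HF_PREFIX):
        raise ValueError(f"Expected a name starting with {HF_PREFIX!r}, got {dataset_name!r}")
    remainder = dataset_name[len(HF_PREFIX):]
    # partition splits on the first "--" only; separator is "" when "--" is absent.
    hf_id, separator, hf_subsample = remainder.partition("--")
    return hf_id, (hf_subsample if separator else None)


print(split_hf_dataset_name("hf://wikisql"))
print(split_hf_dataset_name("hf://mstz/adult--income"))
```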
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This command downloads and lists Ludwig-ready datasets.",
prog="ludwig datasets",
usage="%(prog)s [options]",
)
sub_parsers = parser.add_subparsers(dest="command", help="download and list datasets")
parser_download = sub_parsers.add_parser("download", help="download a dataset")
parser_download.add_argument("dataset", help="dataset to download")
parser_download.add_argument(
"-o",
"--output_dir",
type=str,
default=".",
help="output directory to download into",
required=False,
)
sub_parsers.add_parser("list", help="list datasets")
parser_describe = sub_parsers.add_parser("describe", help="describe datasets")
parser_describe.add_argument("dataset", help="dataset to describe")
args = parser.parse_args(sys_argv)
print_ludwig(f"Datasets {args.command}", LUDWIG_VERSION)
if args.command == "list":
datasets = list_datasets()
for ds in datasets:
print(ds)
elif args.command == "describe":
print(describe_dataset(args.dataset))
elif args.command == "download":
download_dataset(args.dataset, args.output_dir)
else:
raise ValueError(f"Unrecognized command: {args.command}")
def __getattr__(name: str) -> Any:
"""Module-level __getattr__ allows us to return an instance of a class. For example:
from ludwig.datasets import titanic
returns an instance of DatasetLoader configured to load titanic.
If you want to download a dataset in a non-default ludwig cache directory, there are two options:
1. set the LUDWIG_CACHE environment variable to your desired path before importing the dataset
2. Use ludwig.datasets.get_dataset(dataset_name, cache_dir=)
"""
public_methods = {
"list_datasets",
"describe_dataset",
"download_dataset",
"cli",
"get_dataset",
"model_configs_for_dataset",
}
if name in public_methods:
return globals()[name]
return get_dataset(name)
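The module-level `__getattr__` hook (PEP 562) used above is only consulted when normal attribute lookup on the module fails. A minimal sketch of the mechanism follows, using a synthetic module; `make_lazy_module` and `fake_get_dataset` are hypothetical stand-ins and not part of Ludwig.

```python
import types


def make_lazy_module(name: str, public: dict, factory) -> types.ModuleType:
    """Build a module whose unknown attributes are resolved by `factory`."""
    module = types.ModuleType(name)
    module.__dict__.update(public)

    def module_getattr(attr: str):
        # Called only when `attr` is not found in the module's namespace.
        return factory(attr)

    module.__getattr__ = module_getattr
    return module


def fake_get_dataset(dataset_name: str) -> str:
    # Hypothetical stand-in for a loader factory.
    return f"loader:{dataset_name}"


demo = make_lazy_module("demo_datasets", {"list_datasets": lambda: ["titanic"]}, fake_get_dataset)
print(demo.list_datasets())  # regular attribute, found without the hook
print(demo.titanic)          # unknown attribute, resolved through the hook
```

One consequence of this design is that any misspelled attribute name is also forwarded to the factory, so the error a caller sees comes from the factory rather than from a plain `AttributeError`.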
================================================
FILE: ludwig/datasets/archives.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import gzip
import logging
import os
import shutil
import tarfile
from enum import Enum
from zipfile import ZipFile
from ludwig.utils.fs_utils import upload_output_directory
logger = logging.getLogger(__name__)
class ArchiveType(str, Enum):
"""The type of file archive."""
UNKNOWN = "unknown"
ZIP = "zip"
GZIP = "gz"
TAR = "tar"
TAR_ZIP = "tar.z"
TAR_BZ2 = "tar.bz2"
TAR_GZ = "tar.gz"
def infer_archive_type(archive_path):
"""Try to infer archive type from file extension."""
# Get the path extension including multiple extensions, ex. ".tar.gz"
extension = ".".join(["", *os.path.basename(archive_path).split(".")[1:]])
extension = extension.lower()
if extension.endswith(".tar.z") or extension.endswith(".tar.zip"):
return ArchiveType.TAR_ZIP
elif extension.endswith(".tar.bz2") or extension.endswith(".tbz2"):
return ArchiveType.TAR_BZ2
elif extension.endswith(".tar.gz") or extension.endswith(".tgz"):
return ArchiveType.TAR_GZ
elif extension.endswith(".tar"):
return ArchiveType.TAR
elif extension.endswith(".zip") or extension.endswith(".zipx"):
return ArchiveType.ZIP
elif extension.endswith(".gz") or extension.endswith(".gzip"):
return ArchiveType.GZIP
else:
return ArchiveType.UNKNOWN
def is_archive(path):
"""Returns True if the path has a supported archive type."""
return infer_archive_type(path) != ArchiveType.UNKNOWN
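The order of the suffix checks in `infer_archive_type` matters: compound suffixes such as `.tar.gz` have to be tested before `.gz`. A table-driven sketch of the same idea is shown below; `SUFFIX_TO_TYPE` and `guess_archive_type` are hypothetical names, and plain strings stand in for the `ArchiveType` enum.

```python
import os

# Hypothetical suffix table, checked in order so that ".tar.gz" wins over ".gz".
SUFFIX_TO_TYPE = [
    ((".tar.z", ".tar.zip"), "tar.z"),
    ((".tar.bz2", ".tbz2"), "tar.bz2"),
    ((".tar.gz", ".tgz"), "tar.gz"),
    ((".tar",), "tar"),
    ((".zip", ".zipx"), "zip"),
    ((".gz", ".gzip"), "gz"),
]


def guess_archive_type(path: str) -> str:
    """Return an archive type string based on the file name suffix."""
    filename = os.path.basename(path).lower()
    for suffixes, archive_type in SUFFIX_TO_TYPE:
        if filename.endswith(suffixes):  # str.endswith accepts a tuple of suffixes
            return archive_type
    return "unknown"


for example in ("amazon_review_polarity_csv.tgz", "creditcardfraud.zip", "train.csv"):
    print(example, "->", guess_archive_type(example))
```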
def list_archive(archive_path, archive_type: ArchiveType | None = None) -> list[str]:
"""Return list of files contained in an archive (without extracting them)."""
if archive_type is None:
archive_type = infer_archive_type(archive_path)
if archive_type == ArchiveType.UNKNOWN:
logger.error(
f"Could not infer type of archive {archive_path}. May be an unsupported archive type. "
"Specify archive_type in the dataset config if this file has an unknown file extension."
)
return []
if archive_type == ArchiveType.ZIP:
with ZipFile(archive_path) as zfile:
return zfile.namelist()
elif archive_type == ArchiveType.GZIP:
return [".".join(archive_path.split(".")[:-1])] # Path minus the .gz extension
elif archive_type in {ArchiveType.TAR, ArchiveType.TAR_ZIP, ArchiveType.TAR_BZ2, ArchiveType.TAR_GZ}:
with tarfile.open(archive_path) as tar_file:
return tar_file.getnames()
else:
logger.error(f"Unsupported archive: {archive_path}")
return []
def extract_archive(archive_path: str, archive_type: ArchiveType | None = None) -> list[str]:
"""Extracts files from archive (into the same directory), returns a list of extracted files.
Args:
archive_path - The full path to the archive.
Returns A list of the files extracted.
"""
if archive_type is None:
archive_type = infer_archive_type(archive_path)
if archive_type == ArchiveType.UNKNOWN:
logger.error(
f"Could not infer type of archive {archive_path}. May be an unsupported archive type. "
"Specify archive_type in the dataset config if this file has an unknown file extension."
)
return []
archive_directory = os.path.dirname(archive_path)
directory_contents_before = os.listdir(archive_directory)
with upload_output_directory(archive_directory) as (tmpdir, _):
if archive_type == ArchiveType.ZIP:
with ZipFile(archive_path) as zfile:
zfile.extractall(tmpdir)
elif archive_type == ArchiveType.GZIP:
gzip_content_file = ".".join(archive_path.split(".")[:-1]) # Path minus the .gz extension
with gzip.open(archive_path) as gzfile:
with open(os.path.join(tmpdir, gzip_content_file), "wb") as output:
shutil.copyfileobj(gzfile, output)
elif archive_type in {ArchiveType.TAR, ArchiveType.TAR_ZIP, ArchiveType.TAR_BZ2, ArchiveType.TAR_GZ}:
with tarfile.open(archive_path) as tar_file:
def is_within_directory(directory, target):
abs_directory = os.path.abspath(directory)
abs_target = os.path.abspath(target)
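# commonpath compares whole path components; commonprefix compares characters and would
# also accept sibling directories whose names merely start with the target directory name
prefix = os.path.commonpath([abs_directory, abs_target])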
return prefix == abs_directory
def safe_extract(tar, path=".", members=None, *, numeric_owner=False):
for member in tar.getmembers():
member_path = os.path.join(path, member.name)
if not is_within_directory(path, member_path):
raise Exception("Attempted Path Traversal in Tar File")
tar.extractall(path, members, numeric_owner=numeric_owner)
safe_extract(tar_file, path=tmpdir)
else:
logger.error(f"Unsupported archive: {archive_path}")
directory_contents_after = set(os.listdir(archive_directory))
return sorted(directory_contents_after.difference(directory_contents_before))
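The path-traversal guard in `extract_archive` depends on how two absolute paths are compared. The sketch below, an illustration rather than the library's code, contrasts a component-wise comparison (`os.path.commonpath`) with a character-wise one (`os.path.commonprefix`); the directory names are made up for the example.

```python
import os


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a location inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    # commonpath compares whole path components, not individual characters.
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


base = os.path.join(os.sep, "tmp", "extract")
print(is_within_directory(base, os.path.join(base, "data", "train.csv")))
print(is_within_directory(base, os.path.join(base, "..", "extract_evil", "x")))
# A character-wise prefix treats the sibling directory as if it were a child.
print(os.path.commonprefix([base, base + "_evil"]) == base)
```

Note that `os.path.commonpath` raises `ValueError` when the paths are on different drives on Windows, so a caller may want to treat that case as "not within the directory".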
================================================
FILE: ludwig/datasets/configs/__init__.py
================================================
================================================
FILE: ludwig/datasets/configs/adult_census_income.yaml
================================================
version: 1.0
name: adult_census_income
download_urls:
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
train_filenames: adult.data
test_filenames: adult.test
sha256:
adult.data: 5b00264637dbfec36bdeaab5676b0b309ff9eb788d63554ca0a249491c86603d
adult.test: a2a9044bc167a35b2361efbabec64e89d69ce82d9790d2980119aac5fd7e9c05
loader: adult_census_income.AdultCensusIncomeLoader
description: |
Predict whether income exceeds $50K/yr based on census data
https://archive.ics.uci.edu/ml/datasets/adult
columns:
- name: age
type: number
- name: workclass
type: category
- name: fnlwgt
type: category
- name: education
type: category
- name: education-num
type: category
- name: marital-status
type: category
- name: occupation
type: category
- name: relationship
type: category
- name: race
type: category
- name: sex
type: category
- name: capital-gain
type: number
- name: capital-loss
type: number
- name: hours-per-week
type: number
- name: native-country
type: category
- name: income
type: category
output_features:
- name: income
type: binary
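The `sha256` mapping in a dataset config pairs each downloaded file name with its expected digest. A minimal sketch of how such a mapping could be checked is given below; `sha256_of_file` and `verify_checksums` are hypothetical helpers, not Ludwig's loader code, and the sample file is generated on the fly.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(directory: str, expected: dict) -> list:
    """Return the file names whose digest differs from the expected one."""
    return [
        name
        for name, digest in expected.items()
        if sha256_of_file(os.path.join(directory, name)) != digest
    ]


with tempfile.TemporaryDirectory() as tmpdir:
    content = b"39, State-gov\n"
    with open(os.path.join(tmpdir, "sample.data"), "wb") as f:
        f.write(content)
    matching = {"sample.data": hashlib.sha256(content).hexdigest()}
    print(verify_checksums(tmpdir, matching))                    # no mismatches
    print(verify_checksums(tmpdir, {"sample.data": "0" * 64}))  # one mismatch
```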
================================================
FILE: ludwig/datasets/configs/ae_price_prediction.yaml
================================================
version: 1.0
name: ae_price_prediction
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/ae_price_prediction/train.pq
- https://automl-mm-bench.s3.amazonaws.com/ae_price_prediction/test.pq
sha256:
test.pq: d05242580e011f3ac5a1a8f0069fd7788ceeacd6b2fb00ca7f409991f998c95e
train.pq: 181cfebbedd5c6e2bdc6261706103edddfc6eeb4604b8c6ffdc3d084a6e09a4e
train_filenames: train.pq
test_filenames: test.pq
description: |
Innerwear Data from Victoria's Secret and Others
600,000+ innerwear product data extracted from popular retail sites
https://www.kaggle.com/PromptCloudHQ/innerwear-data-from-victorias-secret-and-others
columns:
- name: product_name
type: category
- name: mrp
type: category
- name: price
type: number
- name: pdp_url
type: category
- name: brand_name
type: category
- name: product_category
type: category
- name: retailer
type: category
- name: description
type: text
- name: rating
type: number
- name: review_count
type: number
- name: style_attributes
type: set
- name: total_sizes
type: set
- name: available_size
type: set
- name: color
type: category
output_features:
- name: price
type: number
================================================
FILE: ludwig/datasets/configs/agnews.yaml
================================================
version: 1.0
name: agnews
download_urls:
- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
train_filenames: train.csv
test_filenames: test.csv
sha256:
test.csv: 521465c2428ed7f02f8d6db6ffdd4b5447c1c701962353eb2c40d548c3c85699
train.csv: 76a0a2d2f92b286371fe4d4044640910a04a803fdd2538e0f3f29a5c6f6b672e
loader: agnews.AGNewsLoader
description: |
News articles categorized as "World", "Sports", "Business", and "Science".
columns:
- name: class_index
type: category
- name: title
type: text
- name: description
type: text
output_features:
- name: class_index
type: category
================================================
FILE: ludwig/datasets/configs/allstate_claims_severity.yaml
================================================
version: 1.0
name: allstate_claims_severity
kaggle_competition: allstate-claims-severity
archive_filenames: allstate-claims-severity.zip
sha256:
allstate-claims-severity.zip: 165f7b4bc5ed40f43656dc958da6572143a7e126e2d37bcd41f1299bfbaa68e2
train_filenames: train.csv
test_filenames: test.csv
loader: allstate_claims_severity.AllstateClaimsSeverityLoader
description: |
Allstate Claims Severity.
https://www.kaggle.com/c/allstate-claims-severity/overview
output_features:
- name: loss
type: number
================================================
FILE: ludwig/datasets/configs/alpaca.yaml
================================================
version: 1.0
name: alpaca
download_urls: https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
dataset_filenames: alpaca_data.json
description: |
Stanford Alpaca instruction-tuning dataset (https://github.com/tatsu-lab/stanford_alpaca) for LLM fine-tuning.
columns:
- name: instruction
type: text
- name: input
type: text
- name: output
type: text
output_features:
- name: output
type: text
================================================
FILE: ludwig/datasets/configs/amazon_employee_access_challenge.yaml
================================================
version: 1.0
name: amazon_employee_access_challenge
kaggle_competition: amazon-employee-access-challenge
archive_filenames: amazon-employee-access-challenge.zip
train_filenames: train.csv
test_filenames: test.csv
sha256:
amazon-employee-access-challenge.zip: bba1cf24bc01f390e7faf3f9cdbebd6267c875d51a36a2c625ce66e0c3e71db7
description: |
There is a considerable amount of data regarding an employee’s role within an organization and the resources to which
they have access. Given the data related to current employees and their provisioned access, models can be built that
automatically determine access privileges as employees enter and leave roles within a company.
https://www.kaggle.com/c/amazon-employee-access-challenge
output_features:
- name: ACTION
type: binary
================================================
FILE: ludwig/datasets/configs/amazon_review_polarity.yaml
================================================
version: 1.0
name: amazon_review_polarity
download_urls: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz
train_filenames: amazon_review_polarity_csv/train.csv
test_filenames: amazon_review_polarity_csv/test.csv
sha256:
amazon_review_polarity_csv.tgz: d2a3ee7a214497a5d1b8eaed7c8d7ba2737de00ada3b0ec46243983efa100361
description: |
The Amazon Reviews Polarity dataset
Details:
34,686,770 Amazon reviews from 6,643,669 users on 2,441,053
products, from the Stanford Network Analysis Project (SNAP).
This dataset contains 600,000 training samples and 130,000
testing samples in each class.
Dataset source:
Character-level Convolutional Networks for Text Classification
Xiang Zhang et al., 2015
columns:
- name: label
type: binary
- name: review_title
type: text
- name: review_text
type: text
output_features:
- name: label
type: binary
================================================
FILE: ludwig/datasets/configs/amazon_reviews.yaml
================================================
version: 1.0
name: amazon_reviews
download_urls: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_full_csv.tgz
train_filenames: amazon_review_full_csv/train.csv
test_filenames: amazon_review_full_csv/test.csv
sha256:
amazon_review_full_csv.tgz: 4af62eeee139d0142e0747340b68646d23483d9475c33ea0641ee9175b423443
description: |
The Amazon Reviews dataset
Details:
34,686,770 Amazon reviews from 6,643,669 users on 2,441,053
products, from the Stanford Network Analysis Project (SNAP).
This dataset contains 600,000 training samples and 130,000
testing samples in each class.
Dataset source:
Character-level Convolutional Networks for Text Classification
Xiang Zhang et al., 2015
columns:
- name: label
type: category
- name: review_title
type: text
- name: review_text
type: text
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/ames_housing.yaml
================================================
version: 1.0
name: ames_housing
kaggle_competition: house-prices-advanced-regression-techniques
archive_filenames: house-prices-advanced-regression-techniques.zip
train_filenames: train.csv
test_filenames: test.csv
sha256:
house-prices-advanced-regression-techniques.zip: 65f769a9157a2581671957ed08da8a8162d53e67b4e9970ee856b634deb11d9f
description: |
The Ames Housing dataset.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
output_features:
- name: SalePrice
type: number
fallback_mirrors:
- name: predibase
download_paths: s3://ludwig-tests/ludwig_backup/house-prices-advanced-regression-techniques.zip
================================================
FILE: ludwig/datasets/configs/bbcnews.yaml
================================================
version: 1.0
name: bbcnews
kaggle_competition: learn-ai-bbc
archive_filenames: learn-ai-bbc.zip
train_filenames: "BBC News Train.csv"
test_filenames: "BBC News Test.csv"
sha256:
learn-ai-bbc.zip: 450dd79c6654248af15d91d94c269fe7e8001effd89389f93c7184aac6699e62
description: |
BBC News Classification from Kaggle.
https://www.kaggle.com/competitions/learn-ai-bbc/overview
output_features:
- name: Category
type: category
================================================
FILE: ludwig/datasets/configs/bnp_claims_management.yaml
================================================
version: 1.0
name: bnp_claims_management
kaggle_competition: bnp-paribas-cardif-claims-management
archive_filenames: bnp-paribas-cardif-claims-management.zip
train_filenames: train.csv
test_filenames: test.csv
sha256:
bnp-paribas-cardif-claims-management.zip: c01a11ceae565bc95ec30a1ef4c9ffe4aa27e07d6e433776e90a4d5474f3e95d
description: |
The BNP Paribas Cardif Claims Management dataset.
https://www.kaggle.com/c/bnp-paribas-cardif-claims-management
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/bookprice_prediction.yaml
================================================
version: 1.0
name: bookprice_prediction
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/train.csv
- https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/test.csv
sha256:
test.csv: 75bcc853efe734a53764127428e005bb9eb7585ad3dc1dce2eb284fa04313c1b
train.csv: dd978b591e623f9c5d4f9ade0f237200597afcad2c6417eb1e764698f1afcfcf
train_filenames: train.csv
test_filenames: test.csv
description: |
Here we explore a database of books of different genres, from thousands of authors.
In this challenge, participants are required to use the dataset to build a
Machine Learning model to predict the price of books based on a given set of features.
https://machinehack.com/hackathons/predict_the_price_of_books/overview
columns:
- name: Title
type: category
- name: Author
type: category
- name: Edition
type: category
- name: Reviews
type: number
- name: Ratings
type: number
- name: Synopsis
type: text
- name: Genre
type: category
- name: BookCategory
type: category
- name: Price
type: number
output_features:
- name: Price
type: number
================================================
FILE: ludwig/datasets/configs/california_house_price.yaml
================================================
version: 1.0
name: california_house_price
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/kaggle-california-house-prices/train.csv
- https://automl-mm-bench.s3.amazonaws.com/kaggle-california-house-prices/test.csv
sha256:
test.csv: b5bb9ed6e56cbdd0a410e186a19c6fe137c2ffbb50ba6b0808540434a8123dc6
train.csv: 907d45804e622fb136a9d55bde97269f421fb9b8f7c9f34416672cf7078ee94b
train_filenames: train.csv
test_filenames: test.csv
description: |
Predict house sale prices based on the house information, such as # of bedrooms,
living areas, locations, near-by schools, and the seller summary. The data consist
of houses sold in California in 2020, with houses in the test dataset sold after
the ones in the training dataset.
https://www.kaggle.com/c/california-house-prices
columns:
- name: Address
type: category
- name: Sold Price
type: number
- name: Summary
type: text
- name: Type
type: category
- name: Year built
type: number
- name: Heating
type: category
- name: Cooling
type: category
- name: Parking
type: category
- name: Lot
type: number
- name: Bedrooms
type: number
- name: Bathrooms
type: number
- name: Full bathrooms
type: number
- name: Total interior livable area
type: number
- name: Total spaces
type: number
- name: Garage spaces
type: number
- name: Region
type: category
- name: Elementary School
type: category
- name: Elementary School Score
type: number
- name: Elementary School Distance
type: number
- name: Middle School
type: category
- name: Middle School Score
type: number
- name: Middle School Distance
type: number
- name: High School
type: category
- name: High School Score
type: number
- name: High School Distance
type: number
- name: Flooring
type: set
- name: Heating features
type: set
- name: Cooling features
type: set
- name: Appliances included
type: set
- name: Laundry features
type: set
- name: Parking features
type: set
- name: Tax assessed value
type: number
- name: Annual tax amount
type: number
- name: Listed On
type: date
- name: Listed Price
type: number
- name: Last Sold On
type: date
- name: Last Sold Price
type: number
- name: City
type: category
- name: Zip
type: category
- name: State
type: category
output_features:
- name: Sold Price
type: number
================================================
FILE: ludwig/datasets/configs/camseq.yaml
================================================
version: 1.0
name: camseq
kaggle_dataset_id: carlolepelaars/camseq-semantic-segmentation
archive_filenames: camseq-semantic-segmentation.zip
sha256:
camseq-semantic-segmentation.zip: ea3aeba2661d9b3e3ea406668e7d9240cb2ba0c7e374914bb6d866147faff502
loader: camseq.CamseqLoader
preserve_paths:
- images
- masks
description: |
CamSeq01 Cambridge Labeled Objects in Video
https://www.kaggle.com/datasets/carlolepelaars/camseq-semantic-segmentation
columns:
- name: image_path
type: image
- name: mask_path
type: image
output_features:
- name: mask_path
type: image
================================================
FILE: ludwig/datasets/configs/code_alpaca.yaml
================================================
version: 1.0
name: code_alpaca
download_urls: https://raw.githubusercontent.com/sahil280114/codealpaca/master/data/code_alpaca_20k.json
train_filenames: code_alpaca_20k.json
loader: code_alpaca_loader.CodeAlpacaLoader
description: |
This dataset, created by sahil280114, aims to build and share an instruction-following LLaMA model for code generation. The repo containing
this dataset is fully based on Stanford Alpaca, and only changes the data used for training.
columns:
- name: instruction
type: text
- name: input
type: text
- name: output
type: text
output_features:
- name: output
type: text
================================================
FILE: ludwig/datasets/configs/connect4.yaml
================================================
version: 1.0
name: connect4
kaggle_dataset_id: tbrewer/connect-4
archive_filenames: connect-4.zip
dataset_filenames: c4_game_database.csv
sha256:
connect-4.zip: 46c33c47f2664948a4abe53bafee92a602773f31db615bc8bd239e1f98a3d2cf
description: |
Each row represents the end results of a Connect-4 game.
Columns 1-42 are the positions on the grid from left to right, top to bottom. Each element in these columns represents a player's piece (1 or -1); 0 marks an empty cell.
Column 43 marks the winner of the game: -1 or 1, or 0 for tie games.
columns:
- name: pos_01
type: number
- name: pos_02
type: number
- name: pos_03
type: number
- name: pos_04
type: number
- name: pos_05
type: number
- name: pos_06
type: number
- name: pos_07
type: number
- name: pos_08
type: number
- name: pos_09
type: number
- name: pos_10
type: number
- name: pos_11
type: number
- name: pos_12
type: number
- name: pos_13
type: number
- name: pos_14
type: number
- name: pos_15
type: number
- name: pos_16
type: number
- name: pos_17
type: number
- name: pos_18
type: number
- name: pos_19
type: number
- name: pos_20
type: number
- name: pos_21
type: number
- name: pos_22
type: number
- name: pos_23
type: number
- name: pos_24
type: number
- name: pos_25
type: number
- name: pos_26
type: number
- name: pos_27
type: number
- name: pos_28
type: number
- name: pos_29
type: number
- name: pos_30
type: number
- name: pos_31
type: number
- name: pos_32
type: number
- name: pos_33
type: number
- name: pos_34
type: number
- name: pos_35
type: number
- name: pos_36
type: number
- name: pos_37
type: number
- name: pos_38
type: number
- name: pos_39
type: number
- name: pos_40
type: number
- name: pos_41
type: number
- name: pos_42
type: number
- name: winner
type: number
output_features:
- name: winner
type: category
================================================
FILE: ludwig/datasets/configs/consumer_complaints.yaml
================================================
version: 1.0
name: consumer_complaints
kaggle_dataset_id: selener/consumer-complaint-database
archive_filenames: consumer-complaint-database.zip
dataset_filenames: rows.csv
loader: consumer_complaints_loader.ConsumerComplaintsLoader
description: |
The dataset contains information about complaints that customers have made about multiple products and
services in the financial sector, such as Credit Reports, Student Loans, Money Transfer, etc. The date of each
complaint ranges from November 2011 to May 2019.
columns:
- name: Date received
type: date
- name: Product
type: text
- name: Sub-product
type: text
- name: Issue
type: text
- name: Sub-issue
type: text
- name: Consumer complaint narrative
type: text
- name: Company public response
type: text
- name: Company
type: text
- name: State
type: category
- name: ZIP code
type: category
- name: Tags
type: category
- name: Consumer consent provided?
type: text
- name: Submitted via
type: category
- name: Date sent to company
type: date
- name: Company response to consumer
type: text
- name: Timely response?
type: binary
- name: Consumer disputed?
type: binary
- name: Complaint ID
type: number
output_features:
- name: Issue
type: text
================================================
FILE: ludwig/datasets/configs/consumer_complaints_generation.yaml
================================================
version: 1.0
name: consumer_complaints_generation
download_urls: https://predibase-public-us-west-2.s3.us-west-2.amazonaws.com/datasets/consumer_complaints_gen_tutorial.csv
train_filenames: consumer_complaints_gen_tutorial.csv
description: |
The dataset contains information about complaints that customers have made about multiple products and
services in the financial sector, such as Credit Reports, Student Loans, Money Transfer, etc. The date of each
complaint ranges from November 2011 to May 2019. The dataset has been modified to be used for text generation.
We have added a structured JSON field that contains a company generated response to the raised complaint. The idea
is to fine-tune an LLM to generate this output JSON field.
columns:
- name: Complaint ID
type: number
- name: Date received
type: date
- name: Product
type: text
- name: Issue
type: text
- name: Complaint
type: text
- name: Company Response
type: text
- name: Structured JSON Output
type: text
output_features:
- name: Structured JSON Output
type: text
================================================
FILE: ludwig/datasets/configs/creditcard_fraud.yaml
================================================
version: 1.0
name: creditcard_fraud
kaggle_dataset_id: mlg-ulb/creditcardfraud
archive_filenames: creditcardfraud.zip
sha256:
creditcardfraud.zip: a0360ce715992212e9ac72d8ccdca97f4be87dc1fdf2bed011358f7ab409a28a
loader: creditcard_fraud.CreditCardFraudLoader
description: |
The Machine Learning Group ULB Dataset
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
columns:
- name: Time
type: number
- name: V1
type: number
- name: V2
type: number
- name: V3
type: number
- name: V4
type: number
- name: V5
type: number
- name: V6
type: number
- name: V7
type: number
- name: V8
type: number
- name: V9
type: number
- name: V10
type: number
- name: V11
type: number
- name: V12
type: number
- name: V13
type: number
- name: V14
type: number
- name: V15
type: number
- name: V16
type: number
- name: V17
type: number
- name: V18
type: number
- name: V19
type: number
- name: V20
type: number
- name: V21
type: number
- name: V22
type: number
- name: V23
type: number
- name: V24
type: number
- name: V25
type: number
- name: V26
type: number
- name: V27
type: number
- name: V28
type: number
- name: Amount
type: number
- name: Class
type: number
output_features:
- name: Class
type: binary
================================================
FILE: ludwig/datasets/configs/customer_churn_prediction.yaml
================================================
version: 1.0
name: customer_churn_prediction
kaggle_competition: customer-churn-prediction-2020
archive_filenames: customer-churn-prediction-2020.zip
train_filenames: train.csv
test_filenames: test.csv
sha256:
customer-churn-prediction-2020.zip: fb5cbc787081a6a559592230c657a0520a181447da6eb2adc34a3aebbe8ed9ca
description: |
Dataset from a Kaggle competition that is about predicting whether a customer will change
telecommunications provider, something known as "churning".
https://www.kaggle.com/c/customer-churn-prediction-2020
output_features:
- name: churn
type: binary
================================================
FILE: ludwig/datasets/configs/data_scientist_salary.yaml
================================================
version: 1.0
name: data_scientist_salary
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_data_scientists_salary_in_india_hackathon/train.csv
- https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_data_scientists_salary_in_india_hackathon/test.csv
sha256:
test.csv: 244c215f4a03cae4b107e76c7fe94269728450cabf44c943415211ce7d6437df
train.csv: 99d6aa80505ac1311e97f402d5723996119e859c7f3fce261350462148debe3d
train_filenames: train.csv
test_filenames: test.csv
description: |
The training data and test data comprise 19802 and 6601 samples, respectively, from the
Analytics India Annual Salary Study.
https://machinehack.com/hackathons/predict_the_data_scientists_salary_in_india_hackathon/overview
columns:
- name: experience
type: category
- name: job_description
type: text
- name: job_desig
type: category
- name: job_type
type: category
- name: key_skills
type: set
- name: location
type: category
- name: salary
type: category
output_features:
- name: salary
type: category
================================================
FILE: ludwig/datasets/configs/dbpedia.yaml
================================================
version: 1.0
name: dbpedia
download_urls: https://s3.amazonaws.com/fast-ai-nlp/dbpedia_csv.tgz
train_filenames: dbpedia_csv/train.csv
test_filenames: dbpedia_csv/test.csv
sha256:
dbpedia_csv.tgz: 42db5221ddedddb673a4cabcc5f3a7d869714c878bcfe4ba94b29d14aa38e417
description: |
The DBPedia Ontology dataset.
Details:
40,000 training samples and 5,000 testing samples for each of 14
nonoverlapping classes from DBpedia 2014.
Dataset source:
Character-level Convolutional Networks for Text Classification
Xiang Zhang et al., 2015
columns:
- name: label
type: category
- name: title
type: category
- name: content
type: text
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/electricity.yaml
================================================
version: 1.0
name: electricity
download_urls: https://raw.githubusercontent.com/nimz/electricity_demand/master/elecdemand.csv
sha256:
elecdemand.csv: 4fd3c8a4b8168f34703b55313c5341f8e8385810a54f1a1cdf6987c1904c9698
description: |
Electricity demand dataset. Half-hourly electricity demand in Victoria, Australia during 2014, along with
Melbourne temperatures.
Source textbook:
Forecasting: Principles and Practice
Rob J Hyndman and George Athanasopoulos
columns:
- name: Demand
type: number
- name: WorkDay
type: binary
- name: Temperature
type: number
output_features:
- name: Demand
type: number
================================================
FILE: ludwig/datasets/configs/ethos_binary.yaml
================================================
version: 1.1
name: ethos_binary
download_urls:
- https://raw.githubusercontent.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset/master/ethos/ethos_data/Ethos_Dataset_Binary.csv
sha256:
Ethos_Dataset_Binary.csv: 0cd0050c2592afcb5eca5876df485ca15cda9d7d16fe32c269857260fd10d96c
loader: ethos_binary.EthosBinaryLoader
description: |
The Ethos Hate Speech Dataset.
Source Paper:
ETHOS: an Online Hate Speech Detection Dataset
Ioannis Mollas and Zoe Chrysopoulou and Stamatis Karlos and
Grigorios Tsoumakas
columns:
- name: comment
type: text
- name: isHate
type: binary
output_features:
- name: isHate
type: binary
================================================
FILE: ludwig/datasets/configs/fake_job_postings2.yaml
================================================
version: 1.0
name: fake_job_postings2
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/fake_job_postings2/train.csv
- https://automl-mm-bench.s3.amazonaws.com/fake_job_postings2/test.csv
sha256:
test.csv: a5296f49129d440434e6274bb892a1320fe1dd4c26d5a1b085786d5ea1133dd8
train.csv: b6568e415ad49cb7bd23848dfbb8d381f9de590e133a5075abbf4c1a7c7c1711
train_filenames: train.csv
test_filenames: test.csv
description: |
This dataset contains 18K job descriptions out of which about 800 are fake.
The data consists of both textual information and meta-information about the jobs.
This dataset is "fake_job_postings2" in the AutoGluon paper.
https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction
columns:
- name: title
type: category
- name: salary_range
type: category
- name: description
type: text
- name: required_experience
type: category
- name: required_education
type: category
- name: fraudulent
type: binary
output_features:
- name: fraudulent
type: binary
================================================
FILE: ludwig/datasets/configs/fever.yaml
================================================
version: 1.0
name: fever
download_urls:
- https://fever.ai/download/fever/train.jsonl
- https://fever.ai/download/fever/paper_dev.jsonl
- https://fever.ai/download/fever/paper_test.jsonl
sha256:
train.jsonl: eba7e8f87076753f8494718b9a857827af7bf73e76c9e4b75420207d26e588b6
paper_test.jsonl: fb7b0280a0adc2302bbb29bfb7af37274fa585de3171bcf908f180642d11d88e
paper_dev.jsonl: 41158707810008747946bf23471e82df53e77a513524b9e3ec1c2e674ef5ef8c
train_filenames: train.jsonl
test_filenames: paper_test.jsonl
validation_filenames: paper_dev.jsonl
column_types:
evidence: str
description: |
FEVER: a Large-scale Dataset for Fact Extraction and VERification
columns:
- name: id
type: category
- name: verifiable
type: category
- name: label
type: category
- name: claim
type: text
- name: evidence
type: category
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/flickr8k.yaml
================================================
version: 1.0
name: flickr8k
download_urls:
- https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
- https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip
dataset_filenames: flickr8k_dataset.csv
preserve_paths: Flicker8k_Dataset
sha256:
Flickr8k_Dataset.zip: 61e4b111d32b24a55b69dafd91f4c3aec07391b7b9217face15dd35d517fe6de
Flickr8k_text.zip: 4992ddc8110e9aa49da5bf698522b0c8f11c448814a488584ee6bf040e5137e7
loader: flickr8k.Flickr8kLoader
description: |
A new benchmark collection for sentence-based image description and search,
consisting of 8,000 images that are each paired with five different
captions which provide clear descriptions of the salient entities and
events. The images were chosen from six different Flickr groups, and tend
not to contain any well-known people or locations, but were manually
selected to depict a variety of scenes and situations.
output_features:
- name: caption0
type: text
- name: caption1
type: text
- name: caption2
type: text
- name: caption3
type: text
- name: caption4
type: text
================================================
FILE: ludwig/datasets/configs/forest_cover.yaml
================================================
version: 1.0
name: forest_cover
download_urls: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
sha256:
covtype.data.gz: 614360d0257557dd1792834a85a1cdebfadc3c4f30b011d56afee7ffb5b15771
dataset_filenames: covtype.data
loader: forest_cover.ForestCoverLoader
description: |
The Forest Cover Type dataset.
Predicting forest cover type from cartographic variables only.
https://archive.ics.uci.edu/ml/datasets/covertype
columns:
- name: Elevation
type: number
- name: Aspect
type: number
- name: Slope
type: number
- name: Horizontal_Distance_To_Hydrology
type: number
- name: Vertical_Distance_To_Hydrology
type: number
- name: Horizontal_Distance_To_Roadways
type: number
- name: Hillshade_9am
type: number
- name: Hillshade_Noon
type: number
- name: Hillshade_3pm
type: number
- name: Horizontal_Distance_To_Fire_Points
type: number
- name: Wilderness_Area
type: category
- name: Soil_Type
type: category
- name: Cover_Type
type: category
output_features:
- name: Cover_Type
type: category
================================================
FILE: ludwig/datasets/configs/goemotions.yaml
================================================
version: 1.0
name: goemotions
download_urls:
- https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/train.tsv
- https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/dev.tsv
- https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/test.tsv
train_filenames: train.tsv
validation_filenames: dev.tsv
test_filenames: test.tsv
sha256:
train.tsv: 1c254a142be5c00e80d819b9ae1bbd36d94b2eeb8f4b1271846508d57e57d9c5
dev.tsv: 575489c079c9de1097062a01738f998590d6b7ead66dd1c9fd1d2ba01fd8bc62
test.tsv: 0587b2dd8b27b97352adbfc3fb083d46005c8946657fdc2b1ca8b1cc7f1f8be4
loader: goemotions.GoEmotionsLoader
description: |
GoEmotions: A Dataset for Fine-Grained Emotion Classification.
https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html
columns:
- name: text
type: text
- name: emotion_ids
type: category
- name: comment_id
type: category
output_features:
- name: emotion_ids
type: category
================================================
FILE: ludwig/datasets/configs/goodbooks_books.yaml
================================================
version: 1.0
name: goodbooks_books
download_urls:
- https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/goodbooks-10k.zip
sha256:
goodbooks-10k.zip: 261b97b56db61f3fb2ce5aadbb13704d30179fcc986c17ace665a0af9ed00731
dataset_filenames: books.csv
description: |
goodbooks_books is a multimodal dataset of 10K books, taken from the goodreads dataset.
The Goodbooks-10K dataset contains six million ratings for the ten thousand most popular books (those with the most ratings).
The dataset also contains:
books marked to read by the users
book metadata (author, year, etc.)
tags/shelves/genres
https://github.com/zygmuntz/goodbooks-10k
columns:
- name: book_id
type: category
- name: goodreads_book_id
type: category
- name: best_book_id
type: category
- name: work_id
type: category
- name: books_count
type: number
- name: isbn
type: category
- name: isbn13
type: category
- name: authors
type: category
- name: original_publication_year
type: category
- name: original_title
type: category
- name: title
type: category
- name: language_code
type: category
- name: average_rating
type: number
- name: ratings_count
type: number
- name: work_ratings_count
type: number
- name: work_text_reviews_count
type: number
- name: ratings_1
type: number
- name: ratings_2
type: number
- name: ratings_3
type: number
- name: ratings_4
type: number
- name: ratings_5
type: number
- name: image_url
type: image
- name: small_image_url
type: image
output_features:
- name: average_rating
type: number
- name: ratings_1
type: number
- name: ratings_2
type: number
- name: ratings_3
type: number
- name: ratings_4
type: number
- name: ratings_5
type: number
================================================
FILE: ludwig/datasets/configs/google_qa_answer_type_reason_explanation.yaml
================================================
version: 1.0
name: google_qa_answer_type_reason_explanation
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/train.pq
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/dev.pq
sha256:
train.pq: 92274286ffb759c96bfca77001c10eb323b3531db3a0e178813db9b82e80a12a
dev.pq: 2e66450215b94dc404eadc7dde83a1eabad9640d946863c298aa2d42c998ed84
train_filenames: train.pq
test_filenames: dev.pq
description: |
Google QUEST Q&A Labeling
Improving automated understanding of complex question answer content.
The data for this competition includes questions and answers from various StackExchange properties.
https://www.kaggle.com/c/google-quest-challenge/data
Note: this is the same dataset as `google_quest_qa`. It is duplicated here to have a one-to-one mapping
with the benchmarking datasets in https://arxiv.org/pdf/2111.02705.pdf
In this paper, the column `answer_type_reason_explanation` is used as the output feature.
columns:
- name: qa_id
type: category
- name: question_title
type: text
- name: question_body
type: text
- name: question_user_name
type: category
- name: question_user_page
type: category
- name: answer
type: text
- name: answer_user_name
type: category
- name: answer_user_page
type: category
- name: url
type: category
- name: category
type: category
- name: host
type: category
- name: question_asker_intent_understanding
type: number
- name: question_body_critical
type: number
- name: question_conversational
type: number
- name: question_expect_short_answer
type: number
- name: question_fact_seeking
type: number
- name: question_has_commonly_accepted_answer
type: number
- name: question_interestingness_others
type: number
- name: question_interestingness_self
type: number
- name: question_multi_intent
type: number
- name: question_not_really_a_question
type: number
- name: question_opinion_seeking
type: number
- name: question_type_choice
type: number
- name: question_type_compare
type: number
- name: question_type_consequence
type: number
- name: question_type_definition
type: number
- name: question_type_entity
type: number
- name: question_type_instructions
type: number
- name: question_type_procedure
type: number
- name: question_type_reason_explanation
type: number
- name: question_type_spelling
type: number
- name: question_well_written
type: number
- name: answer_helpful
type: number
- name: answer_level_of_information
type: number
- name: answer_plausible
type: number
- name: answer_relevance
type: number
- name: answer_satisfaction
type: number
- name: answer_type_instructions
type: number
- name: answer_type_procedure
type: number
- name: answer_type_reason_explanation
type: number
- name: answer_well_written
type: number
output_features:
- name: answer_type_reason_explanation
type: number
================================================
FILE: ludwig/datasets/configs/google_qa_question_type_reason_explanation.yaml
================================================
version: 1.0
name: google_qa_question_type_reason_explanation
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/train.pq
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/dev.pq
sha256:
train.pq: 92274286ffb759c96bfca77001c10eb323b3531db3a0e178813db9b82e80a12a
dev.pq: 2e66450215b94dc404eadc7dde83a1eabad9640d946863c298aa2d42c998ed84
train_filenames: train.pq
test_filenames: dev.pq
description: |
Google QUEST Q&A Labeling
Improving automated understanding of complex question answer content.
The data for this competition includes questions and answers from various StackExchange properties.
https://www.kaggle.com/c/google-quest-challenge/data
Note: this is the same dataset as `google_quest_qa`. It is duplicated here to have a one-to-one mapping
with the benchmarking datasets in https://arxiv.org/pdf/2111.02705.pdf
In this paper, the column `question_type_reason_explanation` is used as the output feature.
columns:
- name: qa_id
type: category
- name: question_title
type: text
- name: question_body
type: text
- name: question_user_name
type: category
- name: question_user_page
type: category
- name: answer
type: text
- name: answer_user_name
type: category
- name: answer_user_page
type: category
- name: url
type: category
- name: category
type: category
- name: host
type: category
- name: question_asker_intent_understanding
type: number
- name: question_body_critical
type: number
- name: question_conversational
type: number
- name: question_expect_short_answer
type: number
- name: question_fact_seeking
type: number
- name: question_has_commonly_accepted_answer
type: number
- name: question_interestingness_others
type: number
- name: question_interestingness_self
type: number
- name: question_multi_intent
type: number
- name: question_not_really_a_question
type: number
- name: question_opinion_seeking
type: number
- name: question_type_choice
type: number
- name: question_type_compare
type: number
- name: question_type_consequence
type: number
- name: question_type_definition
type: number
- name: question_type_entity
type: number
- name: question_type_instructions
type: number
- name: question_type_procedure
type: number
- name: question_type_reason_explanation
type: number
- name: question_type_spelling
type: number
- name: question_well_written
type: number
- name: answer_helpful
type: number
- name: answer_level_of_information
type: number
- name: answer_plausible
type: number
- name: answer_relevance
type: number
- name: answer_satisfaction
type: number
- name: answer_type_instructions
type: number
- name: answer_type_procedure
type: number
- name: answer_type_reason_explanation
type: number
- name: answer_well_written
type: number
output_features:
- name: question_type_reason_explanation
type: number
================================================
FILE: ludwig/datasets/configs/google_quest_qa.yaml
================================================
version: 1.0
name: google_quest_qa
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/train.pq
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/dev.pq
- https://automl-mm-bench.s3.amazonaws.com/google_quest_qa/test.pq
sha256:
test.pq: cb1bb5f32374d83ad4ef7feb4e443c9376cdd919cda40057732ef500e9a4ecf3
train.pq: 92274286ffb759c96bfca77001c10eb323b3531db3a0e178813db9b82e80a12a
dev.pq: 2e66450215b94dc404eadc7dde83a1eabad9640d946863c298aa2d42c998ed84
train_filenames: train.pq
validation_filenames: dev.pq
test_filenames: test.pq
description: |
Google QUEST Q&A Labeling
Improving automated understanding of complex question answer content.
The data for this competition includes questions and answers from various StackExchange properties.
https://www.kaggle.com/c/google-quest-challenge/data
columns:
- name: qa_id
type: category
- name: question_title
type: text
- name: question_body
type: text
- name: question_user_name
type: category
- name: question_user_page
type: category
- name: answer
type: text
- name: answer_user_name
type: category
- name: answer_user_page
type: category
- name: url
type: category
- name: category
type: category
- name: host
type: category
- name: question_asker_intent_understanding
type: number
- name: question_body_critical
type: number
- name: question_conversational
type: number
- name: question_expect_short_answer
type: number
- name: question_fact_seeking
type: number
- name: question_has_commonly_accepted_answer
type: number
- name: question_interestingness_others
type: number
- name: question_interestingness_self
type: number
- name: question_multi_intent
type: number
- name: question_not_really_a_question
type: number
- name: question_opinion_seeking
type: number
- name: question_type_choice
type: number
- name: question_type_compare
type: number
- name: question_type_consequence
type: number
- name: question_type_definition
type: number
- name: question_type_entity
type: number
- name: question_type_instructions
type: number
- name: question_type_procedure
type: number
- name: question_type_reason_explanation
type: number
- name: question_type_spelling
type: number
- name: question_well_written
type: number
- name: answer_helpful
type: number
- name: answer_level_of_information
type: number
- name: answer_plausible
type: number
- name: answer_relevance
type: number
- name: answer_satisfaction
type: number
- name: answer_type_instructions
type: number
- name: answer_type_procedure
type: number
- name: answer_type_reason_explanation
type: number
- name: answer_well_written
type: number
output_features:
- name: question_type_reason_explanation
type: number
================================================
FILE: ludwig/datasets/configs/higgs.yaml
================================================
version: 1.0
name: higgs
download_urls: https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
sha256:
HIGGS.csv.gz: ea302c18164d4e3d916a1e2e83a9a8d07069fa6ebc7771e4c0540d54e593b698
column_types:
label: int32
loader: higgs.HiggsLoader
description: |
The Higgs Boson dataset.
This is a classification problem to distinguish between a signal process
which produces Higgs bosons and a background process which does not.
https://archive.ics.uci.edu/ml/datasets/HIGGS
columns:
- name: label
type: binary
- name: lepton_pT
type: number
- name: lepton_eta
type: number
- name: lepton_phi
type: number
- name: missing_energy_magnitude
type: number
- name: missing_energy_phi
type: number
- name: jet_1_pt
type: number
- name: jet_1_eta
type: number
- name: jet_1_phi
type: number
- name: jet_1_b-tag
type: number
- name: jet_2_pt
type: number
- name: jet_2_eta
type: number
- name: jet_2_phi
type: number
- name: jet_2_b-tag
type: number
- name: jet_3_pt
type: number
- name: jet_3_eta
type: number
- name: jet_3_phi
type: number
- name: jet_3_b-tag
type: number
- name: jet_4_pt
type: number
- name: jet_4_eta
type: number
- name: jet_4_phi
type: number
- name: jet_4_b-tag
type: number
- name: m_jj
type: number
- name: m_jjj
type: number
- name: m_lv
type: number
- name: m_jlv
type: number
- name: m_bb
type: number
- name: m_wbb
type: number
- name: m_wwbb
type: number
output_features:
- name: label
type: binary
================================================
FILE: ludwig/datasets/configs/hugging_face.yaml
================================================
version: 1.0
name: hugging_face
loader: hugging_face.HFLoader
description: |
Hugging Face Datasets
================================================
FILE: ludwig/datasets/configs/ieee_fraud.yaml
================================================
version: 1.0
name: ieee_fraud
kaggle_competition: ieee-fraud-detection
archive_filenames: ieee-fraud-detection.zip
sha256:
ieee-fraud-detection.zip: 4cc646da09d0a9b265983ffed775b1f9ee15af5266586df610e04d6adae0b829
train_filenames:
- train_identity.csv
- train_transaction.csv
test_filenames:
- test_identity.csv
- test_transaction.csv
loader: ieee_fraud.IEEEFraudLoader
description: |
The IEEE-CIS Fraud Detection Dataset
https://www.kaggle.com/c/ieee-fraud-detection/overview
output_features:
- name: isFraud
type: binary
================================================
FILE: ludwig/datasets/configs/imbalanced_insurance.yaml
================================================
version: 1.0
name: imbalanced_insurance
kaggle_dataset_id: arashnic/imbalanced-data-practice
archive_filenames: imbalanced-data-practice.zip
sha256:
imbalanced-data-practice.zip: 33c7d15cbdb7cc151c1d5e920a8a613b015c19222f90d4eac04ca8cfc5416847
dataset_filenames: aug_train.csv
loader: split_loaders.RandomSplitLoader
description: |
Health Insurance Cross Sell Prediction
Predict which Health Insurance owners will be interested in Vehicle Insurance
https://www.kaggle.com/datasets/arashnic/imbalanced-data-practice
columns:
- name: id
type: category
- name: Gender
type: binary
- name: Age
type: number
- name: Driving_License
type: binary
- name: Region_Code
type: category
- name: Previously_Insured
type: binary
- name: Vehicle_Age
type: category
- name: Vehicle_Damage
type: binary
- name: Annual_Premium
type: number
- name: Policy_Sales_Channel
type: category
- name: Vintage
type: number
- name: Response
type: binary
output_features:
- name: Response
type: binary
================================================
FILE: ludwig/datasets/configs/imdb.yaml
================================================
version: 1.0
name: imdb
kaggle_dataset_id: lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
archive_filenames: imdb-dataset-of-50k-movie-reviews.zip
sha256:
imdb-dataset-of-50k-movie-reviews.zip: 73a235bc5fc4df57bb5d517afa480fe6bfd4e2afc25dc5e5867fc87f2d25614d
description: |
IMDB dataset of 50K movie reviews for natural language processing or text analytics.
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
columns:
- name: review
type: text
- name: sentiment
type: category
output_features:
- name: sentiment
type: binary
================================================
FILE: ludwig/datasets/configs/imdb_genre_prediction.yaml
================================================
version: 1.0
name: imdb_genre_prediction
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/imdb_genre_prediction/train.csv
- https://automl-mm-bench.s3.amazonaws.com/imdb_genre_prediction/test.csv
sha256:
test.csv: 5bca7b6ca34f4057e2a4920d6034f481055bd03061bb0128c87d6c99a6b4661f
train.csv: b63f1f6fcad17f644d9266891a01d0f0e1187c277ccf6eecb80af72b92b0b621
train_filenames: train.csv
test_filenames: test.csv
description: |
A dataset of the 1,000 most popular movies on IMDB in the last 10 years. The data points included are:
Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascore
https://www.kaggle.com/PromptCloudHQ/imdb-data
columns:
- name: Rank
type: number
- name: Title
type: category
- name: Description
type: text
- name: Director
type: category
- name: Actors
type: set
- name: Year
type: category
- name: Runtime (Minutes)
type: number
- name: Rating
type: number
- name: Votes
type: number
- name: Revenue (Millions)
type: number
- name: Metascore
type: number
- name: Genre_is_Drama
type: binary
output_features:
- name: Genre_is_Drama
type: binary
================================================
FILE: ludwig/datasets/configs/insurance_lite.yaml
================================================
version: 1.0
name: insurance_lite
kaggle_dataset_id: infernape/fast-furious-and-insured
archive_filenames: fast-furious-and-insured.zip
sha256:
fast-furious-and-insured.zip: 3b88ada517aa88d9c9187121d7ef42f4b5539808677a2b0827b989ca0fa19600
dataset_filenames: Fast_Furious_Insured/train.csv
preserve_paths: Fast_Furious_Insured
loader: insurance_lite.InsuranceLiteLoader
description: |
The dataset consists of parameters such as the images of damaged cars,
the price of the cars and their insurance claim, and the like.
Predict the insurance claim for the cars that are provided in the dataset.
columns:
- name: image_path
type: image
- name: insurance_company
type: category
- name: cost_of_vehicle
type: number
- name: min_coverage
type: number
- name: expiry_date
type: date
- name: max_coverage
type: number
- name: condition
type: binary
- name: amount
type: number
output_features:
- name: amount
type: number
================================================
FILE: ludwig/datasets/configs/iris.yaml
================================================
version: 1.0
name: iris
download_urls: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
sha256:
iris.data: 6f608b71a7317216319b4d27b4d9bc84e6abd734eda7872b71a458569e2656c0
description: |
Iris Dataset
https://archive.ics.uci.edu/ml/datasets/Iris
columns:
- name: sepal_length_cm
type: number
- name: sepal_width_cm
type: number
- name: petal_length_cm
type: number
- name: petal_width_cm
type: number
- name: class
type: category
output_features:
- name: class
type: category
================================================
FILE: ludwig/datasets/configs/irony.yaml
================================================
version: 1.0
name: irony
download_urls: https://raw.githubusercontent.com/bwallace/ACL-2014-irony/master/irony-labeled.csv
sha256:
irony-labeled.csv: 11f4d0964bd9c5c8363de2920612f5d926a4e6b3a8ab9187da2c33cfc0fdd02b
description: |
The Reddit Irony dataset.
Source Paper: Humans Require Context to Infer Ironic Intent (so Computers Probably do, too)
Byron C Wallace, Do Kook Choe, Laura Kertz, and Eugene Charniak
columns:
- name: comment_text
type: text
- name: label
type: binary
output_features:
- name: label
type: binary
================================================
FILE: ludwig/datasets/configs/jc_penney_products.yaml
================================================
version: 1.0
name: jc_penney_products
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/jc_penney_products/train.csv
- https://automl-mm-bench.s3.amazonaws.com/jc_penney_products/test.csv
sha256:
test.csv: 458fb13b07701897fbc0d88481823b90e884e92a42e65eeba816cdf3523b2e85
train.csv: e9e3d3da627dc544d01f4c27b1d023288c68e55ce2db2593fb7b2268a6b9b020
train_filenames: train.csv
test_filenames: test.csv
description: |
JCPenney products
20,000 product listings from JCPenney
https://www.kaggle.com/PromptCloudHQ/all-jc-penny-products
columns:
- name: name_title
type: category
- name: description
type: text
- name: sale_price
type: number
- name: average_product_rating
type: number
- name: brand
type: category
- name: total_number_reviews
type: number
output_features:
- name: sale_price
type: number
================================================
FILE: ludwig/datasets/configs/jigsaw_unintended_bias.yaml
================================================
version: 1.0
name: jigsaw_unintended_bias
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/jigsaw_unintended_bias/train.pq
- https://automl-mm-bench.s3.amazonaws.com/jigsaw_unintended_bias/dev.pq
- https://automl-mm-bench.s3.amazonaws.com/jigsaw_unintended_bias/test.pq
sha256:
test.pq: e9f3fd6fa83ddea2af8d21e93eb677b2fa5686c9b8ae38e6293f7c3306f66fad
train.pq: 30bedd5bbd5b2277b8bffa4ed3a02ce6ef7c838aa5c1338908b5ad599a6a9888
dev.pq: 57e1e3a06733fb83ad9ca46839ed8afd7d670e5e5f5c7f0026b748d760457d57
train_filenames: train.pq
validation_filenames: dev.pq
test_filenames: test.pq
description: |
A dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias.
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
columns:
- name: id
type: category
- name: target
type: binary
- name: comment_text
type: text
- name: severe_toxicity
type: number
- name: obscene
type: number
- name: identity_attack
type: number
- name: insult
type: number
- name: threat
type: number
- name: asian
type: number
- name: atheist
type: number
- name: bisexual
type: number
- name: black
type: number
- name: buddhist
type: number
- name: christian
type: number
- name: female
type: number
- name: heterosexual
type: number
- name: hindu
type: number
- name: homosexual_gay_or_lesbian
type: number
- name: intellectual_or_learning_disability
type: number
- name: jewish
type: number
- name: latino
type: number
- name: male
type: number
- name: muslim
type: number
- name: other_disability
type: number
- name: other_gender
type: number
- name: other_race_or_ethnicity
type: number
- name: other_religion
type: number
- name: other_sexual_orientation
type: number
- name: physical_disability
type: number
- name: psychiatric_or_mental_illness
type: number
- name: transgender
type: number
- name: white
type: number
- name: created_date
type: date
- name: publication_id
type: category
- name: parent_id
type: category
- name: article_id
type: category
- name: rating
type: category
- name: funny
type: number
- name: wow
type: number
- name: sad
type: number
- name: likes
type: number
- name: disagree
type: number
- name: sexual_explicit
type: number
- name: identity_annotator_count
type: number
- name: toxicity_annotator_count
type: number
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/jigsaw_unintended_bias100k.yaml
================================================
version: 1.0
name: jigsaw_unintended_bias100K
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/jigsaw_unintended_bias100K/train.pq
- https://automl-mm-bench.s3.amazonaws.com/jigsaw_unintended_bias100K/test.pq
sha256:
test.pq: f7a0ec60ac89ffdb94919bf95e514057588a444c90ebdcb8ac90dfb0bfec3d48
train.pq: 48916c037b0a20167f6e9176cc1eedcb0e6ef942beeedb7dc02f19dfebac0229
train_filenames: train.pq
test_filenames: test.pq
description: |
A dataset labeled for identity mentions, used for optimizing a metric designed to measure unintended bias.
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
columns:
- name: id
type: category
- name: target
type: binary
- name: comment_text
type: text
- name: severe_toxicity
type: number
- name: obscene
type: number
- name: identity_attack
type: number
- name: insult
type: number
- name: threat
type: number
- name: asian
type: number
- name: atheist
type: number
- name: bisexual
type: number
- name: black
type: number
- name: buddhist
type: number
- name: christian
type: number
- name: female
type: number
- name: heterosexual
type: number
- name: hindu
type: number
- name: homosexual_gay_or_lesbian
type: number
- name: intellectual_or_learning_disability
type: number
- name: jewish
type: number
- name: latino
type: number
- name: male
type: number
- name: muslim
type: number
- name: other_disability
type: number
- name: other_gender
type: number
- name: other_race_or_ethnicity
type: number
- name: other_religion
type: number
- name: other_sexual_orientation
type: number
- name: physical_disability
type: number
- name: psychiatric_or_mental_illness
type: number
- name: transgender
type: number
- name: white
type: number
- name: created_date
type: date
- name: publication_id
type: category
- name: parent_id
type: category
- name: article_id
type: category
- name: rating
type: category
- name: funny
type: number
- name: wow
type: number
- name: sad
type: number
- name: likes
type: number
- name: disagree
type: number
- name: sexual_explicit
type: number
- name: identity_annotator_count
type: number
- name: toxicity_annotator_count
type: number
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/kdd_appetency.yaml
================================================
version: 1.0
name: kdd_appetency
download_urls:
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_train.data.zip
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_test.data.zip
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_train_appetency.labels
- https://raw.githubusercontent.com/catboost/benchmarks/master/quality_benchmarks/prepare_appetency_churn_upselling/appetency/stratified_train_idx_appetency.txt
- https://raw.githubusercontent.com/catboost/benchmarks/master/quality_benchmarks/prepare_appetency_churn_upselling/appetency/stratified_test_idx_appetency.txt
sha256:
orange_small_test.data.zip: 440ac8a350144c14f4d6947c096ad675ee84aa27b4b742071662696e333cec53
orange_small_train.data.zip: 31ccb810bdbb71c16e079326443166dc3dfbf73cd358fc4a4ce7440fb1bc6040
orange_small_train_appetency.labels: edbfa40e7513804cf25c3f8b3c8f4a6cf5c77116cffc2f87ef770351250a963c
stratified_train_idx_appetency.txt: 9c6bf7da6209653e13d9a1d2ef90e4afafe0ecac0eb843c8025816a445c625d9
stratified_test_idx_appetency.txt: b80fb8dcf43cd028f4b8affeab65299d580a7e5432ebbe639527dc8177f8764a
dataset_filenames: orange_small_train.data
loader: kdd_loader.KDDAppetencyLoader
description: |
The KDD Cup 2009 Appetency dataset.
https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
columns:
- name: Var1
type: number
- name: Var2
type: number
- name: Var3
type: number
- name: Var4
type: number
- name: Var5
type: number
- name: Var6
type: number
- name: Var7
type: number
- name: Var8
type: number
- name: Var9
type: number
- name: Var10
type: number
- name: Var11
type: number
- name: Var12
type: number
- name: Var13
type: number
- name: Var14
type: number
- name: Var15
type: number
- name: Var16
type: number
- name: Var17
type: number
- name: Var18
type: number
- name: Var19
type: number
- name: Var20
type: number
- name: Var21
type: number
- name: Var22
type: number
- name: Var23
type: number
- name: Var24
type: number
- name: Var25
type: number
- name: Var26
type: number
- name: Var27
type: number
- name: Var28
type: number
- name: Var29
type: number
- name: Var30
type: number
- name: Var31
type: number
- name: Var32
type: number
- name: Var33
type: number
- name: Var34
type: number
- name: Var35
type: number
- name: Var36
type: number
- name: Var37
type: number
- name: Var38
type: number
- name: Var39
type: number
- name: Var40
type: number
- name: Var41
type: number
- name: Var42
type: number
- name: Var43
type: number
- name: Var44
type: number
- name: Var45
type: number
- name: Var46
type: number
- name: Var47
type: number
- name: Var48
type: number
- name: Var49
type: number
- name: Var50
type: number
- name: Var51
type: number
- name: Var52
type: number
- name: Var53
type: number
- name: Var54
type: number
- name: Var55
type: number
- name: Var56
type: number
- name: Var57
type: number
- name: Var58
type: number
- name: Var59
type: number
- name: Var60
type: number
- name: Var61
type: number
- name: Var62
type: number
- name: Var63
type: number
- name: Var64
type: number
- name: Var65
type: number
- name: Var66
type: number
- name: Var67
type: number
- name: Var68
type: number
- name: Var69
type: number
- name: Var70
type: number
- name: Var71
type: number
- name: Var72
type: number
- name: Var73
type: number
- name: Var74
type: number
- name: Var75
type: number
- name: Var76
type: number
- name: Var77
type: number
- name: Var78
type: number
- name: Var79
type: number
- name: Var80
type: number
- name: Var81
type: number
- name: Var82
type: number
- name: Var83
type: number
- name: Var84
type: number
- name: Var85
type: number
- name: Var86
type: number
- name: Var87
type: number
- name: Var88
type: number
- name: Var89
type: number
- name: Var90
type: number
- name: Var91
type: number
- name: Var92
type: number
- name: Var93
type: number
- name: Var94
type: number
- name: Var95
type: number
- name: Var96
type: number
- name: Var97
type: number
- name: Var98
type: number
- name: Var99
type: number
- name: Var100
type: number
- name: Var101
type: number
- name: Var102
type: number
- name: Var103
type: number
- name: Var104
type: number
- name: Var105
type: number
- name: Var106
type: number
- name: Var107
type: number
- name: Var108
type: number
- name: Var109
type: number
- name: Var110
type: number
- name: Var111
type: number
- name: Var112
type: number
- name: Var113
type: number
- name: Var114
type: number
- name: Var115
type: number
- name: Var116
type: number
- name: Var117
type: number
- name: Var118
type: number
- name: Var119
type: number
- name: Var120
type: number
- name: Var121
type: number
- name: Var122
type: number
- name: Var123
type: number
- name: Var124
type: number
- name: Var125
type: number
- name: Var126
type: number
- name: Var127
type: number
- name: Var128
type: number
- name: Var129
type: number
- name: Var130
type: number
- name: Var131
type: number
- name: Var132
type: number
- name: Var133
type: number
- name: Var134
type: number
- name: Var135
type: number
- name: Var136
type: number
- name: Var137
type: number
- name: Var138
type: number
- name: Var139
type: number
- name: Var140
type: number
- name: Var141
type: number
- name: Var142
type: number
- name: Var143
type: number
- name: Var144
type: number
- name: Var145
type: number
- name: Var146
type: number
- name: Var147
type: number
- name: Var148
type: number
- name: Var149
type: number
- name: Var150
type: number
- name: Var151
type: number
- name: Var152
type: number
- name: Var153
type: number
- name: Var154
type: number
- name: Var155
type: number
- name: Var156
type: number
- name: Var157
type: number
- name: Var158
type: number
- name: Var159
type: number
- name: Var160
type: number
- name: Var161
type: number
- name: Var162
type: number
- name: Var163
type: number
- name: Var164
type: number
- name: Var165
type: number
- name: Var166
type: number
- name: Var167
type: number
- name: Var168
type: number
- name: Var169
type: number
- name: Var170
type: number
- name: Var171
type: number
- name: Var172
type: number
- name: Var173
type: number
- name: Var174
type: number
- name: Var175
type: number
- name: Var176
type: number
- name: Var177
type: number
- name: Var178
type: number
- name: Var179
type: number
- name: Var180
type: number
- name: Var181
type: number
- name: Var182
type: number
- name: Var183
type: number
- name: Var184
type: number
- name: Var185
type: number
- name: Var186
type: number
- name: Var187
type: number
- name: Var188
type: number
- name: Var189
type: number
- name: Var190
type: number
- name: Var191
type: category
- name: Var192
type: category
- name: Var193
type: category
- name: Var194
type: category
- name: Var195
type: category
- name: Var196
type: category
- name: Var197
type: category
- name: Var198
type: category
- name: Var199
type: category
- name: Var200
type: category
- name: Var201
type: category
- name: Var202
type: category
- name: Var203
type: category
- name: Var204
type: category
- name: Var205
type: category
- name: Var206
type: category
- name: Var207
type: category
- name: Var208
type: category
- name: Var209
type: number
- name: Var210
type: category
- name: Var211
type: category
- name: Var212
type: category
- name: Var213
type: category
- name: Var214
type: category
- name: Var215
type: category
- name: Var216
type: category
- name: Var217
type: category
- name: Var218
type: category
- name: Var219
type: category
- name: Var220
type: category
- name: Var221
type: category
- name: Var222
type: category
- name: Var223
type: category
- name: Var224
type: category
- name: Var225
type: category
- name: Var226
type: category
- name: Var227
type: category
- name: Var228
type: category
- name: Var229
type: category
- name: Var230
type: number
- name: target
type: binary
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/kdd_churn.yaml
================================================
version: 1.0
name: kdd_churn
download_urls:
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_train.data.zip
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_test.data.zip
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_train_churn.labels
- https://raw.githubusercontent.com/catboost/benchmarks/master/quality_benchmarks/prepare_appetency_churn_upselling/churn/stratified_train_idx_churn.txt
- https://raw.githubusercontent.com/catboost/benchmarks/master/quality_benchmarks/prepare_appetency_churn_upselling/churn/stratified_test_idx_churn.txt
sha256:
orange_small_test.data.zip: 440ac8a350144c14f4d6947c096ad675ee84aa27b4b742071662696e333cec53
orange_small_train.data.zip: 31ccb810bdbb71c16e079326443166dc3dfbf73cd358fc4a4ce7440fb1bc6040
orange_small_train_churn.labels: fe8891cc574bd55a214514e522a5bed1eec2c3f347a49a36e51620009e7b6f5b
stratified_train_idx_churn.txt: 34f9880959ced6f668b25f879fdd388b3826efeca0df03f5a2a5494ce6795406
stratified_test_idx_churn.txt: 1675a62cd49c43535eedee3b746f65f8c6a4ebd7f4d0da04e442fd658a408042
dataset_filenames: orange_small_train.data
loader: kdd_loader.KDDChurnLoader
description: |
The KDD Cup 2009 Churn dataset.
https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
columns:
- name: Var1
type: number
- name: Var2
type: number
- name: Var3
type: number
- name: Var4
type: number
- name: Var5
type: number
- name: Var6
type: number
- name: Var7
type: number
- name: Var8
type: number
- name: Var9
type: number
- name: Var10
type: number
- name: Var11
type: number
- name: Var12
type: number
- name: Var13
type: number
- name: Var14
type: number
- name: Var15
type: number
- name: Var16
type: number
- name: Var17
type: number
- name: Var18
type: number
- name: Var19
type: number
- name: Var20
type: number
- name: Var21
type: number
- name: Var22
type: number
- name: Var23
type: number
- name: Var24
type: number
- name: Var25
type: number
- name: Var26
type: number
- name: Var27
type: number
- name: Var28
type: number
- name: Var29
type: number
- name: Var30
type: number
- name: Var31
type: number
- name: Var32
type: number
- name: Var33
type: number
- name: Var34
type: number
- name: Var35
type: number
- name: Var36
type: number
- name: Var37
type: number
- name: Var38
type: number
- name: Var39
type: number
- name: Var40
type: number
- name: Var41
type: number
- name: Var42
type: number
- name: Var43
type: number
- name: Var44
type: number
- name: Var45
type: number
- name: Var46
type: number
- name: Var47
type: number
- name: Var48
type: number
- name: Var49
type: number
- name: Var50
type: number
- name: Var51
type: number
- name: Var52
type: number
- name: Var53
type: number
- name: Var54
type: number
- name: Var55
type: number
- name: Var56
type: number
- name: Var57
type: number
- name: Var58
type: number
- name: Var59
type: number
- name: Var60
type: number
- name: Var61
type: number
- name: Var62
type: number
- name: Var63
type: number
- name: Var64
type: number
- name: Var65
type: number
- name: Var66
type: number
- name: Var67
type: number
- name: Var68
type: number
- name: Var69
type: number
- name: Var70
type: number
- name: Var71
type: number
- name: Var72
type: number
- name: Var73
type: number
- name: Var74
type: number
- name: Var75
type: number
- name: Var76
type: number
- name: Var77
type: number
- name: Var78
type: number
- name: Var79
type: number
- name: Var80
type: number
- name: Var81
type: number
- name: Var82
type: number
- name: Var83
type: number
- name: Var84
type: number
- name: Var85
type: number
- name: Var86
type: number
- name: Var87
type: number
- name: Var88
type: number
- name: Var89
type: number
- name: Var90
type: number
- name: Var91
type: number
- name: Var92
type: number
- name: Var93
type: number
- name: Var94
type: number
- name: Var95
type: number
- name: Var96
type: number
- name: Var97
type: number
- name: Var98
type: number
- name: Var99
type: number
- name: Var100
type: number
- name: Var101
type: number
- name: Var102
type: number
- name: Var103
type: number
- name: Var104
type: number
- name: Var105
type: number
- name: Var106
type: number
- name: Var107
type: number
- name: Var108
type: number
- name: Var109
type: number
- name: Var110
type: number
- name: Var111
type: number
- name: Var112
type: number
- name: Var113
type: number
- name: Var114
type: number
- name: Var115
type: number
- name: Var116
type: number
- name: Var117
type: number
- name: Var118
type: number
- name: Var119
type: number
- name: Var120
type: number
- name: Var121
type: number
- name: Var122
type: number
- name: Var123
type: number
- name: Var124
type: number
- name: Var125
type: number
- name: Var126
type: number
- name: Var127
type: number
- name: Var128
type: number
- name: Var129
type: number
- name: Var130
type: number
- name: Var131
type: number
- name: Var132
type: number
- name: Var133
type: number
- name: Var134
type: number
- name: Var135
type: number
- name: Var136
type: number
- name: Var137
type: number
- name: Var138
type: number
- name: Var139
type: number
- name: Var140
type: number
- name: Var141
type: number
- name: Var142
type: number
- name: Var143
type: number
- name: Var144
type: number
- name: Var145
type: number
- name: Var146
type: number
- name: Var147
type: number
- name: Var148
type: number
- name: Var149
type: number
- name: Var150
type: number
- name: Var151
type: number
- name: Var152
type: number
- name: Var153
type: number
- name: Var154
type: number
- name: Var155
type: number
- name: Var156
type: number
- name: Var157
type: number
- name: Var158
type: number
- name: Var159
type: number
- name: Var160
type: number
- name: Var161
type: number
- name: Var162
type: number
- name: Var163
type: number
- name: Var164
type: number
- name: Var165
type: number
- name: Var166
type: number
- name: Var167
type: number
- name: Var168
type: number
- name: Var169
type: number
- name: Var170
type: number
- name: Var171
type: number
- name: Var172
type: number
- name: Var173
type: number
- name: Var174
type: number
- name: Var175
type: number
- name: Var176
type: number
- name: Var177
type: number
- name: Var178
type: number
- name: Var179
type: number
- name: Var180
type: number
- name: Var181
type: number
- name: Var182
type: number
- name: Var183
type: number
- name: Var184
type: number
- name: Var185
type: number
- name: Var186
type: number
- name: Var187
type: number
- name: Var188
type: number
- name: Var189
type: number
- name: Var190
type: number
- name: Var191
type: category
- name: Var192
type: category
- name: Var193
type: category
- name: Var194
type: category
- name: Var195
type: category
- name: Var196
type: category
- name: Var197
type: category
- name: Var198
type: category
- name: Var199
type: category
- name: Var200
type: category
- name: Var201
type: category
- name: Var202
type: category
- name: Var203
type: category
- name: Var204
type: category
- name: Var205
type: category
- name: Var206
type: category
- name: Var207
type: category
- name: Var208
type: category
- name: Var209
type: number
- name: Var210
type: category
- name: Var211
type: category
- name: Var212
type: category
- name: Var213
type: category
- name: Var214
type: category
- name: Var215
type: category
- name: Var216
type: category
- name: Var217
type: category
- name: Var218
type: category
- name: Var219
type: category
- name: Var220
type: category
- name: Var221
type: category
- name: Var222
type: category
- name: Var223
type: category
- name: Var224
type: category
- name: Var225
type: category
- name: Var226
type: category
- name: Var227
type: category
- name: Var228
type: category
- name: Var229
type: category
- name: Var230
type: number
- name: target
type: binary
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/kdd_upselling.yaml
================================================
version: 1.0
name: kdd_upselling
download_urls:
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_train.data.zip
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_test.data.zip
- https://kdd.org/cupfiles/KDDCupData/2009/orange_small_train_upselling.labels
- https://raw.githubusercontent.com/catboost/benchmarks/master/quality_benchmarks/prepare_appetency_churn_upselling/upselling/stratified_train_idx_upselling.txt
- https://raw.githubusercontent.com/catboost/benchmarks/master/quality_benchmarks/prepare_appetency_churn_upselling/upselling/stratified_test_idx_upselling.txt
sha256:
orange_small_test.data.zip: 440ac8a350144c14f4d6947c096ad675ee84aa27b4b742071662696e333cec53
orange_small_train.data.zip: 31ccb810bdbb71c16e079326443166dc3dfbf73cd358fc4a4ce7440fb1bc6040
orange_small_train_upselling.labels: 86effe68394fe1ab21c2d855f74adf70f442990aa95dfe5c97340fc924440e68
stratified_train_idx_upselling.txt: 659060717872177d607fbb157e8d2142c719912771d1716da11ccdd6ff915a05
stratified_test_idx_upselling.txt: 64cb66ef559b4ccff096e0d7c150c7d019321ffd6cef2362c195a56c56effcb7
dataset_filenames: orange_small_train.data
loader: kdd_loader.KDDUpsellingLoader
description: |
The KDD Cup 2009 Upselling dataset.
https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
columns:
- name: Var1
type: number
- name: Var2
type: number
- name: Var3
type: number
- name: Var4
type: number
- name: Var5
type: number
- name: Var6
type: number
- name: Var7
type: number
- name: Var8
type: number
- name: Var9
type: number
- name: Var10
type: number
- name: Var11
type: number
- name: Var12
type: number
- name: Var13
type: number
- name: Var14
type: number
- name: Var15
type: number
- name: Var16
type: number
- name: Var17
type: number
- name: Var18
type: number
- name: Var19
type: number
- name: Var20
type: number
- name: Var21
type: number
- name: Var22
type: number
- name: Var23
type: number
- name: Var24
type: number
- name: Var25
type: number
- name: Var26
type: number
- name: Var27
type: number
- name: Var28
type: number
- name: Var29
type: number
- name: Var30
type: number
- name: Var31
type: number
- name: Var32
type: number
- name: Var33
type: number
- name: Var34
type: number
- name: Var35
type: number
- name: Var36
type: number
- name: Var37
type: number
- name: Var38
type: number
- name: Var39
type: number
- name: Var40
type: number
- name: Var41
type: number
- name: Var42
type: number
- name: Var43
type: number
- name: Var44
type: number
- name: Var45
type: number
- name: Var46
type: number
- name: Var47
type: number
- name: Var48
type: number
- name: Var49
type: number
- name: Var50
type: number
- name: Var51
type: number
- name: Var52
type: number
- name: Var53
type: number
- name: Var54
type: number
- name: Var55
type: number
- name: Var56
type: number
- name: Var57
type: number
- name: Var58
type: number
- name: Var59
type: number
- name: Var60
type: number
- name: Var61
type: number
- name: Var62
type: number
- name: Var63
type: number
- name: Var64
type: number
- name: Var65
type: number
- name: Var66
type: number
- name: Var67
type: number
- name: Var68
type: number
- name: Var69
type: number
- name: Var70
type: number
- name: Var71
type: number
- name: Var72
type: number
- name: Var73
type: number
- name: Var74
type: number
- name: Var75
type: number
- name: Var76
type: number
- name: Var77
type: number
- name: Var78
type: number
- name: Var79
type: number
- name: Var80
type: number
- name: Var81
type: number
- name: Var82
type: number
- name: Var83
type: number
- name: Var84
type: number
- name: Var85
type: number
- name: Var86
type: number
- name: Var87
type: number
- name: Var88
type: number
- name: Var89
type: number
- name: Var90
type: number
- name: Var91
type: number
- name: Var92
type: number
- name: Var93
type: number
- name: Var94
type: number
- name: Var95
type: number
- name: Var96
type: number
- name: Var97
type: number
- name: Var98
type: number
- name: Var99
type: number
- name: Var100
type: number
- name: Var101
type: number
- name: Var102
type: number
- name: Var103
type: number
- name: Var104
type: number
- name: Var105
type: number
- name: Var106
type: number
- name: Var107
type: number
- name: Var108
type: number
- name: Var109
type: number
- name: Var110
type: number
- name: Var111
type: number
- name: Var112
type: number
- name: Var113
type: number
- name: Var114
type: number
- name: Var115
type: number
- name: Var116
type: number
- name: Var117
type: number
- name: Var118
type: number
- name: Var119
type: number
- name: Var120
type: number
- name: Var121
type: number
- name: Var122
type: number
- name: Var123
type: number
- name: Var124
type: number
- name: Var125
type: number
- name: Var126
type: number
- name: Var127
type: number
- name: Var128
type: number
- name: Var129
type: number
- name: Var130
type: number
- name: Var131
type: number
- name: Var132
type: number
- name: Var133
type: number
- name: Var134
type: number
- name: Var135
type: number
- name: Var136
type: number
- name: Var137
type: number
- name: Var138
type: number
- name: Var139
type: number
- name: Var140
type: number
- name: Var141
type: number
- name: Var142
type: number
- name: Var143
type: number
- name: Var144
type: number
- name: Var145
type: number
- name: Var146
type: number
- name: Var147
type: number
- name: Var148
type: number
- name: Var149
type: number
- name: Var150
type: number
- name: Var151
type: number
- name: Var152
type: number
- name: Var153
type: number
- name: Var154
type: number
- name: Var155
type: number
- name: Var156
type: number
- name: Var157
type: number
- name: Var158
type: number
- name: Var159
type: number
- name: Var160
type: number
- name: Var161
type: number
- name: Var162
type: number
- name: Var163
type: number
- name: Var164
type: number
- name: Var165
type: number
- name: Var166
type: number
- name: Var167
type: number
- name: Var168
type: number
- name: Var169
type: number
- name: Var170
type: number
- name: Var171
type: number
- name: Var172
type: number
- name: Var173
type: number
- name: Var174
type: number
- name: Var175
type: number
- name: Var176
type: number
- name: Var177
type: number
- name: Var178
type: number
- name: Var179
type: number
- name: Var180
type: number
- name: Var181
type: number
- name: Var182
type: number
- name: Var183
type: number
- name: Var184
type: number
- name: Var185
type: number
- name: Var186
type: number
- name: Var187
type: number
- name: Var188
type: number
- name: Var189
type: number
- name: Var190
type: number
- name: Var191
type: category
- name: Var192
type: category
- name: Var193
type: category
- name: Var194
type: category
- name: Var195
type: category
- name: Var196
type: category
- name: Var197
type: category
- name: Var198
type: category
- name: Var199
type: category
- name: Var200
type: category
- name: Var201
type: category
- name: Var202
type: category
- name: Var203
type: category
- name: Var204
type: category
- name: Var205
type: category
- name: Var206
type: category
- name: Var207
type: category
- name: Var208
type: category
- name: Var209
type: number
- name: Var210
type: category
- name: Var211
type: category
- name: Var212
type: category
- name: Var213
type: category
- name: Var214
type: category
- name: Var215
type: category
- name: Var216
type: category
- name: Var217
type: category
- name: Var218
type: category
- name: Var219
type: category
- name: Var220
type: category
- name: Var221
type: category
- name: Var222
type: category
- name: Var223
type: category
- name: Var224
type: category
- name: Var225
type: category
- name: Var226
type: category
- name: Var227
type: category
- name: Var228
type: category
- name: Var229
type: category
- name: Var230
type: number
- name: target
type: binary
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/kick_starter_funding.yaml
================================================
version: 1.0
name: kick_starter_funding
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/kick_starter_funding/train.csv
- https://automl-mm-bench.s3.amazonaws.com/kick_starter_funding/test.csv
sha256:
test.csv: 13c2d4b74ac8d1e258659b5f5fa74526b9d27e305f6c29ad7e853dfeeb01983c
train.csv: 3120b69f30bbc08c68940ab9e5d85d6cc2fbc9a65e8a24c66739179b6a60150e
train_filenames: train.csv
test_filenames: test.csv
description: |
Funding Successful Projects on Kickstarter
Predict whether a project will be successfully funded, using labeled data.
https://www.kaggle.com/codename007/funding-successful-projects
columns:
- name: name
type: category
- name: desc
type: text
- name: goal
type: number
- name: keywords
type: category
- name: disable_communication
type: binary
- name: country
type: category
- name: currency
type: category
- name: deadline
type: number
- name: created_at
type: number
- name: final_status
type: binary
output_features:
- name: final_status
type: binary
================================================
FILE: ludwig/datasets/configs/melbourne_airbnb.yaml
================================================
version: 1.0
name: melbourne_airbnb
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/airbnb_melbourne/train.pq
- https://automl-mm-bench.s3.amazonaws.com/airbnb_melbourne/test.pq
sha256:
test.pq: 9fe965cfdbd24ee9af7a004a7dc8c4e535a7ffceb722dce00f8ea90a54f95aa9
train.pq: c158c0f497ef355ba9d5de0de7556f6eb7f9bc343a67c4c681b014f6c7412e48
train_filenames: train.pq
test_filenames: test.pq
description: |
Melbourne Airbnb Open Data
Detailed and summarized data of Airbnb activity in Melbourne, VIC, Australia
https://www.kaggle.com/tylerx/melbourne-airbnb-open-data
columns:
- name: id
type: number
- name: listing_url
type: category
- name: scrape_id
type: number
- name: last_scraped
type: date
- name: text
type: category
- name: summary
type: text
- name: space
type: text
- name: description
type: text
- name: neighborhood_overview
type: text
- name: notes
type: text
- name: transit
type: text
- name: access
type: text
- name: interaction
type: text
- name: house_rules
type: text
- name: picture_url
type: category
- name: host_id
type: category
- name: host_url
type: category
- name: host_name
type: category
- name: host_since
type: date
- name: host_location
type: category
- name: host_about
type: text
- name: host_response_time
type: category
- name: host_response_rate
type: category
- name: host_is_superhost
type: binary
- name: host_thumbnail_url
type: category
- name: host_picture_url
type: category
- name: host_neighborhood
type: category
- name: host_verifications
type: set
- name: host_has_profile_pic
type: binary
- name: host_identity_verified
type: binary
- name: street
type: category
- name: neighborhood
type: category
- name: city
type: category
- name: suburb
type: category
- name: state
type: category
- name: zipcode
type: category
- name: smart_location
type: category
- name: country_code
type: category
- name: country
type: category
- name: latitude
type: number
- name: longitude
type: number
- name: is_location_exact
type: binary
- name: property_type
type: category
- name: room_type
type: category
- name: accommodates
type: number
- name: bathrooms
type: number
- name: bedrooms
type: number
- name: beds
type: number
- name: bed_type
type: category
- name: amenities
type: set
- name: price
type: number
- name: weekly_price
type: number
- name: monthly_price
type: number
- name: security_deposit
type: number
- name: cleaning_fee
type: number
- name: guests_included
type: number
- name: extra_people
type: number
- name: minimum_nights
type: number
- name: maximum_nights
type: number
- name: calendar_updated
type: category
- name: has_availability
type: binary
- name: availability_30
type: number
- name: availability_60
type: number
- name: availability_90
type: number
- name: availability_365
type: number
- name: calendar_last_scraped
type: date
- name: number_of_reviews
type: number
- name: first_review
type: date
- name: last_review
type: date
- name: review_scores_rating
type: number
- name: review_scores_accuracy
type: number
- name: review_scores_cleanliness
type: number
- name: review_scores_checkin
type: number
- name: review_scores_communication
type: number
- name: review_scores_location
type: number
- name: review_scores_value
type: number
- name: requires_license
type: binary
- name: license
type: category
- name: instant_bookable
type: binary
- name: cancellation_policy
type: category
- name: require_guest_profile_picture
type: binary
- name: require_guest_phone_verification
type: binary
- name: calculated_host_listings_count
type: number
- name: reviews_per_month
type: number
- name: price_label
type: number
- name: host_verifications_jumio
type: binary
- name: host_verifications_government_id
type: binary
- name: host_verifications_kba
type: binary
- name: host_verifications_zhima_selfie
type: binary
- name: host_verifications_facebook
type: binary
- name: host_verifications_work_email
type: binary
- name: host_verifications_google
type: binary
- name: host_verifications_sesame
type: binary
- name: host_verifications_manual_online
type: binary
- name: host_verifications_manual_offline
type: binary
- name: host_verifications_offline_government_id
type: binary
- name: host_verifications_selfie
type: binary
- name: host_verifications_reviews
type: binary
- name: host_verifications_identity_manual
type: binary
- name: host_verifications_sesame_offline
type: binary
- name: host_verifications_weibo
type: binary
- name: host_verifications_email
type: binary
- name: host_verifications_sent_id
type: binary
- name: host_verifications_phone
type: binary
output_features:
- name: price_label
type: category
================================================
FILE: ludwig/datasets/configs/mercari_price_suggestion.yaml
================================================
version: 1.0
name: mercari_price_suggestion
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/mercari_price_suggestion/train.pq
- https://automl-mm-bench.s3.amazonaws.com/mercari_price_suggestion/dev.pq
- https://automl-mm-bench.s3.amazonaws.com/mercari_price_suggestion/test.pq
sha256:
test.pq: 05fed940f5545e6a470ca595d014a02b173fd3362ca5bc5c458d02640b892a57
train.pq: a0613b77714ebb9f8927cf6bff2092af8143f4a66a64e45e9c3bf9d18604cfe3
dev.pq: f7284b86adde0354f30ee2c2b7a7a55dc895d202b4291138e807c8f3eaacb6b0
train_filenames: train.pq
validation_filenames: dev.pq
test_filenames: test.pq
description: |
Predict product price based on details like product category name, brand name, and item condition.
We have converted price to log price by log(1 + price).
https://www.kaggle.com/c/mercari-price-suggestion-challenge
columns:
- name: train_id
type: category
- name: name
type: category
- name: item_condition_id
type: category
- name: category_name
type: category
- name: brand_name
type: category
- name: price
type: number
- name: shipping
type: binary
- name: item_description
type: text
- name: log_price
type: number
- name: cat1
type: category
- name: cat2
type: category
- name: cat3
type: category
output_features:
- name: log_price
type: number
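The description above states that the target is `log(1 + price)`. A minimal sketch of that transform and its inverse follows, assuming the natural logarithm (the config does not state the base). The helper names `to_log_price` and `from_log_price` are hypothetical and not part of Ludwig.

```python
import math


def to_log_price(price: float) -> float:
    """Forward transform from the description: log(1 + price).

    Assumption: natural logarithm. math.log1p is accurate for small prices.
    """
    if price < 0:
        raise ValueError("price must be non-negative")
    return math.log1p(price)


def from_log_price(log_price: float) -> float:
    """Inverse transform: exp(log_price) - 1, to read predictions as prices."""
    return math.expm1(log_price)
```

Under this assumption, a model trained on `log_price` produces predictions that need `from_log_price` applied before they can be compared with the raw `price` column.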
================================================
FILE: ludwig/datasets/configs/mercari_price_suggestion100K.yaml
================================================
version: 1.0
name: mercari_price_suggestion100K
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/mercari_price_suggestion100K/train.pq
- https://automl-mm-bench.s3.amazonaws.com/mercari_price_suggestion100K/test.pq
sha256:
test.pq: 60431577bd6cb433bae287ced2edc7a557497b66b1fe90e2fbec6ffc24bf35eb
train.pq: f60063847d9b828f1e9366eb69fa53774771b53291586d1cce506c931b7173f4
train_filenames: train.pq
test_filenames: test.pq
description: |
Predict product price based on details like product category name, brand name, and item condition.
We have converted price to log price by log(1 + price).
https://www.kaggle.com/c/mercari-price-suggestion-challenge
columns:
- name: train_id
type: category
- name: name
type: category
- name: item_condition_id
type: category
- name: category_name
type: category
- name: brand_name
type: category
- name: price
type: number
- name: shipping
type: binary
- name: item_description
type: text
- name: log_price
type: number
- name: cat1
type: category
- name: cat2
type: category
- name: cat3
type: category
output_features:
- name: log_price
type: number
================================================
FILE: ludwig/datasets/configs/mercedes_benz_greener.yaml
================================================
version: 1.0
name: mercedes_benz_greener
kaggle_competition: mercedes-benz-greener-manufacturing
archive_filenames: mercedes-benz-greener-manufacturing.zip
sha256:
mercedes-benz-greener-manufacturing.zip: 91143716085345a84dc4991b8eb1d5ff80d8aa134930de946b3b24be0f2e5d1a
train_filenames: train.csv
test_filenames: test.csv
description: |
The Mercedes-Benz Greener Manufacturing dataset.
https://www.kaggle.com/c/mercedes-benz-greener-manufacturing
output_features:
- name: y
type: number
fallback_mirrors:
- name: predibase
download_paths: s3://ludwig-tests/ludwig_backup/mercedes-benz-greener-manufacturing.zip
================================================
FILE: ludwig/datasets/configs/mnist.yaml
================================================
version: 1.0
name: mnist
download_urls:
- https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
- https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
- https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
- https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
sha256:
t10k-images-idx3-ubyte.gz: 8d422c7b0a1c1c79245a5bcf07fe86e33eeafee792b84584aec276f5a2dbc4e6
train-images-idx3-ubyte.gz: 440fcabf73cc546fa21475e81ea370265605f56be210a4024d2ca8f203523609
train-labels-idx1-ubyte.gz: 3552534a0a558bbed6aed32b30c495cca23d567ec52cac8be1a0730e8010255c
t10k-labels-idx1-ubyte.gz: f7ae60f92e00ec6debd23a6088c31dbd2371eca3ffa0defaefb259924204aec6
preserve_paths:
- training
- testing
loader: mnist.MNISTLoader
description: |
The MNIST database of handwritten digits, available from this page,
has a training set of 60,000 examples, and a test set of 10,000 examples.
It is a subset of a larger set available from NIST. The digits have been
size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and
pattern recognition methods on real-world data while spending minimal
efforts on preprocessing and formatting.
http://yann.lecun.com/exdb/mnist/
columns:
- name: image_path
type: image
- name: label
type: category
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/mushroom_edibility.yaml
================================================
version: 1.0
name: mushroom_edibility
download_urls: http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
sha256:
agaricus-lepiota.data: e65d082030501a3ebcbcd7c9f7c71aa9d28fdfff463bf4cf4716a3fe13ac360e
train_filenames: agaricus-lepiota.data
description: |
This data set includes descriptions of hypothetical samples corresponding
to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525).
Each species is identified as definitely edible, definitely poisonous,
or of unknown edibility and not recommended. This latter class was combined with
the poisonous one.
columns:
- name: class
type: category
- name: cap-shape
type: category
- name: cap-surface
type: category
- name: cap-color
type: category
- name: bruises?
type: category
- name: odor
type: category
- name: gill-attachment
type: category
- name: gill-spacing
type: category
- name: gill-size
type: category
- name: gill-color
type: category
- name: stalk-shape
type: category
- name: stalk-root
type: category
- name: stalk-surface-above-ring
type: category
- name: stalk-surface-below-ring
type: category
- name: stalk-color-above-ring
type: category
- name: stalk-color-below-ring
type: category
- name: veil-type
type: category
- name: veil-color
type: category
- name: ring-number
type: category
- name: ring-type
type: category
- name: spore-print-color
type: category
- name: population
type: category
- name: habitat
type: category
output_features:
- name: class
type: category
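The `sha256` mapping in these configs pairs each downloaded filename with a hex digest. A minimal sketch of how such a digest could be checked locally is below; it uses only `hashlib` from the standard library, and the helper names `sha256_of_file` and `matches_expected` are hypothetical, not Ludwig's own verification code.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the value listed under `sha256:`."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("agaricus-lepiota.data", "e65d0820...")` would be called with the full digest from the config once the file has been downloaded.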
================================================
FILE: ludwig/datasets/configs/naval.yaml
================================================
version: 1.0
name: naval
download_urls: http://archive.ics.uci.edu/ml/machine-learning-databases/00316/UCI%20CBM%20Dataset.zip
sha256:
UCI%20CBM%20Dataset.zip: 91a3815da80b5ab7e2d5b82ac82f1c2cbf89182c7a65bcdf240db1e014423cb9
dataset_filenames: UCI CBM Dataset/data.txt
loader: naval.NavalLoader
description: |
Condition Based Maintenance of Naval Propulsion Plants Data Set
http://archive.ics.uci.edu/ml/datasets/condition+based+maintenance+of+naval+propulsion+plants
columns:
- name: lp
type: number
- name: v
type: number
- name: gtt
type: number
- name: gtn
type: number
- name: ggn
type: number
- name: ts
type: number
- name: tp
type: number
- name: t48
type: number
- name: t1
type: number
- name: t2
type: number
- name: p48
type: number
- name: p1
type: number
- name: p2
type: number
- name: pexh
type: number
- name: tic
type: number
- name: mf
type: number
- name: gtcdsc
type: number
- name: gttdsc
type: number
output_features:
- name: gtcdsc
type: number
================================================
FILE: ludwig/datasets/configs/news_channel.yaml
================================================
version: 1.0
name: news_channel
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/news_channel/train.csv
- https://automl-mm-bench.s3.amazonaws.com/news_channel/test.csv
sha256:
test.csv: d48e7261dce69964eb1163c89e05261b8732c676b10de9b40339b2d95559c9c3
train.csv: 46e433fcf070ec684cfaf30bada482a73637e8dd954edc3e1fe860de8e661055
train_filenames: train.csv
test_filenames: test.csv
description: |
Online News Popularity Data Set
This dataset summarizes a heterogeneous set of features about articles
published by Mashable in a period of two years. The goal is to predict
the number of shares in social networks (popularity).
https://archive.ics.uci.edu/ml/datasets/online+news+popularity
columns: # Many of these columns have a leading space
- name: n_tokens_content
type: number
- name: n_unique_tokens
type: number
- name: n_non_stop_words
type: number
- name: n_non_stop_unique_tokens
type: number
- name: num_hrefs
type: number
- name: num_self_hrefs
type: number
- name: num_imgs
type: number
- name: num_videos
type: number
- name: average_token_length
type: number
- name: num_keywords
type: number
- name: global_subjectivity
type: number
- name: global_sentiment_polarity
type: number
- name: global_rate_positive_words
type: number
- name: global_rate_negative_words
type: number
- name: rate_positive_words
type: number
- name: rate_negative_words
type: number
- name: article_title
type: text
- name: channel
type: category
output_features:
- name: channel
type: category
================================================
FILE: ludwig/datasets/configs/news_popularity2.yaml
================================================
version: 1.0
name: news_popularity2
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/news_popularity2/train.csv
- https://automl-mm-bench.s3.amazonaws.com/news_popularity2/test.csv
sha256:
test.csv: 276effa981456e187fb1fc07abd8556d240e1a110fc5c096f2ad75a4082d1ccb
train.csv: 3673a07b87dbe09a9073e5ab83241681f561984269a9dc5411018fd9bca70b71
train_filenames: train.csv
test_filenames: test.csv
description: |
Online News Popularity Data Set
This dataset summarizes a heterogeneous set of features about articles
published by Mashable in a period of two years. The goal is to predict
the number of shares in social networks (popularity).
https://archive.ics.uci.edu/ml/datasets/online+news+popularity
columns:
- name: n_tokens_content
type: number
- name: average_token_length
type: number
- name: num_keywords
type: number
- name: log_shares
type: number
- name: article_title
type: text
output_features:
- name: log_shares
type: number
================================================
FILE: ludwig/datasets/configs/noshow_appointments.yaml
================================================
version: 1.0
name: noshow_appointments
kaggle_dataset_id: joniarroba/noshowappointments
archive_filenames: noshowappointments.zip
sha256:
noshowappointments.zip: 4b4f258837029bd4e61ed4c9bab2ce8a3b8a299d1a4f5bdabcc98967d5e29a43
loader: split_loaders.RandomSplitLoader
description: |
110,527 medical appointments and their 14 associated variables (characteristics).
The most important one is whether the patient showed up or was a no-show for the appointment.
https://www.kaggle.com/datasets/joniarroba/noshowappointments
columns:
- name: PatientId
type: category
- name: AppointmentID
type: category
- name: Gender
type: binary
- name: ScheduledDay
type: date
- name: AppointmentDay
type: date
- name: Age
type: number
- name: Neighbourhood
type: category
- name: Scholarship
type: binary
- name: Hipertension
type: binary
- name: Diabetes
type: binary
- name: Alcoholism
type: binary
- name: Handcap
type: binary
- name: SMS_received
type: binary
- name: No-show
type: binary
output_features:
- name: No-show
type: binary
================================================
FILE: ludwig/datasets/configs/numerai28pt6.yaml
================================================
version: 1.0
name: numerai28pt6
kaggle_dataset_id: numerai/encrypted-stock-market-data-from-numerai
archive_filenames: encrypted-stock-market-data-from-numerai.zip
sha256:
encrypted-stock-market-data-from-numerai.zip: cc0714c5f4c8ac6b212f7569641c5110bd2296547af434cba77184ebb03f304b
description: |
Encrypted Stock Market Data from Numerai dataset from Kaggle.
columns:
- name: feature1
type: number
- name: feature2
type: number
- name: feature3
type: number
- name: feature4
type: number
- name: feature5
type: number
- name: feature6
type: number
- name: feature7
type: number
- name: feature8
type: number
- name: feature9
type: number
- name: feature10
type: number
- name: feature11
type: number
- name: feature12
type: number
- name: feature13
type: number
- name: feature14
type: number
- name: feature15
type: number
- name: feature16
type: number
- name: feature17
type: number
- name: feature18
type: number
- name: feature19
type: number
- name: feature20
type: number
- name: feature21
type: number
- name: target
type: binary
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/ohsumed_7400.yaml
================================================
version: 1.0
name: ohsumed_7400
kaggle_dataset_id: weipengfei/ohr8r52
archive_filenames: ohr8r52.zip
sha256:
ohr8r52.zip: 93c7a8817a32b994d93267506ad766281764ba9382e3f4f9d978544cebab6ca4
train_filenames: oh/oh-train-stemmed.csv
validation_filenames: oh/oh-dev-stemmed.csv
test_filenames: oh/oh-test-stemmed.csv
description: |
The Ohsumed corpus is extracted from the MEDLINE database. Because MEDLINE is designed for multi-label
classification, we remove the texts that have two or more labels.
https://www.kaggle.com/datasets/weipengfei/ohr8r52
columns:
- name: text
type: text
- name: edge
type: text
- name: intent
type: category
output_features:
- name: intent
type: category
================================================
FILE: ludwig/datasets/configs/ohsumed_cmu.yaml
================================================
version: 1.0
name: ohsumed_cmu
download_urls: http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/ohsumed-allcats-6.zip
sha256:
ohsumed-allcats-6.zip: 3f2f6c4e27faaac1c8dc179a121bed92d6adbdf91a1e11d2d124f7bd963798da
description: |
OHSUMED is a well-known medical abstracts dataset. It contains 348,566 references,
and is still used for research and development.
This is a subset of OHSUMED containing 6 categories, from this CMU course:
http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/
columns:
- name: text
type: text
- name: class
type: category
output_features:
- name: class
type: category
================================================
FILE: ludwig/datasets/configs/otto_group_product.yaml
================================================
version: 1.0
name: otto_group_product
kaggle_competition: otto-group-product-classification-challenge
archive_filenames: otto-group-product-classification-challenge.zip
sha256:
otto-group-product-classification-challenge.zip: 81d1fa5805036772b7a2a2425311fdc7b1568af4fbb42f0ec8f9661d0d21ce42
train_filenames: train.csv
test_filenames: test.csv
description: |
The Otto Group Product Classification Challenge
https://www.kaggle.com/c/otto-group-product-classification-challenge/overview
output_features:
- name: target
type: category
================================================
FILE: ludwig/datasets/configs/poker_hand.yaml
================================================
version: 1.0
name: poker_hand
download_urls:
- http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data
- http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data
train_filenames: poker-hand-training-true.data
test_filenames: poker-hand-testing.data
sha256:
poker-hand-testing.data: 3cd75958e19dd321ed5ca3f7f154c0f6aad544aab9f37731ac545b5f66b232c7
poker-hand-training-true.data: 37becdf87d5f8cbf2b91d6471e965a25b86cb4a6d878c0f94a4025969fca464f
description: |
Each record is an example of a hand consisting of five playing cards
drawn from a standard deck of 52. Each card is described using two
attributes (suit and rank), for a total of 10 predictive attributes.
There is one Class attribute that describes the "Poker Hand". The
order of cards is important, which is why there are 480 possible
Royal Flush hands as compared to 4.
https://archive.ics.uci.edu/ml/datasets/Poker+Hand
columns:
- name: S1
type: number
- name: C1
type: number
- name: S2
type: number
- name: C2
type: number
- name: S3
type: number
- name: C3
type: number
- name: S4
type: number
- name: C4
type: number
- name: S5
type: number
- name: C5
type: number
- name: hand
type: category
output_features:
- name: hand
type: category
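The description above notes that card order matters, giving 480 Royal Flush hands rather than 4. The arithmetic can be checked with a short sketch: one Royal Flush per suit, times the number of orderings of five distinct cards. The variable names are illustrative only.

```python
import math

SUITS = 4            # one Royal Flush (10, J, Q, K, A) per suit
CARDS_PER_HAND = 5

# Ignoring order, there is exactly one Royal Flush per suit.
unordered_royal_flushes = SUITS

# With order, each of those hands can be arranged in 5! ways.
ordered_royal_flushes = SUITS * math.factorial(CARDS_PER_HAND)
```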
================================================
FILE: ludwig/datasets/configs/porto_seguro_safe_driver.yaml
================================================
version: 1.0
name: porto_seguro_safe_driver
kaggle_competition: porto-seguro-safe-driver-prediction
archive_filenames: porto-seguro-safe-driver-prediction.zip
sha256:
porto-seguro-safe-driver-prediction.zip: 53dd7b67b9b3df088c4e0814cba7317d3bc8f76094c726471c8f91e84f61ccdc
train_filenames: train.csv
test_filenames: test.csv
description: |
Predict the probability that an auto insurance policy holder files a claim.
https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/product_sentiment_machine_hack.yaml
================================================
version: 1.0
name: product_sentiment_machine_hack
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/machine_hack_product_sentiment/train.csv
- https://automl-mm-bench.s3.amazonaws.com/machine_hack_product_sentiment/dev.csv
sha256:
dev.csv: 33adff4dba7d9322397b398900c20f678d3fffc5d87b0ea825d9aa497a343150
train.csv: 85a229e162b6d8c4839d1b27f834c36ae5e244fd027534fe62a888d4f536f0ef
train_filenames: train.csv
test_filenames: dev.csv
description: |
We challenge the machinehackers community to develop a machine learning model
to accurately classify various products into 4 different classes of sentiments
based on the raw text review provided by the user.
https://www.machinehack.com/hackathons/product_sentiment_classification_weekend_hackathon_19/overview
columns:
- name: Text_ID
type: category
- name: Product_Description
type: text
- name: Product_Type
type: category
- name: Sentiment
type: category
output_features:
- name: Sentiment
type: category
================================================
FILE: ludwig/datasets/configs/protein.yaml
================================================
version: 1.0
name: protein
download_urls: http://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv
sha256:
CASP.csv: 4277cfcb4e91a181746cbc654f001b57951c9e6a80f4f795fdb5c807e0848f40
description: |
Physicochemical Properties of Protein Tertiary Structure Data Set.
https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure
columns:
- name: RMSD
type: number
- name: F1
type: number
- name: F2
type: number
- name: F3
type: number
- name: F4
type: number
- name: F5
type: number
- name: F6
type: number
- name: F7
type: number
- name: F8
type: number
- name: F9
type: number
output_features:
- name: RMSD
type: number
================================================
FILE: ludwig/datasets/configs/reuters_cmu.yaml
================================================
version: 1.0
name: reuters_cmu
download_urls: http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/reuters-allcats-6.zip
sha256:
reuters-allcats-6.zip: 304ae223f9ca35f7ce9066c9d31558c06ed5c72cd91faa885f82b928b2aa6f34
description: |
Reuters-21578 is a well-known newswire dataset containing 21,578 documents.
This is a subset of Reuters-21578 using only 6 categories, from this CMU course:
http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW2/
columns:
- name: text
type: text
- name: class
type: category
output_features:
- name: class
type: category
================================================
FILE: ludwig/datasets/configs/reuters_r8.yaml
================================================
version: 1.0
name: reuters_r8
kaggle_dataset_id: weipengfei/ohr8r52
archive_filenames: ohr8r52.zip
sha256:
ohr8r52.zip: 93c7a8817a32b994d93267506ad766281764ba9382e3f4f9d978544cebab6ca4
train_filenames: r8/r8-train-stemmed.csv
validation_filenames: r8/r8-dev-stemmed.csv
test_filenames: r8/r8-test-stemmed.csv
description: |
Reuters R8 subset of Reuters 21578 dataset from Kaggle.
columns:
- name: text
type: text
- name: edge
type: text
- name: intent
type: category
output_features:
- name: intent
type: category
================================================
FILE: ludwig/datasets/configs/rossman_store_sales.yaml
================================================
version: 1.0
name: rossman_store_sales
kaggle_competition: rossmann-store-sales
archive_filenames: rossmann-store-sales.zip
sha256:
rossmann-store-sales.zip: 52ce715e02dc70cac16b14548580d656997f5d43ce3544220d5e574d26483cf3
loader: rossman_store_sales.RossmanStoreSalesLoader
description: |
The Rossmann Store Sales dataset.
Using the time split from the catboost benchmark
https://github.com/catboost/benchmarks/tree/master/kaggle/rossmann-store-sales
that is used in the TabNet paper,
because the test set does not contain sales ground truth.
output_features:
- name: Sales
type: number
================================================
FILE: ludwig/datasets/configs/santander_customer_satisfaction.yaml
================================================
version: 1.0
name: santander_customer_satisfaction
kaggle_competition: santander-customer-satisfaction
archive_filenames: santander-customer-satisfaction.zip
sha256:
santander-customer-satisfaction.zip: d4c2d068d8041af168d82d0eef7ad0b53ddd1d7fca9aba4e5d88fa1f957ee594
train_filenames: train.csv
test_filenames: test.csv
description: |
Santander Customer Satisfaction Prediction.
https://www.kaggle.com/c/santander-customer-satisfaction/overview
output_features:
- name: TARGET
type: binary
================================================
FILE: ludwig/datasets/configs/santander_customer_transaction.yaml
================================================
version: 1.0
name: santander_customer_transaction
kaggle_competition: santander-customer-transaction-prediction
archive_filenames: santander-customer-transaction-prediction.zip
sha256:
santander-customer-transaction-prediction.zip: b3a56d036b493a9cf0695018c968baba1ba7ef8c39d842cc5626e72f13c0ec69
train_filenames: train.csv
test_filenames: test.csv
description: |
Santander Customer Transaction Prediction.
https://www.kaggle.com/c/santander-customer-transaction-prediction/overview
output_features:
- name: target
type: binary
================================================
FILE: ludwig/datasets/configs/santander_value_prediction.yaml
================================================
version: 1.0
name: santander_value_prediction
kaggle_competition: santander-value-prediction-challenge
archive_filenames: santander-value-prediction-challenge.zip
sha256:
santander-value-prediction-challenge.zip: a8b44a0403bff6ab42f2bd1da8d9cbaf98f1fd4b9ea7a86e47491ac996384bf4
train_filenames: train.csv
loader: santander_value_prediction.SantanderValuePredictionLoader
description: |
The Santander Value Prediction Challenge dataset.
https://www.kaggle.com/c/santander-value-prediction-challenge
output_features:
- name: target
type: number
================================================
FILE: ludwig/datasets/configs/sarcastic_headlines.yaml
================================================
version: 1.0
name: sarcastic_headlines
train_filenames: Sarcasm_Headlines_Dataset.json
archive_filenames: news-headlines-dataset-for-sarcasm-detection.zip
sha256:
news-headlines-dataset-for-sarcasm-detection.zip: 3728f0fbce563536c3c67ab92e343e3ebcdc5cf1feaf4980c3abd4e54109eb51
kaggle_dataset_id: rmisra/news-headlines-dataset-for-sarcasm-detection
description: A dataset to determine if a news headline is sarcastic or serious.
loader: sarcastic_headlines.SarcasticHeadlinesLoader
columns:
- name: article_link
type: category
- name: headline
type: text
- name: is_sarcastic
type: binary
output_features:
- name: is_sarcastic
type: binary
================================================
FILE: ludwig/datasets/configs/sarcos.yaml
================================================
version: 1.0
name: sarcos
download_urls:
- http://www.gaussianprocess.org/gpml/data/sarcos_inv.mat
- http://www.gaussianprocess.org/gpml/data/sarcos_inv_test.mat
sha256:
sarcos_inv_test.mat: 161a59b5c3b4f4b404584323f181607b2acbe620eb134dc720760dc3f38f5cec
sarcos_inv.mat: b8a249733253ba6097372fedee7696833fcf30de42037d5b4a7227f21a6d1d97
train_filenames: sarcos_inv.mat
test_filenames: sarcos_inv_test.mat
loader: sarcos.SarcosLoader
description: |
The data relates to an inverse dynamics problem for a seven
degrees-of-freedom SARCOS anthropomorphic robot arm.
The task is to map from a 21-dimensional input space
(7 joint positions, 7 joint velocities, 7 joint accelerations)
to the corresponding 7 joint torques.
http://gaussianprocess.org/gpml/data/
columns:
- name: position_1
type: number
- name: position_2
type: number
- name: position_3
type: number
- name: position_4
type: number
- name: position_5
type: number
- name: position_6
type: number
- name: position_7
type: number
- name: velocity_1
type: number
- name: velocity_2
type: number
- name: velocity_3
type: number
- name: velocity_4
type: number
- name: velocity_5
type: number
- name: velocity_6
type: number
- name: velocity_7
type: number
- name: acceleration_1
type: number
- name: acceleration_2
type: number
- name: acceleration_3
type: number
- name: acceleration_4
type: number
- name: acceleration_5
type: number
- name: acceleration_6
type: number
- name: acceleration_7
type: number
- name: torque_1
type: number
- name: torque_2
type: number
- name: torque_3
type: number
- name: torque_4
type: number
- name: torque_5
type: number
- name: torque_6
type: number
- name: torque_7
type: number
output_features:
- name: torque_1
type: number
fallback_mirrors:
- name: predibase
download_paths:
- s3://ludwig-tests/ludwig_backup/sarcos_inv.mat
- s3://ludwig-tests/ludwig_backup/sarcos_inv_test.mat
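The description above maps a 21-dimensional input (7 positions, 7 velocities, 7 accelerations) to 7 joint torques, and the config lists the columns in that order. A minimal sketch of splitting one 28-value row accordingly is below. The function `split_sarcos_row` is hypothetical, and it assumes rows follow the column order listed in the config.

```python
from typing import List, Sequence, Tuple

N_INPUTS = 21   # 7 joint positions + 7 velocities + 7 accelerations
N_TORQUES = 7   # one torque per joint


def split_sarcos_row(row: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split a row into (inputs, torques), assuming the config's column order."""
    if len(row) != N_INPUTS + N_TORQUES:
        raise ValueError(f"expected {N_INPUTS + N_TORQUES} values, got {len(row)}")
    return list(row[:N_INPUTS]), list(row[N_INPUTS:])
```

Note that the config declares only `torque_1` as the output feature, which under this layout is the first element of the torque slice.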
================================================
FILE: ludwig/datasets/configs/sst2.yaml
================================================
version: 1.0
name: sst2
download_urls: https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
sha256:
stanfordSentimentTreebank.zip: 3f5209483b46bbf129cacbbbe6ae02fe780407034f61cf6342b7833257c3f1db
train_filenames: train.csv
validation_filenames: dev.csv
test_filenames: test.csv
loader: sst.SST2Loader
description: |
The SST2 dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
This dataset contains binary labels (positive or negative) for each sample.
The original dataset specified 5 labels:
very negative, negative, neutral, positive, very positive with
the following cutoffs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
In the construction of this dataset, we remove all neutral phrases
and assign a negative label if the original rating falls
into the following range: [0, 0.4] and a positive label
if the original rating is between (0.6, 1.0].
columns:
- name: sentence
type: text
- name: label
type: binary
output_features:
- name: label
type: binary
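The SST2 description above drops neutral phrases and labels ratings in [0, 0.4] as negative and ratings in (0.6, 1.0] as positive. A minimal sketch of that rule is below. The function `sst2_label` is hypothetical, and encoding negative as 0 and positive as 1 is an assumption, not something the config states.

```python
from typing import Optional


def sst2_label(rating: float) -> Optional[int]:
    """Map a sentiment rating in [0, 1] to a binary label.

    Returns 0 for [0, 0.4], 1 for (0.6, 1.0], and None for neutral
    ratings in (0.4, 0.6], which the description says are removed.
    """
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating must lie in [0, 1]")
    if rating <= 0.4:
        return 0
    if rating > 0.6:
        return 1
    return None
```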
================================================
FILE: ludwig/datasets/configs/sst3.yaml
================================================
version: 1.0
name: sst3
download_urls: https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
sha256:
stanfordSentimentTreebank.zip: 3f5209483b46bbf129cacbbbe6ae02fe780407034f61cf6342b7833257c3f1db
train_filenames: train.csv
validation_filenames: dev.csv
test_filenames: test.csv
loader: sst.SST3Loader
description: |
The SST3 dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
The original dataset contains five labels (very negative, negative, neutral,
positive, very positive) for each sample.
In this dataset, the 3 labels negative, neutral, positive have the following cutoffs:
[0, 0.4], (0.4, 0.6], (0.6, 1.0]
columns:
- name: sentence
type: text
- name: label
type: category
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/sst5.yaml
================================================
version: 1.0
name: sst5
download_urls: https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
sha256:
stanfordSentimentTreebank.zip: 3f5209483b46bbf129cacbbbe6ae02fe780407034f61cf6342b7833257c3f1db
train_filenames: train.csv
validation_filenames: dev.csv
test_filenames: test.csv
loader: sst.SST5Loader
description: |
The SST5 dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
This dataset contains five labels (very negative, negative, neutral,
positive, very positive) for each sample.
In the original dataset, the 5 labels: very negative, negative, neutral, positive,
and very positive have the following cutoffs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
columns:
- name: sentence
type: text
- name: label
type: category
output_features:
- name: label
type: category
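The five right-closed intervals in the SST5 description can be expressed as a lookup over their upper bounds. The sketch below uses `bisect.bisect_left` so that a rating equal to a cutoff falls in the lower bucket, matching intervals such as (0.2, 0.4]. The function `sst5_label` and the string label names are illustrative assumptions, not the loader's actual encoding.

```python
from bisect import bisect_left

# Upper bounds of the first four intervals; the last interval ends at 1.0.
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def sst5_label(rating: float) -> str:
    """Map a rating in [0, 1] to one of the five labels from the description."""
    if not 0.0 <= rating <= 1.0:
        raise ValueError("rating must lie in [0, 1]")
    # bisect_left returns the index of the first bound >= rating, so a
    # rating exactly on a cutoff stays in the interval that closes there.
    return LABELS[bisect_left(UPPER_BOUNDS, rating)]
```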
================================================
FILE: ludwig/datasets/configs/synthetic_fraud.yaml
================================================
version: 1.0
name: synthetic_fraud
kaggle_dataset_id: ealaxi/paysim1
archive_filenames: paysim1.zip
sha256:
paysim1.zip: f7eef9ffad5cfa64a034143a5c9b30491d189420b273d5ad5723ca40b596613d
description: |
The Synthetic Financial Datasets For Fraud Detection dataset.
https://www.kaggle.com/ealaxi/paysim1
columns:
- name: step
type: category
- name: type
type: category
- name: amount
type: number
- name: nameOrig
type: category
- name: oldbalanceOrg
type: number
- name: newbalanceOrig
type: number
- name: nameDest
type: category
- name: oldbalanceDest
type: number
- name: newbalanceDest
type: number
- name: isFraud
type: binary
- name: isFlaggedFraud
type: binary
output_features:
- name: isFraud
type: binary
================================================
FILE: ludwig/datasets/configs/talkingdata_adtrack_fraud.yaml
================================================
version: 1.0
name: talkingdata_adtrack_fraud_detection
kaggle_competition: talkingdata-adtracking-fraud-detection
archive_filenames: talkingdata-adtracking-fraud-detection.zip
sha256:
talkingdata-adtracking-fraud-detection.zip: 4441bea984e936db153aba30627b222cb1685021efb887bd22d78771fb793735
train_filenames: train.csv
description: |
TalkingData AdTracking Fraud Detection Challenge.
https://www.kaggle.com/competitions/talkingdata-adtracking-fraud-detection/overview
output_features:
- name: is_attributed
type: binary
================================================
FILE: ludwig/datasets/configs/telco_customer_churn.yaml
================================================
version: 1.0
name: telco_customer_churn
kaggle_dataset_id: blastchar/telco-customer-churn
archive_filenames: telco-customer-churn.zip
dataset_filenames: WA_Fn-UseC_-Telco-Customer-Churn.csv
sha256:
telco-customer-churn.zip: cf7e6dcd8a238ecaa841a7d133142525453992d8d5e3ef6d1e5f0d359e7bf444
description: |
The Telco customer churn data contains information about a fictional telco company
that provided home phone and Internet services to customers. Each row represents a
customer, and each column contains a customer attribute described in the column metadata.
https://www.kaggle.com/datasets/blastchar/telco-customer-churn
columns:
- name: customerID
type: category
- name: gender
type: binary
- name: SeniorCitizen
type: binary
- name: Partner
type: binary
- name: Dependents
type: binary
- name: tenure
type: number
- name: PhoneService
type: binary
- name: MultipleLines
type: category
- name: InternetService
type: category
- name: OnlineSecurity
type: category
- name: OnlineBackup
type: category
- name: DeviceProtection
type: category
- name: TechSupport
type: category
- name: StreamingTV
type: category
- name: StreamingMovies
type: category
- name: Contract
type: category
- name: PaperlessBilling
type: binary
- name: PaymentMethod
type: category
- name: MonthlyCharges
type: number
- name: TotalCharges
type: number
- name: Churn
type: binary
output_features:
- name: Churn
type: binary
================================================
FILE: ludwig/datasets/configs/temperature.yaml
================================================
version: 1.0
name: temperature
kaggle_dataset_id: selfishgene/historical-hourly-weather-data
archive_filenames: historical-hourly-weather-data.zip
sha256:
historical-hourly-weather-data.zip: db40ffce67318f366115b82a6f693d6dc82c808f23514e2ddae56c0434f606d7
dataset_filenames: temperature.csv
description: |
Hourly temperature dataset from Kaggle
https://www.kaggle.com/selfishgene/historical-hourly-weather-data
columns:
- name: datetime
type: date
- name: Vancouver
type: number
- name: Portland
type: number
- name: San Francisco
type: number
- name: Seattle
type: number
- name: Los Angeles
type: number
- name: San Diego
type: number
- name: Las Vegas
type: number
- name: Phoenix
type: number
- name: Albuquerque
type: number
- name: Denver
type: number
- name: San Antonio
type: number
- name: Dallas
type: number
- name: Houston
type: number
- name: Kansas City
type: number
- name: Minneapolis
type: number
- name: Saint Louis
type: number
- name: Chicago
type: number
- name: Nashville
type: number
- name: Indianapolis
type: number
- name: Atlanta
type: number
- name: Detroit
type: number
- name: Jacksonville
type: number
- name: Charlotte
type: number
- name: Miami
type: number
- name: Pittsburgh
type: number
- name: Toronto
type: number
- name: Philadelphia
type: number
- name: New York
type: number
- name: Montreal
type: number
- name: Boston
type: number
- name: Beersheba
type: number
- name: Tel Aviv District
type: number
- name: Eilat
type: number
- name: Haifa
type: number
- name: Nahariyya
type: number
- name: Jerusalem
type: number
output_features:
- name: San Francisco
type: number
================================================
FILE: ludwig/datasets/configs/titanic.yaml
================================================
version: 1.0
name: titanic
kaggle_competition: titanic
archive_filenames: titanic.zip
sha256:
titanic.zip: bb1bda464cc6819d412b41d34be69fd89d26b372dc24c09421c3dbca1b0dbe9f
train_filenames: train.csv
test_filenames: test.csv
description: |
The Titanic dataset: use machine learning to create a model
that predicts which passengers survived the Titanic shipwreck.
https://www.kaggle.com/c/titanic
output_features:
- name: Survived
type: binary
================================================
FILE: ludwig/datasets/configs/twitter_bots.yaml
================================================
version: 1.0
name: twitter_bots
kaggle_dataset_id: danieltreiman/twitter-human-bots-dataset
archive_filenames: twitter-human-bots-dataset.zip
dataset_filenames: twitter_human_bots_dataset.csv
sha256:
twitter-human-bots-dataset.zip: 16ffaad719ebb9688231844a80f92901c5efb1ff96eafeb869dc5de07b323cdd
preserve_paths:
- profile_images
- profile_background_images
description: |
A dataset for Twitter Bot account detection.
https://www.kaggle.com/datasets/davidmartngutirrez/twitter-bots-accounts
columns:
- name: created_at
type: date
- name: default_profile
type: binary
- name: default_profile_image
type: binary
- name: description
type: text
- name: favourites_count
type: number
- name: followers_count
type: number
- name: friends_count
type: number
- name: geo_enabled
type: binary
- name: id
type: category
- name: lang
type: category
- name: location
type: category
- name: profile_background_image_url
type: category
- name: profile_image_url
type: category
- name: screen_name
type: category
- name: statuses_count
type: number
- name: verified
type: binary
- name: average_tweets_per_day
type: number
- name: account_age_days
type: number
- name: account_type
type: category
- name: profile_image_path
type: image
- name: profile_background_image_path
type: image
output_features:
- name: account_type
type: binary
================================================
FILE: ludwig/datasets/configs/walmart_recruiting.yaml
================================================
version: 1.0
name: walmart_recruiting
kaggle_competition: walmart-recruiting-trip-type-classification
archive_filenames: walmart-recruiting-trip-type-classification.zip
sha256:
walmart-recruiting-trip-type-classification.zip: 4c0ad71034d0b907e018adcb00c7b2835d2c30abe770fde5ce8719d7b89d4de6
train_filenames: train.csv
description: |
Walmart Recruiting: Trip Type Classification
https://www.kaggle.com/c/walmart-recruiting-trip-type-classification
output_features:
- name: TripType
type: category
================================================
FILE: ludwig/datasets/configs/wine_reviews.yaml
================================================
version: 1.0
name: wine_reviews
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/wine_reviews/train.csv
- https://automl-mm-bench.s3.amazonaws.com/wine_reviews/test.csv
sha256:
test.csv: c862d1af572659406ab39356a25c7d5e9b7c8570a89e069311fca1abb6bf1849
train.csv: c54101bb07571a3df0723e93a5f7c48123dd792b316396db4404a04bcf1809cb
train_filenames: train.csv
test_filenames: test.csv
description: |
Wine Reviews
130k wine reviews with variety, location, winery, price, and description
https://www.kaggle.com/datasets/zynicide/wine-reviews
columns:
- name: country
type: category
- name: description
type: text
- name: points
type: number
- name: price
type: number
- name: province
type: category
- name: variety
type: category
output_features:
- name: points
type: number
================================================
FILE: ludwig/datasets/configs/wmt15.yaml
================================================
version: 1.0
name: wmt15
kaggle_dataset_id: dhruvildave/en-fr-translation-dataset
archive_filenames: en-fr-translation-dataset.zip
sha256:
en-fr-translation-dataset.zip: 5fb911b327f2f36ea32315b4754f6aef95e6830562eec7054d31d614dd53d93c
description: |
French/English parallel texts for training translation models.
Over 22.5 million sentences in French and English.
https://www.kaggle.com/dhruvildave/en-fr-translation-dataset
output_features:
- name: en
type: text
================================================
FILE: ludwig/datasets/configs/women_clothing_review.yaml
================================================
version: 1.0
name: women_clothing_review
download_urls:
- https://automl-mm-bench.s3.amazonaws.com/women_clothing_review/train.pq
- https://automl-mm-bench.s3.amazonaws.com/women_clothing_review/test.pq
sha256:
test.pq: 477de72fe7e672ef87e1eca00de312f55ba884a9b80fbd04fa79c0d0159e5593
train.pq: 1b3d248397cee76a6ccff814560f29ae3d66eeb26a6e97ac0837e021629bc740
train_filenames: train.pq
test_filenames: test.pq
description: |
Women's E-Commerce Clothing Reviews
23,000 Customer Reviews and Ratings
https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
columns:
- name: Clothing ID
type: category
- name: Age
type: number
- name: Title
type: text
- name: Review Text
type: text
- name: Rating
type: number
- name: Recommended IND
type: binary
- name: Positive Feedback Count
type: number
- name: Division Name
type: category
- name: Department Name
type: category
- name: Class Name
type: category
output_features:
- name: Rating
type: number
================================================
FILE: ludwig/datasets/configs/yahoo_answers.yaml
================================================
version: 1.0
name: yahoo_answers
download_urls: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz
sha256:
yahoo_answers_csv.tgz: 2d4277855faf8b35259009425fa8f7fe1888b5644b47165508942d000f4c96ae
train_filenames: yahoo_answers_csv/train.csv
test_filenames: yahoo_answers_csv/test.csv
description: |
The Yahoo Answers dataset
Details:
The 10 largest main categories from the Yahoo! Answers \
Comprehensive Questions and Answers version 1.0 dataset. \
Each class contains 140,000 training samples and 5,000 \
testing samples.
Dataset source:
Character-level Convolutional Networks for Text Classification
Xiang Zhang et al., 2015
https://arxiv.org/abs/1509.01626
columns:
- name: label
type: category
- name: question_title
type: text
- name: question
type: text
- name: best_answer
type: text
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/yelp_review_polarity.yaml
================================================
version: 1.0
name: yelp_review_polarity
download_urls: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
sha256:
yelp_review_polarity_csv.tgz: 528f22e286cad085948acbc3bea7e58188416546b0e364d0ae4ca0ce666abe35
train_filenames: yelp_review_polarity_csv/train.csv
test_filenames: yelp_review_polarity_csv/test.csv
description: |
The Yelp Polarity dataset
Details:
1,569,264 samples from the Yelp Dataset Challenge 2015.
This subset has 280,000 training samples and 19,000 test samples
in each polarity.
Dataset source:
Character-level Convolutional Networks for Text Classification
Xiang Zhang et al., 2015
columns:
- name: label
type: binary
- name: text
type: text
output_features:
- name: label
type: binary
================================================
FILE: ludwig/datasets/configs/yelp_reviews.yaml
================================================
version: 1.0
name: yelp_reviews
download_urls: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz
sha256:
yelp_review_full_csv.tgz: 56006b0a17a370f1e366504b1f2c3e3754e4a3dda17d3e718a885c552869a559
train_filenames: yelp_review_full_csv/train.csv
test_filenames: yelp_review_full_csv/test.csv
description: |
The Yelp Reviews dataset
Details:
1,569,264 samples from the Yelp Dataset Challenge 2015.
This subset has 130,000 training samples and 10,000
testing samples in each star rating.
Dataset source:
Character-level Convolutional Networks for Text Classification
Xiang Zhang et al., 2015
columns:
- name: label
type: category
- name: text
type: text
output_features:
- name: label
type: category
================================================
FILE: ludwig/datasets/configs/yosemite.yaml
================================================
version: 1.0
name: yosemite
download_urls: https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets_raw/yosemite_temps.csv
sha256:
yosemite_temps.csv: c0ec9f2cb4bbf0bc53f7bfd2e39f88ae21e43b7b8912b2d1eb8185055f9510e2
description: |
Yosemite temperatures dataset.
As found in https://github.com/ourownstory/neural_prophet
columns:
- name: ds
type: date
- name: y
type: number
output_features:
- name: y
type: number
================================================
FILE: ludwig/datasets/dataset_config.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from dataclasses import dataclass, field
from dataclasses_json import dataclass_json
@dataclass_json
@dataclass
class DatasetFallbackMirror:
# Name of the mirror
name: str
# List of paths to download from. Must map 1:1 to DatasetConfig.download_urls or to the archive_filenames
# that we get from Kaggle.
download_paths: str | list[str]
@dataclass_json
@dataclass
class DatasetConfig:
"""The configuration of a Ludwig dataset."""
# The version of the dataset.
version: str
# The name of the dataset. Make this a valid Python module name; it should not contain spaces or dashes.
name: str
# The readable description of the dataset
description: str = ""
# Fallback mirrors. Paths must be in local/remote filesystems.
fallback_mirrors: list[DatasetFallbackMirror] | None = None
# Optional. The (suggested) output features for this dataset. Helps users discover new datasets and filter for
# relevance to a specific machine learning setting.
output_features: list[dict] = field(default_factory=list)
# The kaggle competition this dataset belongs to, or None if this dataset is not hosted by a Kaggle competition.
kaggle_competition: str | None = None
# The kaggle dataset ID, or None if this dataset is not hosted by Kaggle.
kaggle_dataset_id: str | None = None
# The list of URLs to download.
download_urls: str | list[str] = field(default_factory=list)
# The list of file archives which will be downloaded. If download_urls contains a filename with extension, for
# example https://domain.com/archive.zip, then archive_filenames does not need to be specified.
archive_filenames: str | list[str] = field(default_factory=list)
# The names of files in the dataset (after extraction). Glob-style patterns are supported, see
# https://docs.python.org/3/library/glob.html
dataset_filenames: str | list[str] = field(default_factory=list)
# Set these if the dataset contains separate files for training, validation, or testing. Glob-style patterns are supported,
# see https://docs.python.org/3/library/glob.html
train_filenames: str | list[str] = field(default_factory=list)
validation_filenames: str | list[str] = field(default_factory=list)
test_filenames: str | list[str] = field(default_factory=list)
# If the dataset contains additional referenced files or directories (ex. images or audio) list them here and they
# will be copied to the same location as the processed dataset. Glob-style patterns are supported,
# see https://docs.python.org/3/library/glob.html
preserve_paths: str | list[str] = field(default_factory=list)
# Optionally verify integrity of the dataset by providing sha256 checksums for important files. Maps filename to
# sha256 digest. Use `sha256sum <filename>` on linux, `shasum -a 256 <filename>` on Mac to get checksums.
# If verification fails, loading the dataset will fail with a ValueError.
# If no sha256 digests are in the config, a warning is logged and the dataset will load without verification.
sha256: dict[str, str] = field(default_factory=dict)
# List of column names, for datasets which do not have column names. If specified, will override the column names
# already present in the dataset.
columns: list[dict] = field(default_factory=list)
# Optional dictionary which maps column name to column type. Columns will be converted to the requested type, or
# will be inferred from the dataset by default.
column_types: dict[str, str] = field(default_factory=dict)
# The loader module and class to use, relative to ludwig.datasets.loaders. Only change this if the dataset requires
# processing which is not handled by the default loader.
loader: str = "dataset_loader.DatasetLoader"
================================================
FILE: ludwig/datasets/kaggle.py
================================================
import os
from contextlib import contextmanager
from ludwig.utils.fs_utils import upload_output_directory
def create_kaggle_client():
# Need to import here to prevent Kaggle from authenticating on import
from kaggle import api
return api
@contextmanager
def update_env(**kwargs):
override_env = {k: v for k, v in kwargs.items() if v is not None}
old = os.environ.copy()
try:
os.environ.update(override_env)
yield
finally:
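# Restore in place: rebinding os.environ to a plain dict would stop later changes from reaching the process environment.
os.environ.clear()
os.environ.update(old)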
def download_kaggle_dataset(
download_directory: str,
kaggle_dataset_id: str | None = None,
kaggle_competition: str | None = None,
kaggle_username: str | None = None,
kaggle_key: str | None = None,
):
"""Download all files in a kaggle dataset. One of kaggle_dataset_id or kaggle_competition must be provided.
If the user has not specified credentials in the kaggle.json file, we look up the passed-in username and API key
and perform authentication.
"""
with update_env(KAGGLE_USERNAME=kaggle_username, KAGGLE_KEY=kaggle_key):
# Call authenticate explicitly to pick up new credentials if necessary
api = create_kaggle_client()
api.authenticate()
with upload_output_directory(download_directory) as (tmpdir, _):
if kaggle_competition:
api.competition_download_files(kaggle_competition, path=tmpdir)
else:
api.dataset_download_files(kaggle_dataset_id, path=tmpdir)
return [os.path.join(download_directory, f) for f in os.listdir(download_directory)]
================================================
FILE: ludwig/datasets/loaders/__init__.py
================================================
================================================
FILE: ludwig/datasets/loaders/adult_census_income.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class AdultCensusIncomeLoader(DatasetLoader):
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
if file_path.endswith(".test"):
# The test file contains the line "|1x3 Cross validator" before the CSV content.
return pd.read_csv(file_path, skiprows=1)
return super().load_file_to_dataframe(file_path)
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
processed_df = super().transform_dataframe(dataframe)
processed_df["income"] = processed_df["income"].str.rstrip(".")
processed_df["income"] = processed_df["income"].str.strip()
return processed_df
================================================
FILE: ludwig/datasets/loaders/agnews.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class AGNewsLoader(DatasetLoader):
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
processed_df = super().transform_dataframe(dataframe)
# Maps class_index to class name.
class_names = ["", "world", "sports", "business", "sci_tech"]
# Adds new column 'class' by mapping class indexes to strings.
processed_df["class"] = processed_df.class_index.apply(lambda i: class_names[i])
# Agnews has no validation split, only train and test (0, 2). For convenience, we'll designate the first 5% of
# each class from the training set as the validation set.
val_set_n = int((len(processed_df) * 0.05) // len(class_names)) # rows from each class in validation set.
for ci in range(1, 5):
# For each class, reassign the first val_set_n rows of the training set to validation set.
train_rows = processed_df[(processed_df.split == 0) & (processed_df.class_index == ci)].index
processed_df.loc[train_rows[:val_set_n], "split"] = 1
return processed_df
================================================
FILE: ludwig/datasets/loaders/allstate_claims_severity.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class AllstateClaimsSeverityLoader(DatasetLoader):
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
if os.path.basename(file_path) == "train.csv":
# train.csv has been updated with quoted test rows at the end; don't load these, only load the original
# training set.
return pd.read_csv(file_path, nrows=188319)
if os.path.basename(file_path) == "test.csv":
# we limit the loaded rows for the same reason as the training set.
return pd.read_csv(file_path, nrows=125547)
return super().load_file_to_dataframe(file_path)
================================================
FILE: ludwig/datasets/loaders/camseq.py
================================================
# Copyright (c) 2023 Aizen Corp.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
from ludwig.utils.fs_utils import makedirs
class CamseqLoader(DatasetLoader):
def transform_files(self, file_paths: list[str]) -> list[str]:
if not os.path.exists(self.processed_dataset_dir):
os.makedirs(self.processed_dataset_dir)
# move images and masks into separate directories
source_dir = self.raw_dataset_dir
images_dir = os.path.join(source_dir, "images")
masks_dir = os.path.join(source_dir, "masks")
makedirs(images_dir, exist_ok=True)
makedirs(masks_dir, exist_ok=True)
data_files = []
for f in os.listdir(source_dir):
if f.endswith("_L.png"): # masks
dest_file = os.path.join(masks_dir, f)
elif f.endswith(".png"): # images
dest_file = os.path.join(images_dir, f)
else:
continue
source_file = os.path.join(source_dir, f)
os.replace(source_file, dest_file)
data_files.append(dest_file)
return super().transform_files(data_files)
def load_unprocessed_dataframe(self, file_paths: list[str]) -> pd.DataFrame:
"""Creates a dataframe of image paths and mask paths."""
images_dir = os.path.join(self.processed_dataset_dir, "images")
masks_dir = os.path.join(self.processed_dataset_dir, "masks")
images = []
masks = []
for f in os.listdir(images_dir):
images.append(os.path.join(images_dir, f))
mask_f = f[:-4] + "_L.png"
masks.append(os.path.join(masks_dir, mask_f))
return pd.DataFrame({"image_path": images, "mask_path": masks})
================================================
FILE: ludwig/datasets/loaders/code_alpaca_loader.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class CodeAlpacaLoader(DatasetLoader):
"""The Code Alpaca dataset."""
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe."""
df = pd.read_json(file_path)
return df
================================================
FILE: ludwig/datasets/loaders/consumer_complaints_loader.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class ConsumerComplaintsLoader(DatasetLoader):
"""The Consumer Complaints dataset."""
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe."""
consumer_complaints_df = pd.read_csv(file_path)
consumer_complaints_df = preprocess_df(consumer_complaints_df)
return consumer_complaints_df
def preprocess_df(df):
"""Preprocesses the dataframe.
- Remove all rows with missing values in the following columns:
- Consumer complaint narrative
- Issue
- Product
Args:
df (pd.DataFrame): The dataframe to preprocess.
Returns:
pd.DataFrame: The preprocessed dataframe.
"""
return df.dropna(subset=["Consumer complaint narrative", "Issue", "Product"])
================================================
FILE: ludwig/datasets/loaders/creditcard_fraud.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class CreditCardFraudLoader(DatasetLoader):
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
processed_df = super().transform_dataframe(dataframe)
# Train/Test split like https://www.kaggle.com/competitions/1056lab-fraud-detection-in-credit-card/overview
processed_df = processed_df.sort_values(by=["Time"])
processed_df.loc[:198365, "split"] = 0
processed_df.loc[198365:, "split"] = 2
processed_df.split = processed_df.split.astype(int)
return processed_df
================================================
FILE: ludwig/datasets/loaders/dataset_loader.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from __future__ import annotations
import glob
import hashlib
import logging
import os
import shutil
import urllib
from enum import Enum
from urllib.parse import urlparse
import pandas as pd
from tqdm import tqdm
from ludwig.api_annotations import DeveloperAPI, PublicAPI
from ludwig.constants import SPLIT
from ludwig.datasets.archives import extract_archive, is_archive, list_archive
from ludwig.datasets.dataset_config import DatasetConfig, DatasetFallbackMirror
from ludwig.datasets.kaggle import download_kaggle_dataset
from ludwig.datasets.utils import model_configs_for_dataset
from ludwig.utils.fs_utils import get_default_cache_location, get_fs_and_path
from ludwig.utils.strings_utils import make_safe_filename
logger = logging.getLogger(__name__)
@DeveloperAPI
class TqdmUpTo(tqdm):
"""Provides progress bar for `urlretrieve`.
Taken from: https://gist.github.com/leimao/37ff6e990b3226c2c9670a2cd1e4a6f5
"""
def update_to(self, b=1, bsize=1, tsize=None):
"""
b : int, optional
Number of blocks transferred so far [default: 1].
bsize : int, optional
Size of each block (in tqdm units) [default: 1].
tsize : int, optional
Total size (in tqdm units). If [default: None] remains unchanged.
"""
if tsize is not None:
self.total = tsize # noqa W0201
self.update(b * bsize - self.n) # will also set self.n = b * bsize
def _list_of_strings(list_or_string: str | list[str]) -> list[str]:
"""Helper function to accept single string or lists in config."""
return [list_or_string] if isinstance(list_or_string, str) else list_or_string
def _glob_multiple(pathnames: list[str], root_dir: str | None = None, recursive: bool = True) -> set[str]:
"""Recursively globs multiple patterns and returns the set of matches.
Note: glob's root_dir argument was added in python 3.10, not using it for compatibility.
"""
if root_dir:
pathnames = [os.path.join(root_dir, p) for p in pathnames]
return set().union(*[glob.glob(p, recursive=recursive) for p in pathnames])
def _sha256_digest(file_path) -> str:
"""Returns the sha256 digest for the specified file."""
hash = hashlib.sha256()
buffer = bytearray(hash.block_size * 1024) # 64KB buffer, a multiple of the SHA-256 block size (64 bytes).
mv = memoryview(buffer)
with open(file_path, "rb", buffering=0) as f:
for bytes_read in iter(lambda: f.readinto(mv), 0):
hash.update(mv[:bytes_read])
return hash.hexdigest()
@PublicAPI
class DatasetState(int, Enum):
"""The state of the dataset."""
NOT_LOADED = 0
DOWNLOADED = 1
EXTRACTED = 2
TRANSFORMED = 3
@PublicAPI
class DatasetLoader:
"""Base class that defines the default pipeline for loading a ludwig dataset.
Clients will typically call load(), which processes the dataset according to the config.
A dataset is processed in 4 phases:
1. Download - The dataset files are downloaded to the cache.
2. Verify - Hashes of downloaded files are verified.
3. Extract - The dataset files are extracted from an archive (may be a no-op if data is not archived).
4. Transform - The dataset is transformed into a format usable for training and is ready to load.
a. Transform Files (Files -> Files)
b. Load Dataframe (Files -> DataFrame)
c. Transform Dataframe (DataFrame -> DataFrame)
d. Save Processed (DataFrame -> File)
The download and extract phases are run for each URL based on the URL type and file extension. After extraction, the
full set of downloaded and extracted files are collected and passed as a list to the transform stage.
The transform phase offers customization points for datasets which require preprocessing before they are usable for
training.
"""
def __init__(self, config: DatasetConfig, cache_dir: str | None = None):
"""Constructor."""
self.config = config
self.cache_dir = cache_dir if cache_dir else get_default_cache_location()
@property
def name(self):
"""The name of the dataset."""
return self.config.name
@property
def version(self):
"""The version of the dataset."""
return self.config.version
@property
def is_kaggle_dataset(self) -> bool:
return bool(self.config.kaggle_dataset_id or self.config.kaggle_competition)
@property
def download_dir(self) -> str:
"""Directory where all dataset artifacts are saved."""
return os.path.join(self.cache_dir, f"{self.name}_{self.version}")
@property
def raw_dataset_dir(self) -> str:
"""Save path for raw data downloaded from the web."""
return os.path.join(self.download_dir, "raw")
@property
def processed_dataset_dir(self) -> str:
"""Save path for processed data."""
return os.path.join(self.download_dir, "processed")
@property
def processed_dataset_filename(self) -> str:
"""Filename for processed data."""
return f"{make_safe_filename(self.config.name)}.parquet"
@property
def processed_dataset_path(self) -> str:
"""Save path to processed dataset file."""
return os.path.join(self.processed_dataset_dir, self.processed_dataset_filename)
@property
def processed_temp_dir(self) -> str:
"""Save path for processed temp data."""
return os.path.join(self.download_dir, "_processed")
@property
def state(self) -> DatasetState:
"""Dataset state."""
if os.path.exists(self.processed_dataset_path):
return DatasetState.TRANSFORMED
if all([os.path.exists(os.path.join(self.raw_dataset_dir, filename)) for filename in self.download_filenames]):
archive_filenames = [f for f in self.download_filenames if is_archive(f)]
if archive_filenames:
# Check to see if archive has been extracted.
extracted_files = [
f for a in archive_filenames for f in list_archive(os.path.join(self.raw_dataset_dir, a))
]
if all(os.path.exists(os.path.join(self.raw_dataset_dir, ef)) for ef in extracted_files):
return DatasetState.EXTRACTED
else:
return DatasetState.DOWNLOADED
# If none of the dataset download files are archives, skip extraction phase.
return DatasetState.EXTRACTED
return DatasetState.NOT_LOADED
@property
def download_urls(self) -> list[str]:
return _list_of_strings(self.config.download_urls)
@property
def download_filenames(self) -> list[str]:
"""Filenames for downloaded files inferred from download_urls."""
if self.config.archive_filenames:
return _list_of_strings(self.config.archive_filenames)
return [os.path.basename(urlparse(url).path) for url in self.download_urls]
@staticmethod
def get_mirror_download_paths(mirror: DatasetFallbackMirror):
"""Download paths for the given fallback mirror, normalized to a list of strings."""
return _list_of_strings(mirror.download_paths)
def get_mirror_download_filenames(self, mirror: DatasetFallbackMirror):
"""Filenames for downloaded files inferred from mirror download_paths."""
if self.config.archive_filenames:
return _list_of_strings(self.config.archive_filenames)
return [os.path.basename(path) for path in _list_of_strings(mirror.download_paths)]
def description(self) -> str:
"""Returns human-readable description of the dataset."""
return f"{self.config.name} {self.config.version}\n{self.config.description}"
@property
def model_configs(self) -> dict[str, dict]:
"""Returns a dictionary of built-in model configs for this dataset."""
return model_configs_for_dataset(self.config.name)
@property
def best_model_config(self) -> dict | None:
"""Returns the best built-in model config for this dataset, or None."""
return self.model_configs.get("best")
@property
def default_model_config(self) -> dict | None:
"""Returns the default built-in model config for this dataset.
This is a good first model which should train in under 10 minutes on a current laptop without GPU acceleration.
"""
return self.model_configs.get("default")
def _get_preserved_paths(self, root_dir=None):
"""Gets list of files to preserve when exporting dataset, not including self.processed_dataset_path.
Returns paths relative to the dataset root directory.
"""
root_dir = root_dir if root_dir else self.processed_dataset_dir
preserved_paths = _glob_multiple(_list_of_strings(self.config.preserve_paths), root_dir=root_dir)
return [os.path.relpath(p, start=root_dir) for p in preserved_paths]
def export(self, output_directory: str) -> None:
"""Exports the dataset (and any files required by it) into the specified directory."""
self._download_and_process()
os.makedirs(output_directory, exist_ok=True)
shutil.copy2(self.processed_dataset_path, os.path.join(output_directory, self.processed_dataset_filename))
preserve_paths = self._get_preserved_paths()
for relative_path in preserve_paths:
source = os.path.join(self.processed_dataset_dir, relative_path)
destination = os.path.join(output_directory, relative_path)
if os.path.isdir(source):
shutil.copytree(source, destination, symlinks=False, dirs_exist_ok=True)
else:
shutil.copy2(source, destination)
def _download_and_process(self, kaggle_username: str | None = None, kaggle_key: str | None = None):
"""Loads the dataset, downloading and processing it if needed.
If dataset is already processed, does nothing.
"""
if self.state == DatasetState.NOT_LOADED:
try:
self.download(kaggle_username=kaggle_username, kaggle_key=kaggle_key)
except Exception as e:
logger.warning(
f"Finding fallback mirrors to download the dataset. Downloading from "
f"the original source failed with the following error {e}."
)
if not self.config.fallback_mirrors:
logger.exception(f"No fallback mirror found. Failed to download dataset {self.config.name}.")
else:
self.download_from_fallback_mirrors()
self.verify()
if self.state == DatasetState.DOWNLOADED:
# Extract dataset
try:
self.extract()
except Exception:
logger.exception("Failed to extract dataset")
if self.state == DatasetState.EXTRACTED:
# Transform dataset
try:
self.transform()
except Exception:
logger.exception("Failed to transform dataset")
def load(
self, kaggle_username: str | None = None, kaggle_key: str | None = None, split: bool = False
) -> pd.DataFrame | tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
"""Loads the dataset, downloading and processing it if needed.
Note: This method is also responsible for splitting the data, returning a single dataframe if split=False, and a
3-tuple of train, val, test if split=True.
:param kaggle_username: (str) username on Kaggle platform
:param kaggle_key: (str) dataset key on Kaggle platform
:param split: (bool) splits dataset along 'split' column if present. The split column should always have values
0: train, 1: validation, 2: test.
"""
self._download_and_process(kaggle_username=kaggle_username, kaggle_key=kaggle_key)
if self.state == DatasetState.TRANSFORMED:
dataset_df = self.load_transformed_dataset()
if split:
return self.split(dataset_df)
else:
return dataset_df
def download(self, kaggle_username: str | None = None, kaggle_key: str | None = None) -> list[str]:
if not os.path.exists(self.raw_dataset_dir):
os.makedirs(self.raw_dataset_dir)
if self.is_kaggle_dataset:
return download_kaggle_dataset(
self.raw_dataset_dir,
kaggle_dataset_id=self.config.kaggle_dataset_id,
kaggle_competition=self.config.kaggle_competition,
kaggle_username=kaggle_username,
kaggle_key=kaggle_key,
)
else:
for url, filename in zip(self.download_urls, self.download_filenames):
downloaded_file_path = os.path.join(self.raw_dataset_dir, filename)
with TqdmUpTo(unit="B", unit_scale=True, unit_divisor=1024, miniters=1, desc=filename) as t:
urllib.request.urlretrieve(url, downloaded_file_path, t.update_to)
def download_from_fallback_mirrors(self):
for mirror in self.config.fallback_mirrors:
logger.info(f"Attempting download from mirror {mirror.name}.")
try:
download_paths = self.get_mirror_download_paths(mirror)
filenames = self.get_mirror_download_filenames(mirror)
for path, filename in zip(download_paths, filenames):
downloaded_file_path = os.path.join(self.raw_dataset_dir, filename)
with TqdmUpTo(unit="B", unit_scale=True, unit_divisor=1024, miniters=1, desc=filename):
fs, path = get_fs_and_path(path)
fs.get(path, downloaded_file_path)
return
except Exception:
logger.exception(f"Download from mirror `{mirror.name}` failed.")
def verify(self) -> None:
"""Verifies checksums for dataset."""
for filename, sha256sum in self.config.sha256.items():
digest = _sha256_digest(os.path.join(self.raw_dataset_dir, filename))
if digest != sha256sum:
raise ValueError(f"Checksum mismatch for file {filename} of {self.config.name} dataset")
if not self.config.sha256:
logger.warning(f"No sha256 digest provided for dataset {self.config.name}, cannot verify.")
logger.info("Contents:")
for filename in os.listdir(self.raw_dataset_dir):
path = os.path.join(self.raw_dataset_dir, filename)
if not os.path.isdir(path):
digest = _sha256_digest(path)
logger.info(f" {filename}: {digest}")
def extract(self) -> list[str]:
extracted_files = set()
for download_filename in self.download_filenames:
download_path = os.path.join(self.raw_dataset_dir, download_filename)
if is_archive(download_path):
extracted_files.update(extract_archive(download_path))
# If the archive contains archives, extract those too. For example, bnp_claims_management.
archive_contents = extracted_files.copy()
for extracted_file in archive_contents:
extracted_path = os.path.join(self.raw_dataset_dir, extracted_file)
if is_archive(extracted_path):
try:
extracted_files.update(extract_archive(extracted_path))
except RuntimeError as e:
logger.warning(f"Error extracting {extracted_file}: {e}")
return list(extracted_files)
def transform(self) -> None:
data_filenames = [
os.path.join(self.raw_dataset_dir, f) for f in os.listdir(self.raw_dataset_dir) if not is_archive(f)
]
transformed_files = self.transform_files(data_filenames)
unprocessed_dataframe = self.load_unprocessed_dataframe(transformed_files)
transformed_dataframe = self.transform_dataframe(unprocessed_dataframe)
self.save_processed(transformed_dataframe)
def transform_files(self, file_paths: list[str]) -> list[str]:
"""Transform data files before loading to dataframe.
Subclasses should override this method to process files before loading dataframe, calling the base class
implementation after transformation if the results of transformation are needed by preserve_paths.
"""
data_files = [p for p in file_paths if not os.path.isdir(p)]
if not os.path.exists(self.processed_dataset_dir):
os.makedirs(self.processed_dataset_dir)
# Moves any preserved paths (ex. image directories) into processed directory to avoid unnecessary copy.
for rel_path in self._get_preserved_paths(self.raw_dataset_dir):
source_path = os.path.join(self.raw_dataset_dir, rel_path)
dest_path = os.path.join(self.processed_dataset_dir, rel_path)
if os.path.exists(source_path) and not os.path.exists(dest_path):
os.replace(source_path, dest_path)
return data_files
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe.
Supports json, jsonl, tsv, csv and parquet files. Subclasses may override this method to support other input
formats.
"""
file_extension = os.path.splitext(file_path)[-1].lower()
if file_extension == ".json":
return pd.read_json(file_path)
elif file_extension == ".jsonl":
return pd.read_json(file_path, lines=True)
elif file_extension == ".tsv":
return pd.read_table(file_path)
elif file_extension in {".csv", ".data"}:
return pd.read_csv(file_path)
elif file_extension in {".parquet", ".pq", ".pqt"}:
return pd.read_parquet(file_path)
else:
raise ValueError(f"Unsupported dataset file type: {file_extension}")
def load_files_to_dataframe(self, file_paths: list[str], root_dir=None) -> pd.DataFrame:
"""Loads a file or list of files and returns a dataframe.
Subclasses may override this method to change the loader's behavior for groups of files.
"""
if root_dir:
file_paths = [os.path.join(root_dir, path) for path in file_paths]
dataframes = [self.load_file_to_dataframe(path) for path in file_paths]
try:
if self.config.columns:
column_names = [column["name"] for column in self.config.columns]
set_cols_dfs = []
for df in dataframes:
# Split column is not included in configs, add in if pre-set split is present
if SPLIT in df.columns:
column_names.append(SPLIT)
# If the number of columns in the dataframe does not match the number of columns in the config,
# then the dataframe likely has an extra column that we don't want - i.e. "Unnamed: 0".
if len(column_names) != len(df.columns):
df = df[column_names]
set_cols_dfs.append(df.set_axis(column_names, axis=1))
return pd.concat(set_cols_dfs, ignore_index=True)
else:
return pd.concat(dataframes, ignore_index=True)
except ValueError as e:
logger.warning(f"Error setting column names: {e}")
return pd.concat(dataframes, ignore_index=True)
def load_unprocessed_dataframe(self, file_paths: list[str]) -> pd.DataFrame:
"""Load dataset files into a dataframe.
Will use the list of data files in the dataset directory as a default if all of config's dataset_filenames,
train_filenames, validation_filenames, test_filenames are empty.
"""
dataset_paths = _glob_multiple(_list_of_strings(self.config.dataset_filenames), root_dir=self.raw_dataset_dir)
train_paths = _glob_multiple(_list_of_strings(self.config.train_filenames), root_dir=self.raw_dataset_dir)
validation_paths = _glob_multiple(
_list_of_strings(self.config.validation_filenames), root_dir=self.raw_dataset_dir
)
test_paths = _glob_multiple(_list_of_strings(self.config.test_filenames), root_dir=self.raw_dataset_dir)
if self.config.name == "hugging_face":
dataframes = self._get_dataframe_with_fixed_splits_from_hf()
else:
dataframes = self._get_dataframe_with_fixed_splits(
train_paths, validation_paths, test_paths, dataset_paths, file_paths
)
return pd.concat(dataframes, ignore_index=True)
def _get_dataframe_with_fixed_splits_from_hf(self):
dataframes = []
splits = ["train", "validation", "test"]
data_dict = self.load_hf_to_dict(
self.config.huggingface_dataset_id, self.config.huggingface_subsample
) # This function is defined in the Hugging Face dataloader
for split_type in splits:
if split_type in data_dict:
# We don't have to do anything if split not in data_dict because we just concatenate the dataframes
# in the end anyway.
data_dict[split_type][SPLIT] = splits.index(split_type) # Add "split" column (0, 1, or 2)
dataframes.append(data_dict[split_type])
return dataframes
def _get_dataframe_with_fixed_splits(self, train_paths, validation_paths, test_paths, dataset_paths, file_paths):
dataframes = []
if len(train_paths) > 0:
train_df = self.load_files_to_dataframe(train_paths)
train_df[SPLIT] = 0
dataframes.append(train_df)
if len(validation_paths) > 0:
validation_df = self.load_files_to_dataframe(validation_paths)
validation_df[SPLIT] = 1
dataframes.append(validation_df)
if len(test_paths) > 0:
test_df = self.load_files_to_dataframe(test_paths)
test_df[SPLIT] = 2
dataframes.append(test_df)
# If we have neither train/validation/test files nor dataset_paths in the config,
# use data files in root dir.
if len(dataset_paths) == len(dataframes) == 0:
dataset_paths = file_paths
if len(dataset_paths) > 0:
dataframes.append(self.load_files_to_dataframe(dataset_paths))
return dataframes
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
"""Transforms a dataframe of the entire dataset.
Subclasses should override this method if transformation of the dataframe is needed.
"""
for column_name, type in self.config.column_types.items():
dataframe[column_name] = dataframe[column_name].astype(type)
return dataframe
def save_processed(self, dataframe: pd.DataFrame) -> None:
"""Saves transformed dataframe as a flat file ludwig can load for training."""
if not os.path.exists(self.processed_dataset_dir):
os.makedirs(self.processed_dataset_dir)
dataframe.to_parquet(self.processed_dataset_path, engine="pyarrow")
def load_transformed_dataset(self) -> pd.DataFrame:
"""Load processed dataset into a dataframe."""
return pd.read_parquet(self.processed_dataset_path)
def get_mtime(self) -> float:
"""Last modified time of the processed dataset after downloading successfully."""
return os.path.getmtime(self.processed_dataset_path)
@staticmethod
def split(dataset: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
if SPLIT in dataset:
dataset[SPLIT] = pd.to_numeric(dataset[SPLIT])
training_set = dataset[dataset[SPLIT] == 0].drop(columns=[SPLIT])
val_set = dataset[dataset[SPLIT] == 1].drop(columns=[SPLIT])
test_set = dataset[dataset[SPLIT] == 2].drop(columns=[SPLIT])
return training_set, val_set, test_set
else:
raise ValueError(f"The dataset does not have a '{SPLIT}' column, load with `split=False`")
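# Illustrative sketch (not part of the original module): how the sequential
# state checks in _download_and_process let one call fall through download,
# extract and transform. `ToyLoader` and its in-memory `artifacts` set are
# hypothetical stand-ins for DatasetLoader and the files it checks on disk.

```python
from enum import IntEnum


class State(IntEnum):
    NOT_LOADED = 0
    DOWNLOADED = 1
    EXTRACTED = 2
    TRANSFORMED = 3


class ToyLoader:
    """Hypothetical stand-in that tracks artifacts in a set instead of on disk."""

    def __init__(self, has_archive):
        self.has_archive = has_archive
        self.artifacts = set()
        self.calls = []

    @property
    def state(self):
        if "processed" in self.artifacts:
            return State.TRANSFORMED
        if "raw" in self.artifacts:
            if self.has_archive and "extracted" not in self.artifacts:
                return State.DOWNLOADED
            # No archive, or archive already unpacked: extraction is skipped.
            return State.EXTRACTED
        return State.NOT_LOADED

    def _step(self, name, artifact):
        self.calls.append(name)
        self.artifacts.add(artifact)

    def run(self):
        # Each check re-reads state, so a single call can pass through
        # several phases in order.
        if self.state == State.NOT_LOADED:
            self._step("download", "raw")
        if self.state == State.DOWNLOADED:
            self._step("extract", "extracted")
        if self.state == State.EXTRACTED:
            self._step("transform", "processed")
        return self.state


archived = ToyLoader(has_archive=True)
plain = ToyLoader(has_archive=False)
archived.run()
plain.run()
plain.run()  # already TRANSFORMED, so nothing is repeated

print(archived.calls)  # ['download', 'extract', 'transform']
print(plain.calls)     # ['download', 'transform']
```

# Deriving the state from what exists rather than storing it is what makes a
# repeated call a no-op once the processed file is present.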
================================================
FILE: ludwig/datasets/loaders/ethos_binary.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd

from ludwig.datasets.loaders.dataset_loader import DatasetLoader


class EthosBinaryLoader(DatasetLoader):
    def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
        # This dataset uses ; separator instead of ,
        return pd.read_csv(file_path, sep=";")

    def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        processed_df = super().transform_dataframe(dataframe)
        # convert float labels (0.0, 1.0) to binary labels
        processed_df["isHate"] = processed_df["isHate"] >= 0.5
        processed_df["isHate"] = processed_df["isHate"].astype(int)
        return processed_df
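# Illustrative sketch (not part of the original module): the same two steps,
# reading a semicolon-delimited file and thresholding the label at 0.5, done
# with the standard library csv module as a stand-in for pandas. The `comment`
# column name and the sample rows are assumptions made up for this example.

```python
import csv
import io

raw = "comment;isHate\nfirst example;0.0\nsecond example;0.83\nthird example;0.5\n"

# delimiter=";" plays the role of sep=";" in pd.read_csv.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")

rows = []
for record in reader:
    # Scores at or above 0.5 become 1, everything else 0.
    label = int(float(record["isHate"]) >= 0.5)
    rows.append({"comment": record["comment"], "isHate": label})

labels = [row["isHate"] for row in rows]
print(labels)  # [0, 1, 1]
```

# The comparison is inclusive, so a score of exactly 0.5 is labeled 1.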
================================================
FILE: ludwig/datasets/loaders/flickr8k.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import re
from collections import defaultdict

from ludwig.datasets.loaders.dataset_loader import DatasetLoader


class Flickr8kLoader(DatasetLoader):
    def transform_files(self, file_paths: list[str]) -> list[str]:
        # create a dictionary matching image_path --> list of captions
        image_to_caption = defaultdict(list)
        with open(f"{self.raw_dataset_dir}/Flickr8k.token.txt") as captions_file:
            for line in captions_file:
                line = line.split("#")
                # strip the caption index and surrounding whitespace, then escape and
                # quote the caption so it fits properly in a csv field
                line[1] = line[1].strip("\n01234.\t ")
                line[1] = re.sub('"', '""', line[1])
                line[1] = '"' + line[1] + '"'
                image_to_caption[line[0]].append(line[1])
        # create csv file with 7 columns: image_path, 5 captions, and split
        with open(os.path.join(self.raw_dataset_dir, "flickr8k_dataset.csv"), "w") as output_file:
            output_file.write("image_path,caption0,caption1,caption2,")
            output_file.write("caption3,caption4,split\n")
            splits = ["train", "dev", "test"]
            for i in range(len(splits)):
                split = splits[i]
                with open(f"{self.raw_dataset_dir}/Flickr_8k.{split}Images.txt") as split_file:
                    for image_name in split_file:
                        image_name = image_name.strip("\n")
                        if image_name in image_to_caption:
                            output_file.write(
                                "{},{},{},{},{},{},{}\n".format(
                                    # Note: image folder is named Flicker8k_Dataset
                                    f"{self.raw_dataset_dir}/Flicker8k_Dataset/{image_name}",
                                    *image_to_caption[image_name],
                                    i,
                                )
                            )
        return super().transform_files(file_paths)
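# Illustrative sketch (not part of the original module): the caption clean-up
# from transform_files applied to one line, followed by a round trip through
# csv.reader to show that doubling the quotes produces a valid csv field.
# `format_caption` and the sample line are hypothetical.

```python
import csv
import io
import re


def format_caption(line):
    """Split 'image#index<TAB>caption' and return (image_name, quoted_caption)."""
    parts = line.split("#")
    # Removes the leading caption index (0-4), the tab, and trailing " .\n".
    caption = parts[1].strip("\n01234.\t ")
    # csv escapes a double quote inside a quoted field by doubling it.
    caption = re.sub('"', '""', caption)
    return parts[0], '"' + caption + '"'


sample = 'img1.jpg#0\tA dog says "woof" .\n'
image_name, quoted = format_caption(sample)

csv_line = f"{image_name},{quoted}\n"
parsed = next(csv.reader(io.StringIO(csv_line)))
print(parsed)  # ['img1.jpg', 'A dog says "woof"']
```

# Two caveats of this approach: strip() removes any run of the listed
# characters at either end, so a caption that ends in a digit from 0 to 4
# loses that digit, and only the text between the first and second "#" is kept.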
================================================
FILE: ludwig/datasets/loaders/forest_cover.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from sklearn.model_selection import train_test_split
from ludwig.datasets.dataset_config import DatasetConfig
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class ForestCoverLoader(DatasetLoader):
def __init__(self, config: DatasetConfig, cache_dir: str | None = None, use_tabnet_split=True):
super().__init__(config, cache_dir=cache_dir)
self.use_tabnet_split = use_tabnet_split
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
df = super().transform_dataframe(dataframe)
# Elevation quantitative meters Elevation in meters
# Aspect quantitative azimuth Aspect in degrees azimuth
# Slope quantitative degrees Slope in degrees
# Horizontal_Distance_To_Hydrology quantitative meters Horz Dist to nearest surface water features # noqa: E501
# Vertical_Distance_To_Hydrology quantitative meters Vert Dist to nearest surface water features # noqa: E501
# Horizontal_Distance_To_Roadways quantitative meters Horz Dist to nearest roadway # noqa: E501
# Hillshade_9am quantitative 0 to 255 index Hillshade index at 9am, summer solstice # noqa: E501
# Hillshade_Noon quantitative 0 to 255 index Hillshade index at noon, summer solstice # noqa: E501
# Hillshade_3pm quantitative 0 to 255 index Hillshade index at 3pm, summer solstice # noqa: E501
# Horizontal_Distance_To_Fire_Points quantitative meters Horz Dist to nearest wildfire ignition points # noqa: E501
# Wilderness_Area (4 binary columns) qualitative 0 (absence) or 1 (presence) Wilderness area designation # noqa: E501
# Soil_Type (40 binary columns) qualitative 0 (absence) or 1 (presence) Soil Type designation
# Cover_Type (7 types) integer 1 to 7 Forest Cover Type designation # noqa: E501
# Map the 40 soil types to a single integer instead of 40 binary columns
st_cols = [f"Soil_Type_{i}" for i in range(1, 41)]
st_vals = []
for _, row in df[st_cols].iterrows():
st_vals.append(row.to_numpy().nonzero()[0].item(0))
df = df.drop(columns=st_cols)
df["Soil_Type"] = st_vals
# Map the 4 wilderness areas to a single integer
# instead of 4 binary columns
wa_cols = ["Wilderness_Area_1", "Wilderness_Area_2", "Wilderness_Area_3", "Wilderness_Area_4"]
wa_vals = []
for _, row in df[wa_cols].iterrows():
wa_vals.append(row.to_numpy().nonzero()[0].item(0))
df = df.drop(columns=wa_cols)
df["Wilderness_Area"] = wa_vals
if not self.use_tabnet_split:
# first 11340 records used for training data subset
# next 3780 records used for validation data subset
# last 565892 records used for testing data subset
df["split"] = [0] * 11340 + [1] * 3780 + [2] * 565892
else:
# Split used in the tabNet paper
# https://github.com/google-research/google-research/blob/master/tabnet/download_prepare_covertype.py
train_val_indices, test_indices = train_test_split(range(len(df)), test_size=0.2, random_state=0)
train_indices, val_indices = train_test_split(train_val_indices, test_size=0.2 / 0.6, random_state=0)
df["split"] = 0
df.loc[val_indices, "split"] = 1
df.loc[test_indices, "split"] = 2
return df
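# Illustrative sketch (not part of the original module): the two ideas in
# transform_dataframe in plain Python. First, collapsing one-hot columns to the
# 0-based position of the set bit; second, the arithmetic behind both split
# modes. `one_hot_to_index` is a hypothetical stand-in for
# row.to_numpy().nonzero()[0].item(0).

```python
def one_hot_to_index(row):
    """Return the 0-based position of the first non-zero entry."""
    return next(i for i, value in enumerate(row) if value != 0)


rows = [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
indices = [one_hot_to_index(r) for r in rows]
print(indices)  # [2, 0, 3]

# Fixed split: the three run lengths must add up to the number of rows.
fixed_sizes = {"train": 11340, "validation": 3780, "test": 565892}
total_rows = sum(fixed_sizes.values())
print(total_rows)  # 581012

# TabNet-style split: 20% is held out for test, then 0.2 / 0.6 (one third)
# of the remaining 80% becomes validation.
test_fraction = 0.2
val_fraction = (1 - test_fraction) * (0.2 / 0.6)
train_fraction = 1 - test_fraction - val_fraction
print(round(train_fraction, 4), round(val_fraction, 4), test_fraction)
```

# Because the index is 0-based, Soil_Type_1 maps to 0 and Soil_Type_40 to 39.
# The fractions are approximate for real data, since the splitter rounds to
# whole rows.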
================================================
FILE: ludwig/datasets/loaders/goemotions.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd

from ludwig.datasets.loaders.dataset_loader import DatasetLoader


class GoEmotionsLoader(DatasetLoader):
    def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        processed_df = super().transform_dataframe(dataframe)
        # Format emotion IDs as space-delimited string (Set).
        processed_df["emotion_ids"] = processed_df["emotion_ids"].apply(lambda e_id: " ".join(e_id.split(",")))
        return processed_df
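# Illustrative sketch (not part of the original module): what the lambda in
# transform_dataframe does to individual values. `to_set_feature` is a
# hypothetical name and the sample ids are made up.

```python
def to_set_feature(emotion_ids):
    """Turn a comma-separated id string into a space-delimited one."""
    return " ".join(emotion_ids.split(","))


examples = ["27", "2,5", "0,13,21"]
converted = [to_set_feature(e) for e in examples]
print(converted)  # ['27', '2 5', '0 13 21']
```

# A value with a single id has no comma and passes through unchanged.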
================================================
FILE: ludwig/datasets/loaders/higgs.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd

from ludwig.datasets.dataset_config import DatasetConfig
from ludwig.datasets.loaders.dataset_loader import DatasetLoader


class HiggsLoader(DatasetLoader):
    def __init__(self, config: DatasetConfig, cache_dir: str | None = None, add_validation_set=True):
        super().__init__(config, cache_dir)
        self.add_validation_set = add_validation_set

    def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
        """Loads a file into a dataframe."""
        return pd.read_csv(file_path, header=None)

    def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        processed_df = super().transform_dataframe(dataframe)
        if self.add_validation_set:
            processed_df["split"] = [0] * 10000000 + [1] * 500000 + [2] * 500000
        else:
            processed_df["split"] = [0] * 10500000 + [2] * 500000
        return processed_df
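# Illustrative sketch (not part of the original module): the positional split
# column is a concatenation of label runs, and both configurations cover the
# same 11,000,000 rows. `split_column` is a hypothetical helper; the toy sizes
# keep the example small.

```python
def split_column(sizes):
    """Build a split column from (label, row_count) runs, in row order."""
    column = []
    for label, count in sizes:
        column.extend([label] * count)
    return column


with_validation = [(0, 10_000_000), (1, 500_000), (2, 500_000)]
without_validation = [(0, 10_500_000), (2, 500_000)]
totals = [sum(count for _, count in sizes) for sizes in (with_validation, without_validation)]
print(totals)  # [11000000, 11000000]

toy = split_column([(0, 4), (1, 1), (2, 1)])
print(toy)  # [0, 0, 0, 0, 1, 2]
```

# The assignment relies on row order, so it assumes the file is loaded
# unshuffled, and the run lengths have to add up to the number of rows for the
# column assignment to succeed.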
================================================
FILE: ludwig/datasets/loaders/hugging_face.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from __future__ import annotations
import logging
import datasets
import pandas as pd
from ludwig.constants import TEST, TRAIN, VALIDATION
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
SPLITS = [TRAIN, VALIDATION, TEST]
logger = logging.getLogger(__name__)
class HFLoader(DatasetLoader):
"""HFLoader differs from all other DatasetLoaders because of how it loads data through the Hugging Face
datasets API instead of saving any files to the cache.
The config for HFLoader contains two unique parameters, huggingface_dataset_id and huggingface_subsample, that
identify which dataset and which subsample of that dataset to load in.
"""
@staticmethod
def load_hf_to_dict(hf_id: str | None = None, hf_subsample: str | None = None) -> dict[str, pd.DataFrame]:
"""Returns a map of split -> pd.DataFrame for the given HF dataset.
:param hf_id: (str) path to dataset on HuggingFace platform
:param hf_subsample: (str) name of dataset configuration on HuggingFace platform
"""
dataset_dict: dict[str, datasets.Dataset] = datasets.load_dataset(path=hf_id, name=hf_subsample)
pandas_dict = {}
for split in dataset_dict:
# Convert from HF DatasetDict type to a dictionary of pandas dataframes
pandas_dict[split] = dataset_dict[split].to_pandas()
return pandas_dict
# TODO(Alex): Standardize load() signature as interface method in DatasetLoader and adhere to it in all subclasses.
def load(
self, hf_id: str | None = None, hf_subsample: str | None = None, split: bool = False
) -> pd.DataFrame | tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
"""When load() is called, HFLoader calls the datasets API to return all of the data in a HuggingFace
DatasetDict, converts it to a dictionary of pandas dataframes, and returns either three dataframes
containing train, validation, and test data or one dataframe that is the concatenation of all three
depending on whether `split` is set to True or False.
:param split: (bool) directive for how to interpret if dataset contains validation or test set (see below)
Note that some datasets may not provide a validation set or a test set. In this case:
- If split is True, the DataFrames corresponding to the missing sets are initialized to be empty
- If split is False, the "split" column in the resulting DataFrame will reflect the fact that there is no
validation/test split (i.e., there will be no 1s/2s)
A train set should always be provided by Hugging Face.
:param hf_id: (str) path to dataset on HuggingFace platform
:param hf_subsample: (str) name of dataset configuration on HuggingFace platform
"""
self.config.huggingface_dataset_id = hf_id
self.config.huggingface_subsample = hf_subsample
pandas_dict = self.load_hf_to_dict(
hf_id=hf_id,
hf_subsample=hf_subsample,
)
if split: # For each split, either return the appropriate dataframe or an empty dataframe
for spl in SPLITS:
if spl not in pandas_dict:
logger.warning(f"No {spl} set found in provided Hugging Face dataset. Skipping {spl} set.")
train_df = pandas_dict[TRAIN] if TRAIN in pandas_dict else pd.DataFrame()
validation_df = pandas_dict[VALIDATION] if VALIDATION in pandas_dict else pd.DataFrame()
test_df = pandas_dict[TEST] if TEST in pandas_dict else pd.DataFrame()
return train_df, validation_df, test_df
else:
dataset_list = []
for spl in pandas_dict:
pandas_dict[spl]["split"] = SPLITS.index(spl) # Add a column containing 0s, 1s, and 2s denoting splits
dataset_list.append(pandas_dict[spl])
return pd.concat(dataset_list)
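# Illustrative sketch (not part of the original module): the two return shapes
# of load(), using lists of dicts as hypothetical stand-ins for DataFrames.
# `combine` mirrors the split=False branch and `as_triple` the split=True
# branch; both names and the sample rows are made up.

```python
SPLITS = ["train", "validation", "test"]


def combine(split_rows):
    """Tag every row with the index of its split and concatenate all rows."""
    combined = []
    for name, rows in split_rows.items():
        for row in rows:
            combined.append({**row, "split": SPLITS.index(name)})
    return combined


def as_triple(split_rows):
    """Return (train, validation, test); a missing split becomes an empty list."""
    return tuple(split_rows.get(name, []) for name in SPLITS)


# A dataset that ships without a validation split.
data = {"train": [{"text": "a"}, {"text": "b"}], "test": [{"text": "c"}]}

combined = combine(data)
train_rows, validation_rows, test_rows = as_triple(data)

print([row["split"] for row in combined])  # [0, 0, 2]
print(len(train_rows), len(validation_rows), len(test_rows))  # 2 0 1
```

# SPLITS.index raises ValueError for a split name outside the three listed, so
# this sketch, like the lookup it mirrors, assumes the dataset only uses those
# names.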
================================================
FILE: ludwig/datasets/loaders/ieee_fraud.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class IEEEFraudLoader(DatasetLoader):
"""The IEEE-CIS Fraud Detection Dataset https://www.kaggle.com/c/ieee-fraud-detection/overview."""
def load_unprocessed_dataframe(self, file_paths: list[str]) -> pd.DataFrame:
"""Load dataset files into a dataframe."""
train_files = {"train_identity.csv", "train_transaction.csv"}
test_files = {"test_identity.csv", "test_transaction.csv"}
train_dfs, test_dfs = {}, {}
for filename in train_files.union(test_files):
split_name = os.path.splitext(filename)[0]
file_df = self.load_file_to_dataframe(os.path.join(self.raw_dataset_dir, filename))
if filename in train_files:
train_dfs[split_name] = file_df
elif filename in test_files:
test_dfs[split_name] = file_df
# Merge on TransactionID
final_train = pd.merge(
train_dfs["train_transaction"], train_dfs["train_identity"], on="TransactionID", how="left"
)
return final_train
================================================
FILE: ludwig/datasets/loaders/insurance_lite.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class InsuranceLiteLoader(DatasetLoader):
"""Health Insurance Cross Sell Prediction: predict which health insurance owners will be interested in vehicle
insurance. https://www.kaggle.com/datasets/arashnic/imbalanced-data-practice"""
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
df = super().transform_dataframe(dataframe)
# Make image paths relative to dataset root directory
df["image_path"] = df["image_path"].apply(
lambda x: os.path.join("Fast_Furious_Insured", "trainImages", os.path.basename(x))
)
return df
================================================
FILE: ludwig/datasets/loaders/kdd_loader.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from ludwig.datasets.dataset_config import DatasetConfig
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class KDDCup2009Loader(DatasetLoader):
def __init__(self, config: DatasetConfig, cache_dir: str | None = None, task_name="", include_test_download=False):
super().__init__(config, cache_dir=cache_dir)
self.task_name = task_name
self.include_test_download = include_test_download
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe."""
return pd.read_csv(file_path, sep="\t")
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
train_df = super().transform_dataframe(dataframe)
train_df = process_categorical_features(train_df, categorical_features)
train_df = process_number_features(train_df, categorical_features)
targets = (
pd.read_csv(os.path.join(self.raw_dataset_dir, f"orange_small_train_{self.task_name}.labels"), header=None)[
0
]
.astype(str)
.apply(lambda x: "true" if x == "1" else "false")
)
train_idcs = pd.read_csv(
os.path.join(self.raw_dataset_dir, f"stratified_train_idx_{self.task_name}.txt"), header=None
)[0]
val_idcs = pd.read_csv(
os.path.join(self.raw_dataset_dir, f"stratified_test_idx_{self.task_name}.txt"), header=None
)[0]
processed_train_df = train_df.iloc[train_idcs].copy()
processed_train_df["target"] = targets.iloc[train_idcs]
processed_train_df["split"] = 0
processed_val_df = train_df.iloc[val_idcs].copy()
processed_val_df["target"] = targets.iloc[val_idcs]
processed_val_df["split"] = 1
if self.include_test_download:
test_df = self.load_file_to_dataframe(os.path.join(self.raw_dataset_dir, "orange_small_test.data"))
test_df["target"] = "" # no ground truth labels for test download
test_df["split"] = 2
df = pd.concat([processed_train_df, processed_val_df, test_df])
else:
df = pd.concat([processed_train_df, processed_val_df])
return df
def process_categorical_features(df, categorical_features):
for i in categorical_features:
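# Assign the result back: fillna(inplace=True) on an iloc selection may act on a copy and leave df unchanged
df[df.columns[i]] = df.iloc[:, i].fillna("")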
return df
def process_number_features(df, categorical_features):
for i, column in enumerate(df.columns):
if i not in categorical_features:
# astype returns a new Series, so the result must be assigned back to take effect
df[column] = df[column].astype(float)
return df
categorical_features = {
190,
191,
192,
193,
194,
195,
196,
197,
198,
199,
200,
201,
202,
203,
204,
205,
206,
207,
209,
210,
211,
212,
213,
214,
215,
216,
217,
218,
219,
220,
221,
222,
223,
224,
225,
226,
227,
228,
}
class KDDAppetencyLoader(KDDCup2009Loader):
"""The KDD Cup 2009 Appetency dataset.
https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
"""
def __init__(self, config: DatasetConfig, cache_dir: str | None = None, include_test_download=False):
super().__init__(
config, cache_dir=cache_dir, task_name="appetency", include_test_download=include_test_download
)
class KDDChurnLoader(KDDCup2009Loader):
"""The KDD Cup 2009 Churn dataset.
https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
"""
def __init__(self, config: DatasetConfig, cache_dir: str | None = None, include_test_download=False):
super().__init__(config, cache_dir=cache_dir, task_name="churn", include_test_download=include_test_download)
class KDDUpsellingLoader(KDDCup2009Loader):
"""The KDD Cup 2009 Upselling dataset.
https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data
"""
def __init__(self, config: DatasetConfig, cache_dir: str | None = None, include_test_download=False):
super().__init__(
config, cache_dir=cache_dir, task_name="upselling", include_test_download=include_test_download
)
================================================
FILE: ludwig/datasets/loaders/mnist.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import os
import struct
from multiprocessing.pool import ThreadPool
import numpy as np
import pandas as pd
import torch
from ludwig.datasets.dataset_config import DatasetConfig
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
from ludwig.utils.fs_utils import makedirs
logger = logging.getLogger(__name__)
NUM_LABELS = 10
class MNISTLoader(DatasetLoader):
def __init__(self, config: DatasetConfig, cache_dir: str | None = None):
try:
from torchvision.io import write_png
self.write_png = write_png
except ImportError:
logger.error(
"torchvision is not installed. "
"In order to install all image feature dependencies run "
"pip install ludwig[image]"
)
raise
super().__init__(config, cache_dir)
def transform_files(self, file_paths: list[str]) -> list[str]:
for dataset in ["training", "testing"]:
labels, images = self.read_source_dataset(dataset, self.raw_dataset_dir)
self.write_output_dataset(labels, images, os.path.join(self.raw_dataset_dir, dataset))
return super().transform_files(file_paths)
def load_unprocessed_dataframe(self, file_paths: list[str]) -> pd.DataFrame:
"""Load dataset files into a dataframe."""
return self.output_training_and_test_data()
def read_source_dataset(self, dataset="training", path="."):
"""Read the raw IDX label and image files for the given dataset.
:param dataset: (str) which dataset to read, either "training" or "testing"
:param path: (str) the raw dataset path
:returns: A tuple (labels, images): a uint8 array of labels and a uint8 array of shape (size, rows, cols)
"""
if dataset == "training":
fname_img = os.path.join(path, "train-images-idx3-ubyte")
fname_lbl = os.path.join(path, "train-labels-idx1-ubyte")
elif dataset == "testing":
fname_img = os.path.join(path, "t10k-images-idx3-ubyte")
fname_lbl = os.path.join(path, "t10k-labels-idx1-ubyte")
else:
raise ValueError("dataset must be 'testing' or 'training'")
with open(fname_lbl, "rb") as flbl:
struct.unpack(">II", flbl.read(8))
lbl = np.frombuffer(flbl.read(), dtype=np.uint8)
with open(fname_img, "rb") as fimg:
magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
img = np.frombuffer(fimg.read(), dtype=np.uint8)
img = img.reshape((size, rows, cols))
return lbl, img
def write_output_dataset(self, labels, images, output_dir):
"""Write each image as a PNG file into a per-label subdirectory of output_dir.
:param labels: (np.array) the label of each image
:param images: (np.array) the image data, indexed in the same order as labels
:param output_dir: (str) the directory under which one subdirectory per label is created
"""
# create child image output directories
output_dirs = [os.path.join(output_dir, str(i)) for i in range(NUM_LABELS)]
for output_dir in output_dirs:
makedirs(output_dir, exist_ok=True)
def write_processed_image(t):
i, label = t
output_filename = os.path.join(output_dirs[label], str(i) + ".png")
torch_image = torch.from_numpy(images[i].copy()).view(1, 28, 28)
self.write_png(torch_image, output_filename)
# write out image data
tasks = list(enumerate(labels))
pool = ThreadPool(NUM_LABELS)
pool.map(write_processed_image, tasks)
pool.close()
pool.join()
def output_training_and_test_data(self):
"""Creates a combined (training and test) dataframe by iterating through all the images and labels."""
dataframes = []
for name in ["training", "testing"]:
labels = []
paths = []
splits = []
for i in range(NUM_LABELS):
label_dir = f"{name}/{i}"
img_dir = os.path.join(self.processed_dataset_dir, label_dir)
for file in os.listdir(img_dir):
if file.endswith(".png"):
labels.append(str(i))
paths.append(os.path.join(img_dir, file))
splits.append(0 if name == "training" else 2)
dataframes.append(pd.DataFrame({"image_path": paths, "label": labels, "split": splits}))
return pd.concat(dataframes, ignore_index=True)
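# Illustrative sketch, not part of the loader: the IDX image layout that read_source_dataset parses is a
# 16-byte big-endian header (magic number, image count, rows, cols) followed by one unsigned byte per pixel.
# The helpers build_idx_images and parse_idx_images are hypothetical names, and this version uses only the
# standard library instead of numpy.

```python
import struct


def build_idx_images(images):
    """Serialize a list of equally sized 2D images (nested lists of 0..255 ints) into IDX3 bytes."""
    rows, cols = len(images[0]), len(images[0][0])
    # 0x00000803 marks an unsigned-byte, 3-dimensional IDX file (the MNIST image format).
    header = struct.pack(">IIII", 0x00000803, len(images), rows, cols)
    body = bytes(pixel for image in images for row in image for pixel in row)
    return header + body


def parse_idx_images(data):
    """Parse IDX3 bytes back into nested lists, mirroring the header read in read_source_dataset."""
    magic_nr, size, rows, cols = struct.unpack(">IIII", data[:16])
    pixels = data[16:]
    return [
        [list(pixels[(n * rows + r) * cols : (n * rows + r + 1) * cols]) for r in range(rows)]
        for n in range(size)
    ]


sample = [[[0, 255], [128, 64]], [[1, 2], [3, 4]]]
roundtrip = parse_idx_images(build_idx_images(sample))
```

# The round trip returns the original nested lists, which is the same reshape to (size, rows, cols) that the
# loader performs with numpy.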
================================================
FILE: ludwig/datasets/loaders/naval.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class NavalLoader(DatasetLoader):
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe."""
return pd.read_csv(file_path, header=None, sep=" ")
================================================
FILE: ludwig/datasets/loaders/rossman_store_sales.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import calendar
import os
import numpy as np
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class RossmanStoreSalesLoader(DatasetLoader):
"""The Rossmann Store Sales dataset."""
def load_unprocessed_dataframe(self, file_paths: list[str]) -> pd.DataFrame:
"""Load dataset files into a dataframe."""
stores_df = pd.read_csv(os.path.join(self.raw_dataset_dir, "store.csv"))
train_df = pd.read_csv(os.path.join(self.raw_dataset_dir, "train.csv"), low_memory=False)
train_df = preprocess_df(train_df, stores_df)
train_df["split"] = -1
train_df.loc[train_df["Year"] == 2014, "split"] = 0
train_df.loc[train_df["Year"] == 2015, "split"] = 2
train_df.drop(train_df[train_df["split"] == -1].index, inplace=True)
return train_df
def preprocess_dates(df):
# Make integer Year,Month,Day columns instead of Date
dates = np.array([[int(v) for v in s.split("-")] for s in df["Date"]])
df = df.drop(["Date"], axis=1)
df["Year"] = dates[:, 0]
df["Month"] = dates[:, 1]
df["Day"] = dates[:, 2]
return df
month_abbrs = calendar.month_abbr[1:]
month_abbrs[8] = "Sept"
def preprocess_stores(df, stores_df):
# join data in df with stores df
df = df.join(stores_df, on="Store", rsuffix="_right")
df = df.drop(["Store_right"], axis=1)
promo2_start_months = [(s.split(",") if not pd.isnull(s) else []) for s in df["PromoInterval"]]
for month_abbr in month_abbrs:
df["Promo2Start_" + month_abbr] = np.array(
[(1 if month_abbr in s else 0) for s in promo2_start_months], dtype=np.int8
)
df = df.drop(["PromoInterval"], axis=1)
return df
int_columns = [
"Store",
"DayOfWeek",
"Sales",
"Customers",
"Open",
"Promo",
"SchoolHoliday",
"Year",
"Month",
"Day",
"CompetitionDistance",
"CompetitionOpenSinceMonth",
"CompetitionOpenSinceYear",
"Promo2",
"Promo2SinceWeek",
"Promo2SinceYear",
]
def preprocess_df(df, stores_df):
df = preprocess_dates(df)
df = preprocess_stores(df, stores_df)
for column in int_columns:
df[column] = pd.to_numeric(df[column].fillna(0), downcast="integer")
df["StateHoliday"] = df["StateHoliday"].astype(str)
df.loc[df["StateHoliday"] == "0", "StateHoliday"] = "No"
return df
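# Illustrative sketch, not part of the loader: how preprocess_stores turns one PromoInterval value into the
# twelve Promo2Start_<month> indicator values. The helper name promo_month_flags is hypothetical, and None
# stands in for a missing (null) PromoInterval.

```python
import calendar

# Same month abbreviations as the loader, including the "Sept" override.
month_abbrs = calendar.month_abbr[1:]
month_abbrs[8] = "Sept"


def promo_month_flags(promo_interval):
    """Return {month_abbr: 0 or 1} for a comma-separated PromoInterval string."""
    months = promo_interval.split(",") if promo_interval is not None else []
    return {abbr: int(abbr in months) for abbr in month_abbrs}


flags = promo_month_flags("Mar,Jun,Sept,Dec")
no_promo = promo_month_flags(None)
```

# The override matters because the interval strings spell September as "Sept", which would never match the
# default abbreviation.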
================================================
FILE: ludwig/datasets/loaders/santander_value_prediction.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class SantanderValuePredictionLoader(DatasetLoader):
"""The Santander Value Prediction Challenge dataset.
https://www.kaggle.com/c/santander-value-prediction-challenge
"""
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
processed_df = super().transform_dataframe(dataframe)
# Ensure feature column names are strings (some are numeric); keep special names as is
processed_df.columns = ["C" + str(col) for col in processed_df.columns]
processed_df.rename(columns={"CID": "ID", "Ctarget": "target", "Csplit": "split"}, inplace=True)
return processed_df
================================================
FILE: ludwig/datasets/loaders/sarcastic_headlines.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import pandas as pd
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class SarcasticHeadlinesLoader(DatasetLoader):
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe."""
return pd.read_json(file_path, lines=True)
================================================
FILE: ludwig/datasets/loaders/sarcos.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from scipy.io import loadmat
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
from ludwig.utils.fs_utils import open_file
class SarcosLoader(DatasetLoader):
"""The Sarcos dataset.
Details:
The data relates to an inverse dynamics problem for a seven
degrees-of-freedom SARCOS anthropomorphic robot arm. The
task is to map from a 21-dimensional input space (7 joint
positions, 7 joint velocities, 7 joint accelerations) to the
corresponding 7 joint torques. There are 44,484 training
examples and 4,449 test examples. The first 21 columns are
the input variables, and the 22nd column is used as the target
variable.
Dataset source:
Locally Weighted Projection Regression: An O(n) Algorithm for
Incremental Real Time Learning in High Dimensional Space,
S. Vijayakumar and S. Schaal, Proc ICML 2000.
http://www.gaussianprocess.org/gpml/data/
"""
def load_file_to_dataframe(self, file_path: str) -> pd.DataFrame:
"""Loads a file into a dataframe."""
with open_file(file_path) as f:
mat = loadmat(f)
file_df = pd.DataFrame(mat[os.path.basename(file_path).split(".")[0]])
return file_df
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
processed_df = super().transform_dataframe(dataframe)
columns = []
columns += [f"position_{i}" for i in range(1, 8)]
columns += [f"velocity_{i}" for i in range(1, 8)]
columns += [f"acceleration_{i}" for i in range(1, 8)]
columns += [f"torque_{i}" for i in range(1, 8)]
columns += ["split"]
processed_df.columns = columns
return processed_df
================================================
FILE: ludwig/datasets/loaders/split_loaders.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import numpy as np
import pandas as pd
from ludwig.constants import SPLIT
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class RandomSplitLoader(DatasetLoader):
"""Adds a random split column to the dataset, with fixed proportions of:
train: 70%
validation: 10%
test: 20%
"""
def transform_dataframe(self, dataframe: pd.DataFrame) -> pd.DataFrame:
df = super().transform_dataframe(dataframe)
df[SPLIT] = np.random.choice(3, len(df), p=(0.7, 0.1, 0.2)).astype(np.int8)
return df
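# Illustrative sketch, not part of the loader: RandomSplitLoader draws each row's split independently, so the
# 70/10/20 proportions hold only approximately. This standard-library version uses random.choices in place of
# np.random.choice; the function name random_split_labels and the seed parameter are assumptions added here.

```python
import random
from collections import Counter


def random_split_labels(n_rows, proportions=(0.7, 0.1, 0.2), seed=None):
    """Draw one split id per row: 0 = train, 1 = validation, 2 = test."""
    rng = random.Random(seed)
    return rng.choices([0, 1, 2], weights=proportions, k=n_rows)


labels = random_split_labels(10_000, seed=0)
counts = Counter(labels)
train_fraction = counts[0] / len(labels)
```

# Passing a seed makes the assignment reproducible; the loader above draws from numpy's global random state.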
================================================
FILE: ludwig/datasets/loaders/sst.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import os
import pandas as pd
from ludwig.datasets.dataset_config import DatasetConfig
from ludwig.datasets.loaders.dataset_loader import DatasetLoader
class SSTLoader(DatasetLoader):
"""The SST dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
Subclasses define how sentiment scores are mapped to labels by overriding get_sentiment_label.
The original dataset specified 5 labels:
very negative, negative, neutral, positive, very positive with
the following cutoffs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
"""
def __init__(
self,
config: DatasetConfig,
cache_dir: str | None = None,
include_subtrees=False,
discard_neutral=False,
convert_parentheses=True,
remove_duplicates=False,
):
super().__init__(config, cache_dir=cache_dir)
self.include_subtrees = include_subtrees
self.discard_neutral = discard_neutral
self.convert_parentheses = convert_parentheses
self.remove_duplicates = remove_duplicates
@staticmethod
def get_sentiment_label(id2sent, phrase_id):
raise NotImplementedError
def transform_files(self, file_paths: list[str]) -> list[str]:
"""Builds one CSV file of (sentence, label) pairs per split from the raw treebank files."""
sentences_df = pd.read_csv(
os.path.join(self.raw_dataset_dir, "stanfordSentimentTreebank/datasetSentences.txt"),
sep="\t",
)
sentences_df["sentence"] = sentences_df["sentence"].apply(format_text)
datasplit_df = pd.read_csv(
os.path.join(self.raw_dataset_dir, "stanfordSentimentTreebank/datasetSplit.txt"), sep=","
)
phrase2id = {}
with open(os.path.join(self.raw_dataset_dir, "stanfordSentimentTreebank/dictionary.txt")) as f:
Lines = f.readlines()
for line in Lines:
if line:
split_line = line.split("|")
phrase = split_line[0]
phrase2id[phrase] = int(split_line[1])
id2sent = {}
with open(os.path.join(self.raw_dataset_dir, "stanfordSentimentTreebank/sentiment_labels.txt")) as f:
Lines = f.readlines()
for line in Lines:
if line:
split_line = line.split("|")
try:
id2sent[int(split_line[0])] = float(split_line[1])
except ValueError:
pass
trees_pointers = None
trees_phrases = None
if self.include_subtrees:
trees_pointers = []
with open(os.path.join(self.raw_dataset_dir, "stanfordSentimentTreebank/STree.txt")) as f:
Lines = f.readlines()
for line in Lines:
if line:
trees_pointers.append([int(s.strip()) for s in line.split("|")])
trees_phrases = []
with open(os.path.join(self.raw_dataset_dir, "stanfordSentimentTreebank/SOStr.txt")) as f:
Lines = f.readlines()
for line in Lines:
if line:
trees_phrases.append([s.strip() for s in line.split("|")])
splits = {"train": 1, "test": 2, "dev": 3}
generated_csv_filenames = []
for split_name, split_id in splits.items():
sentence_idcs = get_sentence_idcs_in_split(datasplit_df, split_id)
pairs = []
if split_name == "train" and self.include_subtrees:
phrases = []
for sentence_idx in sentence_idcs:
# trees_pointers and trees_phrases are 0 indexed
# while sentence_idx starts from 1
# so we need to decrease sentence_idx value
sentence_idx -= 1
subtrees = sentence_subtrees(sentence_idx, trees_pointers, trees_phrases)
sentence_idx += 1
sentence_phrase = list(sentences_df[sentences_df["sentence_index"] == sentence_idx]["sentence"])[0]
sentence_phrase = convert_parentheses(sentence_phrase)
label = self.get_sentiment_label(id2sent, phrase2id[sentence_phrase])
# filter @ sentence level
# For SST-2, check subtrees only if sentence is not neutral
if not self.discard_neutral or label != -1:
for phrase in subtrees:
label = self.get_sentiment_label(id2sent, phrase2id[phrase])
if not self.discard_neutral or label != -1:
if not self.convert_parentheses:
phrase = convert_parentheses_back(phrase)
phrase = phrase.replace("\xa0", " ")
pairs.append([phrase, label])
else:
phrases = get_sentences_with_idcs(sentences_df, sentence_idcs)
for phrase in phrases:
phrase = convert_parentheses(phrase)
label = self.get_sentiment_label(id2sent, phrase2id[phrase])
if not self.discard_neutral or label != -1:
if not self.convert_parentheses:
phrase = convert_parentheses_back(phrase)
phrase = phrase.replace("\xa0", " ")
pairs.append([phrase, label])
final_csv = pd.DataFrame(pairs)
final_csv.columns = ["sentence", "label"]
if self.remove_duplicates:
final_csv = final_csv.drop_duplicates(subset=["sentence"])
csv_filename = os.path.join(self.raw_dataset_dir, f"{split_name}.csv")
generated_csv_filenames.append(csv_filename)
final_csv.to_csv(csv_filename, index=False)
return super().transform_files(generated_csv_filenames)
class SST2Loader(SSTLoader):
"""The SST2 dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
This dataset contains binary labels (positive or negative) for each sample.
The original dataset specified 5 labels:
very negative, negative, neutral, positive, very positive with
the following cutoffs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
In the construction of this dataset, we remove all neutral phrases
and assign a negative label if the original rating falls
into the following range: [0, 0.4] and a positive label
if the original rating is between (0.6, 1.0].
"""
def __init__(
self,
config: DatasetConfig,
cache_dir: str | None = None,
include_subtrees=False,
convert_parentheses=True,
remove_duplicates=False,
):
super().__init__(
config,
cache_dir=cache_dir,
include_subtrees=include_subtrees,
discard_neutral=True,
convert_parentheses=convert_parentheses,
remove_duplicates=remove_duplicates,
)
def get_sentiment_label(self, id2sent, phrase_id):
sentiment = id2sent[phrase_id]
if sentiment <= 0.4: # negative
return 0
elif sentiment > 0.6: # positive
return 1
return -1 # neutral
class SST3Loader(SSTLoader):
"""The SST3 dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
This dataset contains three labels (negative, neutral, positive) for each sample.
The five original labels (very negative, negative, neutral, positive, very positive)
are collapsed into three using the following cutoffs:
[0, 0.4], (0.4, 0.6], (0.6, 1.0]
This class pulls in an array of mixins for different types of functionality
which belongs in the workflow for ingesting and transforming
training data into a destination dataframe that can be used by Ludwig.
"""
def __init__(
self,
config: DatasetConfig,
cache_dir: str | None = None,
include_subtrees=False,
convert_parentheses=True,
remove_duplicates=False,
):
super().__init__(
config,
cache_dir=cache_dir,
include_subtrees=include_subtrees,
convert_parentheses=convert_parentheses,
remove_duplicates=remove_duplicates,
)
def get_sentiment_label(self, id2sent, phrase_id):
sentiment = id2sent[phrase_id]
if sentiment <= 0.4:
return "negative"
elif sentiment <= 0.6:
return "neutral"
elif sentiment <= 1.0:
return "positive"
return "neutral"
class SST5Loader(SSTLoader):
"""The SST5 dataset.
This dataset is constructed using the Stanford Sentiment Treebank Dataset.
This dataset contains five labels (very negative, negative, neutral,
positive, very positive) for each sample.
In the original dataset, the 5 labels: very negative, negative, neutral, positive,
and very positive have the following cutoffs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
This class pulls in an array of mixins for different types of functionality
which belongs in the workflow for ingesting and transforming
training data into a destination dataframe that can be used by Ludwig.
"""
def __init__(
self,
config: DatasetConfig,
cache_dir: str | None = None,
include_subtrees=False,
convert_parentheses=True,
remove_duplicates=False,
):
super().__init__(
config,
cache_dir=cache_dir,
include_subtrees=include_subtrees,
convert_parentheses=convert_parentheses,
remove_duplicates=remove_duplicates,
)
def get_sentiment_label(self, id2sent, phrase_id):
sentiment = id2sent[phrase_id]
if sentiment <= 0.2:
return "very_negative"
elif sentiment <= 0.4:
return "negative"
elif sentiment <= 0.6:
return "neutral"
elif sentiment <= 0.8:
return "positive"
elif sentiment <= 1.0:
return "very_positive"
return "neutral"
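# Illustrative sketch, not part of the loader: the SST5 cutoffs are right-closed intervals, so the if/elif
# chain above is equivalent to a binary search over the upper bounds. The names SST5_UPPER_BOUNDS,
# SST5_LABELS, and sst5_label are hypothetical.

```python
from bisect import bisect_left

SST5_UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]
SST5_LABELS = ["very_negative", "negative", "neutral", "positive", "very_positive"]


def sst5_label(sentiment):
    """Map a sentiment score to its SST5 label; scores above 1.0 fall back to "neutral"."""
    # bisect_left returns the first index whose bound is >= sentiment, so a score
    # exactly on a cutoff (e.g. 0.2) lands in the lower bucket, matching "<=".
    idx = bisect_left(SST5_UPPER_BOUNDS, sentiment)
    return SST5_LABELS[idx] if idx < len(SST5_LABELS) else "neutral"


examples = {score: sst5_label(score) for score in (0.0, 0.2, 0.21, 0.5, 0.8, 1.0)}
```

# Using bisect_left rather than bisect_right is what keeps boundary scores in the lower bucket.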
def format_text(text: str):
"""Repairs mis-decoded text by re-encoding each word as latin1 and decoding it as utf-8."""
return " ".join([w.encode("latin1").decode("utf-8") for w in text.strip().split(" ")])
def convert_parentheses(text: str):
"""Replaces -LRB- and -RRB- tokens present in SST with ( and )"""
return text.replace("-LRB-", "(").replace("-RRB-", ")")
def convert_parentheses_back(text: str):
"""Replaces ( and ) tokens with -LRB- and -RRB-"""
return text.replace("(", "-LRB-").replace(")", "-RRB-")
def get_sentence_idcs_in_split(datasplit: pd.DataFrame, split_id: int):
"""Given a dataset split id (1 for train, 2 for test, 3 for dev), returns the set of corresponding sentence
indices in sentences_df."""
return set(datasplit[datasplit["splitset_label"] == split_id]["sentence_index"])
def get_sentences_with_idcs(sentences: pd.DataFrame, sentences_idcs: set[int]):
"""Given a set of sentence indices, returns the corresponding sentence texts in sentences."""
criterion = sentences["sentence_index"].map(lambda x: x in sentences_idcs)
return sentences[criterion]["sentence"].tolist()
def sentence_subtrees(sentence_idx, trees_pointers, trees_phrases):
tree_pointers = trees_pointers[sentence_idx]
tree_phrases = trees_phrases[sentence_idx]
tree = SSTTree(tree_pointers, tree_phrases)
return tree.subtrees()
def visit_postorder(node, visit_list):
if node:
visit_postorder(node.left, visit_list)
visit_postorder(node.right, visit_list)
visit_list.append(node.val)
class SSTTree:
class Node:
def __init__(self, key, val=None):
self.left = None
self.right = None
self.key = key
self.val = val
def create_node(self, parent, i):
if self.nodes[i] is not None:
# already created
return
self.nodes[i] = self.Node(i)
if parent[i] == -1:
# is root
self.root = self.nodes[i]
return
if self.nodes[parent[i]] is None:
# parent not yet created
self.create_node(parent, parent[i])
# assign current node to parent
parent = self.nodes[parent[i]]
if parent.left is None:
parent.left = self.nodes[i]
else:
parent.right = self.nodes[i]
def create_tree(self, parents, tree_phrases):
n = len(parents)
self.nodes = [None for i in range(n)]
self.root = None
for i in range(n):
self.create_node(parents, i)
for i, phrase in enumerate(tree_phrases):
self.nodes[i].val = phrase
for node in self.nodes:
if node.val is None:
node.val = " ".join((node.left.val, node.right.val))
def __init__(self, tree_pointers, tree_phrases):
self.create_tree([int(elem) - 1 for elem in tree_pointers], tree_phrases)
def subtrees(self):
visit_list = []
visit_postorder(self.root, visit_list)
return visit_list
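# Illustrative sketch, not part of this module: a compact, standalone way to get
# the same post-order list of subtree phrases that SSTTree.subtrees() returns.
# It assumes the encoding that SSTTree.__init__ implies: 1-based parent pointers,
# a root whose pointer is 0, and tokens occupying the first node indices.
# subtree_phrases and the toy tree below are hypothetical, chosen for illustration.
```python
def subtree_phrases(parent_pointers, tokens):
    """Returns the phrase spanned by every subtree, in post-order."""
    # Shift 1-based pointers to 0-based indices; the root's parent becomes -1.
    parents = [int(p) - 1 for p in parent_pointers]
    n = len(parents)

    # first_leaf[i] is the leftmost token under node i. Leaves are walked in
    # ascending order, so the first leaf to reach a node is its leftmost one.
    first_leaf = [None] * n
    for leaf in range(len(tokens)):
        node = leaf
        while node != -1 and first_leaf[node] is None:
            first_leaf[node] = leaf
            node = parents[node]

    # Group children under their parent, ordered left to right.
    children = [[] for _ in range(n)]
    root = parents.index(-1)
    for i, p in enumerate(parents):
        if p != -1:
            children[p].append(i)
    for kids in children:
        kids.sort(key=lambda c: first_leaf[c])

    phrases = list(tokens) + [None] * (n - len(tokens))
    visited = []

    def visit(i):
        for c in children[i]:
            visit(c)
        if phrases[i] is None:
            # An internal node's phrase joins the phrases of its children.
            phrases[i] = " ".join(phrases[c] for c in children[i])
        visited.append(phrases[i])

    visit(root)
    return visited


# Toy tree for "x y z" bracketed as ((x y) z): nodes 1-3 are the tokens,
# node 4 spans "x y", and node 5 is the root (parent pointer 0).
print(subtree_phrases(["4", "4", "5", "5", "0"], ["x", "y", "z"]))
# -> ['x', 'y', 'x y', 'z', 'x y z']
```
# Ordering children by their leftmost token keeps the words of each phrase in
# sentence order even when an internal node has a lower index than its sibling leaf.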
================================================
FILE: ludwig/datasets/model_configs/__init__.py
================================================
================================================
FILE: ludwig/datasets/model_configs/adult_census_income_default.yaml
================================================
output_features:
- name: income
type: category
input_features:
- name: age
type: number
- name: workclass
type: category
- name: fnlwgt
type: number
- name: education
type: category
- name: education-num
type: number
- name: marital-status
type: category
- name: occupation
type: category
- name: relationship
type: category
- name: race
type: category
- name: sex
type: category
- name: capital-gain
type: number
- name: capital-loss
type: number
- name: hours-per-week
type: number
- name: native-country
type: category
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
steps_per_checkpoint: 1
================================================
FILE: ludwig/datasets/model_configs/allstate_claims_severity_default.yaml
================================================
output_features:
- name: loss
type: number
input_features:
- name: cat1
type: category
- name: cat2
type: category
- name: cat3
type: category
- name: cat4
type: category
- name: cat5
type: category
- name: cat6
type: category
- name: cat7
type: category
- name: cat8
type: category
- name: cat9
type: category
- name: cat10
type: category
- name: cat11
type: category
- name: cat12
type: category
- name: cat13
type: category
- name: cat14
type: category
- name: cat15
type: category
- name: cat16
type: category
- name: cat17
type: category
- name: cat18
type: category
- name: cat19
type: category
- name: cat20
type: category
- name: cat21
type: category
- name: cat22
type: category
- name: cat23
type: category
- name: cat24
type: category
- name: cat25
type: category
- name: cat26
type: category
- name: cat27
type: category
- name: cat28
type: category
- name: cat29
type: category
- name: cat30
type: category
- name: cat31
type: category
- name: cat32
type: category
- name: cat33
type: category
- name: cat34
type: category
- name: cat35
type: category
- name: cat36
type: category
- name: cat37
type: category
- name: cat38
type: category
- name: cat39
type: category
- name: cat40
type: category
- name: cat41
type: category
- name: cat42
type: category
- name: cat43
type: category
- name: cat44
type: category
- name: cat45
type: category
- name: cat46
type: category
- name: cat47
type: category
- name: cat48
type: category
- name: cat49
type: category
- name: cat50
type: category
- name: cat51
type: category
- name: cat52
type: category
- name: cat53
type: category
- name: cat54
type: category
- name: cat55
type: category
- name: cat56
type: category
- name: cat57
type: category
- name: cat58
type: category
- name: cat59
type: category
- name: cat60
type: category
- name: cat61
type: category
- name: cat62
type: category
- name: cat63
type: category
- name: cat64
type: category
- name: cat65
type: category
- name: cat66
type: category
- name: cat67
type: category
- name: cat68
type: category
- name: cat69
type: category
- name: cat70
type: category
- name: cat71
type: category
- name: cat72
type: category
- name: cat73
type: category
- name: cat74
type: category
- name: cat75
type: category
- name: cat76
type: category
- name: cat77
type: category
- name: cat78
type: category
- name: cat79
type: category
- name: cat80
type: category
- name: cat81
type: category
- name: cat82
type: category
- name: cat83
type: category
- name: cat84
type: category
- name: cat85
type: category
- name: cat86
type: category
- name: cat87
type: category
- name: cat88
type: category
- name: cat89
type: category
- name: cat90
type: category
- name: cat91
type: category
- name: cat92
type: category
- name: cat93
type: category
- name: cat94
type: category
- name: cat95
type: category
- name: cat96
type: category
- name: cat97
type: category
- name: cat98
type: category
- name: cat99
type: category
- name: cat100
type: category
- name: cat101
type: category
- name: cat102
type: category
- name: cat103
type: category
- name: cat104
type: category
- name: cat105
type: category
- name: cat106
type: category
- name: cat107
type: category
- name: cat108
type: category
- name: cat109
type: category
- name: cat110
type: category
- name: cat111
type: category
- name: cat112
type: category
- name: cat113
type: category
- name: cat114
type: category
- name: cat115
type: category
- name: cat116
type: category
- name: cont1
type: number
- name: cont2
type: number
- name: cont3
type: number
- name: cont4
type: number
- name: cont5
type: number
- name: cont6
type: number
- name: cont7
type: number
- name: cont8
type: number
- name: cont9
type: number
- name: cont10
type: number
- name: cont11
type: number
- name: cont12
type: number
- name: cont13
type: number
- name: cont14
type: number
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/ames_housing_default.yaml
================================================
output_features:
- name: SalePrice
type: number
input_features:
- name: MSSubClass
type: category
- name: MSZoning
type: category
- name: LotFrontage
type: number
- name: LotArea
type: number
- name: Street
type: category
- name: Alley
type: category
- name: LotShape
type: category
- name: LandContour
type: category
- name: Utilities
type: category
- name: LotConfig
type: category
- name: LandSlope
type: category
- name: Neighborhood
type: category
- name: Condition1
type: category
- name: Condition2
type: category
- name: BldgType
type: category
- name: HouseStyle
type: category
- name: OverallQual
type: category
- name: OverallCond
type: category
- name: YearBuilt
type: number
- name: YearRemodAdd
type: number
- name: RoofStyle
type: category
- name: RoofMatl
type: category
- name: Exterior1st
type: category
- name: Exterior2nd
type: category
- name: MasVnrType
type: category
- name: MasVnrArea
type: number
- name: ExterQual
type: category
- name: ExterCond
type: category
- name: Foundation
type: category
- name: BsmtQual
type: category
- name: BsmtCond
type: category
- name: BsmtExposure
type: category
- name: BsmtFinType1
type: category
- name: BsmtFinSF1
type: number
- name: BsmtFinType2
type: category
- name: BsmtFinSF2
type: number
- name: BsmtUnfSF
type: number
- name: TotalBsmtSF
type: number
- name: Heating
type: category
- name: HeatingQC
type: category
- name: CentralAir
type: binary
- name: Electrical
type: category
- name: 1stFlrSF
type: number
- name: 2ndFlrSF
type: number
- name: LowQualFinSF
type: number
- name: GrLivArea
type: number
- name: BsmtFullBath
type: number
- name: BsmtHalfBath
type: number
- name: FullBath
type: number
- name: HalfBath
type: number
- name: BedroomAbvGr
type: number
- name: KitchenAbvGr
type: number
- name: KitchenQual
type: category
- name: TotRmsAbvGrd
type: number
- name: Functional
type: category
- name: Fireplaces
type: number
- name: FireplaceQu
type: category
- name: GarageType
type: category
- name: GarageYrBlt
type: number
- name: GarageFinish
type: category
- name: GarageCars
type: number
- name: GarageArea
type: number
- name: GarageQual
type: category
- name: GarageCond
type: category
- name: PavedDrive
type: category
- name: WoodDeckSF
type: number
- name: OpenPorchSF
type: number
- name: EnclosedPorch
type: number
- name: 3SsnPorch
type: number
- name: ScreenPorch
type: number
- name: PoolArea
type: number
- name: PoolQC
type: category
- name: Fence
type: category
- name: MiscFeature
type: category
- name: MiscVal
type: number
- name: MoSold
type: category
- name: YrSold
type: number
- name: SaleType
type: category
- name: SaleCondition
type: category
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/bnp_claims_management_default.yaml
================================================
output_features:
- name: target
type: binary
input_features:
- name: v1
type: number
- name: v2
type: number
- name: v3
type: category
- name: v4
type: number
- name: v5
type: number
- name: v6
type: number
- name: v7
type: number
- name: v8
type: number
- name: v9
type: number
- name: v10
type: number
- name: v11
type: number
- name: v12
type: number
- name: v13
type: number
- name: v14
type: number
- name: v15
type: number
- name: v16
type: number
- name: v17
type: number
- name: v18
type: number
- name: v19
type: number
- name: v20
type: number
- name: v21
type: number
- name: v22
type: category
- name: v23
type: number
- name: v24
type: category
- name: v25
type: number
- name: v26
type: number
- name: v27
type: number
- name: v28
type: number
- name: v29
type: number
- name: v30
type: category
- name: v31
type: category
- name: v32
type: number
- name: v33
type: number
- name: v34
type: number
- name: v35
type: number
- name: v36
type: number
- name: v37
type: number
- name: v38
type: number
- name: v39
type: number
- name: v40
type: number
- name: v41
type: number
- name: v42
type: number
- name: v43
type: number
- name: v44
type: number
- name: v45
type: number
- name: v46
type: number
- name: v47
type: category
- name: v48
type: number
- name: v49
type: number
- name: v50
type: number
- name: v51
type: number
- name: v52
type: category
- name: v53
type: number
- name: v54
type: number
- name: v55
type: number
- name: v56
type: category
- name: v57
type: number
- name: v58
type: number
- name: v59
type: number
- name: v60
type: number
- name: v61
type: number
- name: v62
type: number
- name: v63
type: number
- name: v64
type: number
- name: v65
type: number
- name: v66
type: category
- name: v67
type: number
- name: v68
type: number
- name: v69
type: number
- name: v70
type: number
- name: v71
type: category
- name: v72
type: number
- name: v73
type: number
- name: v74
type: category
- name: v75
type: category
- name: v76
type: number
- name: v77
type: number
- name: v78
type: number
- name: v79
type: category
- name: v80
type: number
- name: v81
type: number
- name: v82
type: number
- name: v83
type: number
- name: v84
type: number
- name: v85
type: number
- name: v86
type: number
- name: v87
type: number
- name: v88
type: number
- name: v89
type: number
- name: v90
type: number
- name: v91
type: category
- name: v92
type: number
- name: v93
type: number
- name: v94
type: number
- name: v95
type: number
- name: v96
type: number
- name: v97
type: number
- name: v98
type: number
- name: v99
type: number
- name: v100
type: number
- name: v101
type: number
- name: v102
type: number
- name: v103
type: number
- name: v104
type: number
- name: v105
type: number
- name: v106
type: number
- name: v107
type: category
- name: v108
type: number
- name: v109
type: number
- name: v110
type: category
- name: v111
type: number
- name: v112
type: category
- name: v113
type: category
- name: v114
type: number
- name: v115
type: number
- name: v116
type: number
- name: v117
type: number
- name: v118
type: number
- name: v119
type: number
- name: v120
type: number
- name: v121
type: number
- name: v122
type: number
- name: v123
type: number
- name: v124
type: number
- name: v125
type: category
- name: v126
type: number
- name: v127
type: number
- name: v128
type: number
- name: v129
type: number
- name: v130
type: number
- name: v131
type: number
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/forest_cover_default.yaml
================================================
output_features:
- name: Cover_Type
type: category
input_features:
- name: Elevation
type: number
- name: Aspect
type: number
- name: Slope
type: number
- name: Horizontal_Distance_To_Hydrology
type: number
- name: Vertical_Distance_To_Hydrology
type: number
- name: Horizontal_Distance_To_Roadways
type: number
- name: Hillshade_9am
type: number
- name: Hillshade_Noon
type: number
- name: Hillshade_3pm
type: number
- name: Horizontal_Distance_To_Fire_Points
type: number
- name: Wilderness_Area
type: category
- name: Soil_Type
type: category
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/higgs_best.yaml
================================================
output_features:
- name: label
type: binary
weight_regularization: null
input_features:
- name: lepton_pT
type: number
- name: lepton_eta
type: number
- name: lepton_phi
type: number
- name: missing_energy_magnitude
type: number
- name: missing_energy_phi
type: number
- name: jet_1_pt
type: number
- name: jet_1_eta
type: number
- name: jet_1_phi
type: number
- name: jet_1_b-tag
type: number
- name: jet_2_pt
type: number
- name: jet_2_eta
type: number
- name: jet_2_phi
type: number
- name: jet_2_b-tag
type: number
- name: jet_3_pt
type: number
- name: jet_3_eta
type: number
- name: jet_3_phi
type: number
- name: jet_3_b-tag
type: number
- name: jet_4_pt
type: number
- name: jet_4_eta
type: number
- name: jet_4_phi
type: number
- name: jet_4_b-tag
type: number
- name: m_jj
type: number
- name: m_jjj
type: number
- name: m_lv
type: number
- name: m_jlv
type: number
- name: m_bb
type: number
- name: m_wbb
type: number
- name: m_wwbb
type: number
combiner:
type: tabnet
bn_momentum: 0.95
bn_virtual_bs: 1024
dropout: 0.05252744300130521
fc_size: 128
num_fc_layers: 3
num_steps: 3
output_size: 128
relaxation_factor: 1.5
size: 32
sparsity: 0.0001
training:
batch_size: 8192
learning_rate: 0.01
shuffle_buffer_size: 1000000
should_shuffle: true
eval_batch_size: 500000 #4096 # 65536 131072 262144 524288
epochs: 300
early_stop: 30
optimizer:
type: adam
learning_rate_scheduler:
decay: exponential
decay_rate: 0.8
decay_steps: 20000
regularization_lambda: 1
validation_field: label
================================================
FILE: ludwig/datasets/model_configs/higgs_default.yaml
================================================
output_features:
- name: label
type: binary
weight_regularization: null
input_features:
- name: lepton_pT
type: number
- name: lepton_eta
type: number
- name: lepton_phi
type: number
- name: missing_energy_magnitude
type: number
- name: missing_energy_phi
type: number
- name: jet_1_pt
type: number
- name: jet_1_eta
type: number
- name: jet_1_phi
type: number
- name: jet_1_b-tag
type: number
- name: jet_2_pt
type: number
- name: jet_2_eta
type: number
- name: jet_2_phi
type: number
- name: jet_2_b-tag
type: number
- name: jet_3_pt
type: number
- name: jet_3_eta
type: number
- name: jet_3_phi
type: number
- name: jet_3_b-tag
type: number
- name: jet_4_pt
type: number
- name: jet_4_eta
type: number
- name: jet_4_phi
type: number
- name: jet_4_b-tag
type: number
- name: m_jj
type: number
- name: m_jjj
type: number
- name: m_lv
type: number
- name: m_jlv
type: number
- name: m_bb
type: number
- name: m_wbb
type: number
- name: m_wwbb
type: number
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/ieee_fraud_default.yaml
================================================
output_features:
- name: isFraud
type: binary
input_features:
- name: TransactionDT
type: number
- name: TransactionAmt
type: number
- name: ProductCD
type: category
- name: card1
type: number
- name: card2
type: number
- name: card3
type: number
- name: card4
type: category
- name: card5
type: number
- name: card6
type: category
- name: addr1
type: number
- name: addr2
type: number
- name: dist1
type: number
- name: dist2
type: number
- name: P_emaildomain
type: category
- name: R_emaildomain
type: category
- name: C1
type: number
- name: C2
type: number
- name: C3
type: number
- name: C4
type: number
- name: C5
type: number
- name: C6
type: number
- name: C7
type: number
- name: C8
type: number
- name: C9
type: number
- name: C10
type: number
- name: C11
type: number
- name: C12
type: number
- name: C13
type: number
- name: C14
type: number
- name: D1
type: number
- name: D2
type: number
- name: D3
type: number
- name: D4
type: number
- name: D5
type: number
- name: D6
type: number
- name: D7
type: number
- name: D8
type: number
- name: D9
type: number
- name: D10
type: number
- name: D11
type: number
- name: D12
type: number
- name: D13
type: number
- name: D14
type: number
- name: D15
type: number
- name: M1
type: category
- name: M2
type: category
- name: M3
type: category
- name: M4
type: category
- name: M5
type: category
- name: M6
type: category
- name: M7
type: category
- name: M8
type: category
- name: M9
type: category
- name: V1
type: number
- name: V2
type: number
- name: V3
type: number
- name: V4
type: number
- name: V5
type: number
- name: V6
type: number
- name: V7
type: number
- name: V8
type: number
- name: V9
type: number
- name: V10
type: number
- name: V11
type: number
- name: V12
type: number
- name: V13
type: number
- name: V14
type: number
- name: V15
type: number
- name: V16
type: number
- name: V17
type: number
- name: V18
type: number
- name: V19
type: number
- name: V20
type: number
- name: V21
type: number
- name: V22
type: number
- name: V23
type: number
- name: V24
type: number
- name: V25
type: number
- name: V26
type: number
- name: V27
type: number
- name: V28
type: number
- name: V29
type: number
- name: V30
type: number
- name: V31
type: number
- name: V32
type: number
- name: V33
type: number
- name: V34
type: number
- name: V35
type: number
- name: V36
type: number
- name: V37
type: number
- name: V38
type: number
- name: V39
type: number
- name: V40
type: number
- name: V41
type: number
- name: V42
type: number
- name: V43
type: number
- name: V44
type: number
- name: V45
type: number
- name: V46
type: number
- name: V47
type: number
- name: V48
type: number
- name: V49
type: number
- name: V50
type: number
- name: V51
type: number
- name: V52
type: number
- name: V53
type: number
- name: V54
type: number
- name: V55
type: number
- name: V56
type: number
- name: V57
type: number
- name: V58
type: number
- name: V59
type: number
- name: V60
type: number
- name: V61
type: number
- name: V62
type: number
- name: V63
type: number
- name: V64
type: number
- name: V65
type: number
- name: V66
type: number
- name: V67
type: number
- name: V68
type: number
- name: V69
type: number
- name: V70
type: number
- name: V71
type: number
- name: V72
type: number
- name: V73
type: number
- name: V74
type: number
- name: V75
type: number
- name: V76
type: number
- name: V77
type: number
- name: V78
type: number
- name: V79
type: number
- name: V80
type: number
- name: V81
type: number
- name: V82
type: number
- name: V83
type: number
- name: V84
type: number
- name: V85
type: number
- name: V86
type: number
- name: V87
type: number
- name: V88
type: number
- name: V89
type: number
- name: V90
type: number
- name: V91
type: number
- name: V92
type: number
- name: V93
type: number
- name: V94
type: number
- name: V95
type: number
- name: V96
type: number
- name: V97
type: number
- name: V98
type: number
- name: V99
type: number
- name: V100
type: number
- name: V101
type: number
- name: V102
type: number
- name: V103
type: number
- name: V104
type: number
- name: V105
type: number
- name: V106
type: number
- name: V107
type: number
- name: V108
type: number
- name: V109
type: number
- name: V110
type: number
- name: V111
type: number
- name: V112
type: number
- name: V113
type: number
- name: V114
type: number
- name: V115
type: number
- name: V116
type: number
- name: V117
type: number
- name: V118
type: number
- name: V119
type: number
- name: V120
type: number
- name: V121
type: number
- name: V122
type: number
- name: V123
type: number
- name: V124
type: number
- name: V125
type: number
- name: V126
type: number
- name: V127
type: number
- name: V128
type: number
- name: V129
type: number
- name: V130
type: number
- name: V131
type: number
- name: V132
type: number
- name: V133
type: number
- name: V134
type: number
- name: V135
type: number
- name: V136
type: number
- name: V137
type: number
- name: V138
type: number
- name: V139
type: number
- name: V140
type: number
- name: V141
type: number
- name: V142
type: number
- name: V143
type: number
- name: V144
type: number
- name: V145
type: number
- name: V146
type: number
- name: V147
type: number
- name: V148
type: number
- name: V149
type: number
- name: V150
type: number
- name: V151
type: number
- name: V152
type: number
- name: V153
type: number
- name: V154
type: number
- name: V155
type: number
- name: V156
type: number
- name: V157
type: number
- name: V158
type: number
- name: V159
type: number
- name: V160
type: number
- name: V161
type: number
- name: V162
type: number
- name: V163
type: number
- name: V164
type: number
- name: V165
type: number
- name: V166
type: number
- name: V167
type: number
- name: V168
type: number
- name: V169
type: number
- name: V170
type: number
- name: V171
type: number
- name: V172
type: number
- name: V173
type: number
- name: V174
type: number
- name: V175
type: number
- name: V176
type: number
- name: V177
type: number
- name: V178
type: number
- name: V179
type: number
- name: V180
type: number
- name: V181
type: number
- name: V182
type: number
- name: V183
type: number
- name: V184
type: number
- name: V185
type: number
- name: V186
type: number
- name: V187
type: number
- name: V188
type: number
- name: V189
type: number
- name: V190
type: number
- name: V191
type: number
- name: V192
type: number
- name: V193
type: number
- name: V194
type: number
- name: V195
type: number
- name: V196
type: number
- name: V197
type: number
- name: V198
type: number
- name: V199
type: number
- name: V200
type: number
- name: V201
type: number
- name: V202
type: number
- name: V203
type: number
- name: V204
type: number
- name: V205
type: number
- name: V206
type: number
- name: V207
type: number
- name: V208
type: number
- name: V209
type: number
- name: V210
type: number
- name: V211
type: number
- name: V212
type: number
- name: V213
type: number
- name: V214
type: number
- name: V215
type: number
- name: V216
type: number
- name: V217
type: number
- name: V218
type: number
- name: V219
type: number
- name: V220
type: number
- name: V221
type: number
- name: V222
type: number
- name: V223
type: number
- name: V224
type: number
- name: V225
type: number
- name: V226
type: number
- name: V227
type: number
- name: V228
type: number
- name: V229
type: number
- name: V230
type: number
- name: V231
type: number
- name: V232
type: number
- name: V233
type: number
- name: V234
type: number
- name: V235
type: number
- name: V236
type: number
- name: V237
type: number
- name: V238
type: number
- name: V239
type: number
- name: V240
type: number
- name: V241
type: number
- name: V242
type: number
- name: V243
type: number
- name: V244
type: number
- name: V245
type: number
- name: V246
type: number
- name: V247
type: number
- name: V248
type: number
- name: V249
type: number
- name: V250
type: number
- name: V251
type: number
- name: V252
type: number
- name: V253
type: number
- name: V254
type: number
- name: V255
type: number
- name: V256
type: number
- name: V257
type: number
- name: V258
type: number
- name: V259
type: number
- name: V260
type: number
- name: V261
type: number
- name: V262
type: number
- name: V263
type: number
- name: V264
type: number
- name: V265
type: number
- name: V266
type: number
- name: V267
type: number
- name: V268
type: number
- name: V269
type: number
- name: V270
type: number
- name: V271
type: number
- name: V272
type: number
- name: V273
type: number
- name: V274
type: number
- name: V275
type: number
- name: V276
type: number
- name: V277
type: number
- name: V278
type: number
- name: V279
type: number
- name: V280
type: number
- name: V281
type: number
- name: V282
type: number
- name: V283
type: number
- name: V284
type: number
- name: V285
type: number
- name: V286
type: number
- name: V287
type: number
- name: V288
type: number
- name: V289
type: number
- name: V290
type: number
- name: V291
type: number
- name: V292
type: number
- name: V293
type: number
- name: V294
type: number
- name: V295
type: number
- name: V296
type: number
- name: V297
type: number
- name: V298
type: number
- name: V299
type: number
- name: V300
type: number
- name: V301
type: number
- name: V302
type: number
- name: V303
type: number
- name: V304
type: number
- name: V305
type: number
- name: V306
type: number
- name: V307
type: number
- name: V308
type: number
- name: V309
type: number
- name: V310
type: number
- name: V311
type: number
- name: V312
type: number
- name: V313
type: number
- name: V314
type: number
- name: V315
type: number
- name: V316
type: number
- name: V317
type: number
- name: V318
type: number
- name: V319
type: number
- name: V320
type: number
- name: V321
type: number
- name: V322
type: number
- name: V323
type: number
- name: V324
type: number
- name: V325
type: number
- name: V326
type: number
- name: V327
type: number
- name: V328
type: number
- name: V329
type: number
- name: V330
type: number
- name: V331
type: number
- name: V332
type: number
- name: V333
type: number
- name: V334
type: number
- name: V335
type: number
- name: V336
type: number
- name: V337
type: number
- name: V338
type: number
- name: V339
type: number
- name: id_01
type: number
- name: id_02
type: number
- name: id_03
type: number
- name: id_04
type: number
- name: id_05
type: number
- name: id_06
type: number
- name: id_07
type: number
- name: id_08
type: number
- name: id_09
type: number
- name: id_10
type: number
- name: id_11
type: number
- name: id_12
type: number
- name: id_13
type: number
- name: id_14
type: number
- name: id_15
type: number
- name: id_16
type: number
- name: id_17
type: number
- name: id_18
type: number
- name: id_19
type: number
- name: id_20
type: number
- name: id_21
type: number
- name: id_22
type: number
- name: id_23
type: number
- name: id_24
type: number
- name: id_25
type: number
- name: id_26
type: number
- name: id_27
type: number
- name: id_28
type: number
- name: id_29
type: number
- name: id_30
type: number
- name: id_31
type: number
- name: id_32
type: number
- name: id_33
type: number
- name: id_34
type: number
- name: id_35
type: number
- name: id_36
type: number
- name: id_37
type: number
- name: id_38
type: number
- name: DeviceType
type: category
- name: DeviceInfo
type: category
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/mercedes_benz_greener_default.yaml
================================================
output_features:
- name: y
type: number
input_features:
- name: X0
type: category
- name: X1
type: category
- name: X2
type: category
- name: X3
type: category
- name: X4
type: category
- name: X5
type: category
- name: X6
type: category
- name: X8
type: category
- name: X10
type: binary
- name: X11
type: binary
- name: X12
type: binary
- name: X13
type: binary
- name: X14
type: binary
- name: X15
type: binary
- name: X16
type: binary
- name: X17
type: binary
- name: X18
type: binary
- name: X19
type: binary
- name: X20
type: binary
- name: X21
type: binary
- name: X22
type: binary
- name: X23
type: binary
- name: X24
type: binary
- name: X26
type: binary
- name: X27
type: binary
- name: X28
type: binary
- name: X29
type: binary
- name: X30
type: binary
- name: X31
type: binary
- name: X32
type: binary
- name: X33
type: binary
- name: X34
type: binary
- name: X35
type: binary
- name: X36
type: binary
- name: X37
type: binary
- name: X38
type: binary
- name: X39
type: binary
- name: X40
type: binary
- name: X41
type: binary
- name: X42
type: binary
- name: X43
type: binary
- name: X44
type: binary
- name: X45
type: binary
- name: X46
type: binary
- name: X47
type: binary
- name: X48
type: binary
- name: X49
type: binary
- name: X50
type: binary
- name: X51
type: binary
- name: X52
type: binary
- name: X53
type: binary
- name: X54
type: binary
- name: X55
type: binary
- name: X56
type: binary
- name: X57
type: binary
- name: X58
type: binary
- name: X59
type: binary
- name: X60
type: binary
- name: X61
type: binary
- name: X62
type: binary
- name: X63
type: binary
- name: X64
type: binary
- name: X65
type: binary
- name: X66
type: binary
- name: X67
type: binary
- name: X68
type: binary
- name: X69
type: binary
- name: X70
type: binary
- name: X71
type: binary
- name: X73
type: binary
- name: X74
type: binary
- name: X75
type: binary
- name: X76
type: binary
- name: X77
type: binary
- name: X78
type: binary
- name: X79
type: binary
- name: X80
type: binary
- name: X81
type: binary
- name: X82
type: binary
- name: X83
type: binary
- name: X84
type: binary
- name: X85
type: binary
- name: X86
type: binary
- name: X87
type: binary
- name: X88
type: binary
- name: X89
type: binary
- name: X90
type: binary
- name: X91
type: binary
- name: X92
type: binary
- name: X93
type: binary
- name: X94
type: binary
- name: X95
type: binary
- name: X96
type: binary
- name: X97
type: binary
- name: X98
type: binary
- name: X99
type: binary
- name: X100
type: binary
- name: X101
type: binary
- name: X102
type: binary
- name: X103
type: binary
- name: X104
type: binary
- name: X105
type: binary
- name: X106
type: binary
- name: X107
type: binary
- name: X108
type: binary
- name: X109
type: binary
- name: X110
type: binary
- name: X111
type: binary
- name: X112
type: binary
- name: X113
type: binary
- name: X114
type: binary
- name: X115
type: binary
- name: X116
type: binary
- name: X117
type: binary
- name: X118
type: binary
- name: X119
type: binary
- name: X120
type: binary
- name: X122
type: binary
- name: X123
type: binary
- name: X124
type: binary
- name: X125
type: binary
- name: X126
type: binary
- name: X127
type: binary
- name: X128
type: binary
- name: X129
type: binary
- name: X130
type: binary
- name: X131
type: binary
- name: X132
type: binary
- name: X133
type: binary
- name: X134
type: binary
- name: X135
type: binary
- name: X136
type: binary
- name: X137
type: binary
- name: X138
type: binary
- name: X139
type: binary
- name: X140
type: binary
- name: X141
type: binary
- name: X142
type: binary
- name: X143
type: binary
- name: X144
type: binary
- name: X145
type: binary
- name: X146
type: binary
- name: X147
type: binary
- name: X148
type: binary
- name: X150
type: binary
- name: X151
type: binary
- name: X152
type: binary
- name: X153
type: binary
- name: X154
type: binary
- name: X155
type: binary
- name: X156
type: binary
- name: X157
type: binary
- name: X158
type: binary
- name: X159
type: binary
- name: X160
type: binary
- name: X161
type: binary
- name: X162
type: binary
- name: X163
type: binary
- name: X164
type: binary
- name: X165
type: binary
- name: X166
type: binary
- name: X167
type: binary
- name: X168
type: binary
- name: X169
type: binary
- name: X170
type: binary
- name: X171
type: binary
- name: X172
type: binary
- name: X173
type: binary
- name: X174
type: binary
- name: X175
type: binary
- name: X176
type: binary
- name: X177
type: binary
- name: X178
type: binary
- name: X179
type: binary
- name: X180
type: binary
- name: X181
type: binary
- name: X182
type: binary
- name: X183
type: binary
- name: X184
type: binary
- name: X185
type: binary
- name: X186
type: binary
- name: X187
type: binary
- name: X189
type: binary
- name: X190
type: binary
- name: X191
type: binary
- name: X192
type: binary
- name: X194
type: binary
- name: X195
type: binary
- name: X196
type: binary
- name: X197
type: binary
- name: X198
type: binary
- name: X199
type: binary
- name: X200
type: binary
- name: X201
type: binary
- name: X202
type: binary
- name: X203
type: binary
- name: X204
type: binary
- name: X205
type: binary
- name: X206
type: binary
- name: X207
type: binary
- name: X208
type: binary
- name: X209
type: binary
- name: X210
type: binary
- name: X211
type: binary
- name: X212
type: binary
- name: X213
type: binary
- name: X214
type: binary
- name: X215
type: binary
- name: X216
type: binary
- name: X217
type: binary
- name: X218
type: binary
- name: X219
type: binary
- name: X220
type: binary
- name: X221
type: binary
- name: X222
type: binary
- name: X223
type: binary
- name: X224
type: binary
- name: X225
type: binary
- name: X226
type: binary
- name: X227
type: binary
- name: X228
type: binary
- name: X229
type: binary
- name: X230
type: binary
- name: X231
type: binary
- name: X232
type: binary
- name: X233
type: binary
- name: X234
type: binary
- name: X235
type: binary
- name: X236
type: binary
- name: X237
type: binary
- name: X238
type: binary
- name: X239
type: binary
- name: X240
type: binary
- name: X241
type: binary
- name: X242
type: binary
- name: X243
type: binary
- name: X244
type: binary
- name: X245
type: binary
- name: X246
type: binary
- name: X247
type: binary
- name: X248
type: binary
- name: X249
type: binary
- name: X250
type: binary
- name: X251
type: binary
- name: X252
type: binary
- name: X253
type: binary
- name: X254
type: binary
- name: X255
type: binary
- name: X256
type: binary
- name: X257
type: binary
- name: X258
type: binary
- name: X259
type: binary
- name: X260
type: binary
- name: X261
type: binary
- name: X262
type: binary
- name: X263
type: binary
- name: X264
type: binary
- name: X265
type: binary
- name: X266
type: binary
- name: X267
type: binary
- name: X268
type: binary
- name: X269
type: binary
- name: X270
type: binary
- name: X271
type: binary
- name: X272
type: binary
- name: X273
type: binary
- name: X274
type: binary
- name: X275
type: binary
- name: X276
type: binary
- name: X277
type: binary
- name: X278
type: binary
- name: X279
type: binary
- name: X280
type: binary
- name: X281
type: binary
- name: X282
type: binary
- name: X283
type: binary
- name: X284
type: binary
- name: X285
type: binary
- name: X286
type: binary
- name: X287
type: binary
- name: X288
type: binary
- name: X289
type: binary
- name: X290
type: binary
- name: X291
type: binary
- name: X292
type: binary
- name: X293
type: binary
- name: X294
type: binary
- name: X295
type: binary
- name: X296
type: binary
- name: X297
type: binary
- name: X298
type: binary
- name: X299
type: binary
- name: X300
type: binary
- name: X301
type: binary
- name: X302
type: binary
- name: X304
type: binary
- name: X305
type: binary
- name: X306
type: binary
- name: X307
type: binary
- name: X308
type: binary
- name: X309
type: binary
- name: X310
type: binary
- name: X311
type: binary
- name: X312
type: binary
- name: X313
type: binary
- name: X314
type: binary
- name: X315
type: binary
- name: X316
type: binary
- name: X317
type: binary
- name: X318
type: binary
- name: X319
type: binary
- name: X320
type: binary
- name: X321
type: binary
- name: X322
type: binary
- name: X323
type: binary
- name: X324
type: binary
- name: X325
type: binary
- name: X326
type: binary
- name: X327
type: binary
- name: X328
type: binary
- name: X329
type: binary
- name: X330
type: binary
- name: X331
type: binary
- name: X332
type: binary
- name: X333
type: binary
- name: X334
type: binary
- name: X335
type: binary
- name: X336
type: binary
- name: X337
type: binary
- name: X338
type: binary
- name: X339
type: binary
- name: X340
type: binary
- name: X341
type: binary
- name: X342
type: binary
- name: X343
type: binary
- name: X344
type: binary
- name: X345
type: binary
- name: X346
type: binary
- name: X347
type: binary
- name: X348
type: binary
- name: X349
type: binary
- name: X350
type: binary
- name: X351
type: binary
- name: X352
type: binary
- name: X353
type: binary
- name: X354
type: binary
- name: X355
type: binary
- name: X356
type: binary
- name: X357
type: binary
- name: X358
type: binary
- name: X359
type: binary
- name: X360
type: binary
- name: X361
type: binary
- name: X362
type: binary
- name: X363
type: binary
- name: X364
type: binary
- name: X365
type: binary
- name: X366
type: binary
- name: X367
type: binary
- name: X368
type: binary
- name: X369
type: binary
- name: X370
type: binary
- name: X371
type: binary
- name: X372
type: binary
- name: X373
type: binary
- name: X374
type: binary
- name: X375
type: binary
- name: X376
type: binary
- name: X377
type: binary
- name: X378
type: binary
- name: X379
type: binary
- name: X380
type: binary
- name: X382
type: binary
- name: X383
type: binary
- name: X384
type: binary
- name: X385
type: binary
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/mnist_default.yaml
================================================
output_features:
- name: label
type: category
input_features:
- name: image_path
type: image
preprocessing:
num_processes: 4
encoder: stacked_cnn
conv_layers:
- num_filters: 32
filter_size: 3
pool_size: 2
pool_stride: 2
- num_filters: 64
filter_size: 3
pool_size: 2
pool_stride: 2
dropout: 0.4
fc_layers:
- output_size: 128
dropout: 0.4
trainer:
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/mushroom_edibility_default.yaml
================================================
output_features:
- name: class
type: category
input_features:
- name: cap-shape
type: category
- name: cap-surface
type: category
- name: cap-color
type: category
- name: bruises?
type: category
- name: odor
type: category
- name: gill-attachment
type: category
- name: gill-spacing
type: category
- name: gill-size
type: category
- name: gill-color
type: category
- name: stalk-shape
type: category
- name: stalk-root
type: category
- name: stalk-surface-above-ring
type: category
- name: stalk-surface-below-ring
type: category
- name: stalk-color-above-ring
type: category
- name: stalk-color-below-ring
type: category
- name: veil-type
type: category
- name: veil-color
type: category
- name: ring-number
type: category
- name: ring-type
type: category
- name: spore-print-color
type: category
- name: population
type: category
- name: habitat
type: category
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/otto_group_product_default.yaml
================================================
output_features:
- name: target
type: category
input_features:
- name: feat_1
type: number
- name: feat_2
type: number
- name: feat_3
type: number
- name: feat_4
type: number
- name: feat_5
type: number
- name: feat_6
type: number
- name: feat_7
type: number
- name: feat_8
type: number
- name: feat_9
type: number
- name: feat_10
type: number
- name: feat_11
type: number
- name: feat_12
type: number
- name: feat_13
type: number
- name: feat_14
type: number
- name: feat_15
type: number
- name: feat_16
type: number
- name: feat_17
type: number
- name: feat_18
type: number
- name: feat_19
type: number
- name: feat_20
type: number
- name: feat_21
type: number
- name: feat_22
type: number
- name: feat_23
type: number
- name: feat_24
type: number
- name: feat_25
type: number
- name: feat_26
type: number
- name: feat_27
type: number
- name: feat_28
type: number
- name: feat_29
type: number
- name: feat_30
type: number
- name: feat_31
type: number
- name: feat_32
type: number
- name: feat_33
type: number
- name: feat_34
type: number
- name: feat_35
type: number
- name: feat_36
type: number
- name: feat_37
type: number
- name: feat_38
type: number
- name: feat_39
type: number
- name: feat_40
type: number
- name: feat_41
type: number
- name: feat_42
type: number
- name: feat_43
type: number
- name: feat_44
type: number
- name: feat_45
type: number
- name: feat_46
type: number
- name: feat_47
type: number
- name: feat_48
type: number
- name: feat_49
type: number
- name: feat_50
type: number
- name: feat_51
type: number
- name: feat_52
type: number
- name: feat_53
type: number
- name: feat_54
type: number
- name: feat_55
type: number
- name: feat_56
type: number
- name: feat_57
type: number
- name: feat_58
type: number
- name: feat_59
type: number
- name: feat_60
type: number
- name: feat_61
type: number
- name: feat_62
type: number
- name: feat_63
type: number
- name: feat_64
type: number
- name: feat_65
type: number
- name: feat_66
type: number
- name: feat_67
type: number
- name: feat_68
type: number
- name: feat_69
type: number
- name: feat_70
type: number
- name: feat_71
type: number
- name: feat_72
type: number
- name: feat_73
type: number
- name: feat_74
type: number
- name: feat_75
type: number
- name: feat_76
type: number
- name: feat_77
type: number
- name: feat_78
type: number
- name: feat_79
type: number
- name: feat_80
type: number
- name: feat_81
type: number
- name: feat_82
type: number
- name: feat_83
type: number
- name: feat_84
type: number
- name: feat_85
type: number
- name: feat_86
type: number
- name: feat_87
type: number
- name: feat_88
type: number
- name: feat_89
type: number
- name: feat_90
type: number
- name: feat_91
type: number
- name: feat_92
type: number
- name: feat_93
type: number
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/poker_hand_default.yaml
================================================
output_features:
- name: hand
type: category
input_features:
- name: S1
type: category
- name: C1
type: category
- name: S2
type: category
- name: C2
type: category
- name: S3
type: category
- name: C3
type: category
- name: S4
type: category
- name: C4
type: category
- name: S5
type: category
- name: C5
type: category
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/porto_seguro_safe_driver_default.yaml
================================================
output_features:
- name: target
type: binary
input_features:
- name: ps_ind_01
type: number
- name: ps_ind_02_cat
type: category
- name: ps_ind_03
type: number
- name: ps_ind_04_cat
type: category
- name: ps_ind_05_cat
type: category
- name: ps_ind_06_bin
type: binary
- name: ps_ind_07_bin
type: binary
- name: ps_ind_08_bin
type: binary
- name: ps_ind_09_bin
type: binary
- name: ps_ind_10_bin
type: binary
- name: ps_ind_11_bin
type: binary
- name: ps_ind_12_bin
type: binary
- name: ps_ind_13_bin
type: binary
- name: ps_ind_14
type: number
- name: ps_ind_15
type: number
- name: ps_ind_16_bin
type: binary
- name: ps_ind_17_bin
type: binary
- name: ps_ind_18_bin
type: binary
- name: ps_reg_01
type: number
- name: ps_reg_02
type: number
- name: ps_reg_03
type: number
- name: ps_car_01_cat
type: category
- name: ps_car_02_cat
type: category
- name: ps_car_03_cat
type: category
- name: ps_car_04_cat
type: category
- name: ps_car_05_cat
type: category
- name: ps_car_06_cat
type: category
- name: ps_car_07_cat
type: category
- name: ps_car_08_cat
type: category
- name: ps_car_09_cat
type: category
- name: ps_car_10_cat
type: category
- name: ps_car_11_cat
type: category
- name: ps_car_11
type: number
- name: ps_car_12
type: number
- name: ps_car_13
type: number
- name: ps_car_14
type: number
- name: ps_car_15
type: number
- name: ps_calc_01
type: number
- name: ps_calc_02
type: number
- name: ps_calc_03
type: number
- name: ps_calc_04
type: number
- name: ps_calc_05
type: number
- name: ps_calc_06
type: number
- name: ps_calc_07
type: number
- name: ps_calc_08
type: number
- name: ps_calc_09
type: number
- name: ps_calc_10
type: number
- name: ps_calc_11
type: number
- name: ps_calc_12
type: number
- name: ps_calc_13
type: number
- name: ps_calc_14
type: number
- name: ps_calc_15_bin
type: binary
- name: ps_calc_16_bin
type: binary
- name: ps_calc_17_bin
type: binary
- name: ps_calc_18_bin
type: binary
- name: ps_calc_19_bin
type: binary
- name: ps_calc_20_bin
type: binary
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/synthetic_fraud_default.yaml
================================================
output_features:
- name: isFraud
type: binary
input_features:
- name: step
type: number
- name: type
type: category
- name: amount
type: number
- name: oldbalanceOrg
type: number
- name: newbalanceOrig
type: number
- name: oldbalanceDest
type: number
- name: newbalanceDest
type: number
combiner:
type: concat
num_fc_layers: 3
fc_size: 128
dropout: 0.1
training:
batch_size: 256
learning_rate: .001
epochs: 1
================================================
FILE: ludwig/datasets/model_configs/titanic_default.yaml
================================================
output_features:
- name: Survived
type: binary
input_features:
- name: Pclass
type: category
- name: Sex
type: category
- name: Age
type: number
preprocessing:
missing_value_strategy: fill_with_mean
- name: SibSp
type: number
- name: Parch
type: number
- name: Fare
type: number
preprocessing:
missing_value_strategy: fill_with_mean
- name: Embarked
type: category
training:
batch_size: 256
epochs: 1
================================================
FILE: ludwig/datasets/utils.py
================================================
import os
from functools import lru_cache
import yaml
from ludwig.api_annotations import PublicAPI
from ludwig.datasets import model_configs
@PublicAPI
def model_configs_for_dataset(dataset_name: str) -> dict[str, dict]:
"""Returns a dictionary of built-in model configs for the specified dataset.
Maps config name to ludwig config dict.
"""
return _get_model_configs(dataset_name)
@lru_cache(maxsize=3)
def _get_model_configs(dataset_name: str) -> dict[str, dict]:
"""Returns all model configs for the specified dataset.
Model configs are named <dataset_name>_<config_name>.yaml
"""
import importlib.resources
config_filenames = [
f.name
for f in importlib.resources.files(model_configs).iterdir()
if f.name.endswith(".yaml") and f.name.startswith(dataset_name)
]
configs = {}
for config_filename in config_filenames:
basename = os.path.splitext(config_filename)[0]
config_name = basename[len(dataset_name) + 1 :]
configs[config_name] = _load_model_config(config_filename)
return configs
def _load_model_config(model_config_filename: str):
"""Loads a model config."""
model_config_path = os.path.join(os.path.dirname(model_configs.__file__), model_config_filename)
with open(model_config_path) as f:
return yaml.safe_load(f)
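# Illustrative sketch (not part of the module): how the `<dataset_name>_<config_name>.yaml`
# naming convention used by `_get_model_configs` maps filenames to config names.
# The helper names `split_config_filename` and `select_config_names` are hypothetical,
# and the filename list stands in for the contents of the `model_configs` package.

```python
import os


def split_config_filename(dataset_name: str, config_filename: str) -> str:
    """Return the config name encoded in a `<dataset_name>_<config_name>.yaml` filename."""
    basename = os.path.splitext(config_filename)[0]
    # Skip the dataset name plus the separating underscore.
    return basename[len(dataset_name) + 1 :]


def select_config_names(dataset_name: str, filenames: list[str]) -> dict[str, str]:
    """Map config name -> filename for the YAML files that start with `dataset_name`."""
    return {
        split_config_filename(dataset_name, f): f
        for f in filenames
        if f.endswith(".yaml") and f.startswith(dataset_name)
    }


# Hypothetical directory listing, used only for illustration.
names = select_config_names("mnist", ["mnist_default.yaml", "titanic_default.yaml", "README.md"])
print(names)
```

# Because selection is a plain prefix test, a dataset name that is a prefix of another
# (for example "poker" against "poker_hand_default.yaml") would also be picked up, with
# the remainder ("hand_default") treated as the config name.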
================================================
FILE: ludwig/decoders/__init__.py
================================================
# register all decoders
import ludwig.decoders.generic_decoders # noqa
import ludwig.decoders.image_decoders # noqa
import ludwig.decoders.llm_decoders # noqa
import ludwig.decoders.sequence_decoders # noqa
import ludwig.decoders.sequence_tagger # noqa
================================================
FILE: ludwig/decoders/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from abc import ABC, abstractmethod
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.torch_utils import LudwigModule
@DeveloperAPI
class Decoder(LudwigModule, ABC):
@abstractmethod
def forward(self, inputs, mask=None):
raise NotImplementedError
@property
def name(self):
return self.__class__.__name__
================================================
FILE: ludwig/decoders/generic_decoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from functools import partial
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, CATEGORY, CATEGORY_DISTRIBUTION, LOSS, NUMBER, SET, TIMESERIES, TYPE, VECTOR
from ludwig.decoders.base import Decoder
from ludwig.decoders.registry import register_decoder
from ludwig.schema.decoders.base import ClassifierConfig, PassthroughDecoderConfig, ProjectorConfig, RegressorConfig
from ludwig.utils.torch_utils import Dense, get_activation
logger = logging.getLogger(__name__)
@DeveloperAPI
# TODO(Arnav): Re-enable once we add DotProduct Combiner: https://github.com/ludwig-ai/ludwig/issues/3150
# @register_decoder("passthrough", [BINARY, CATEGORY, NUMBER, SET, VECTOR, SEQUENCE, TEXT])
class PassthroughDecoder(Decoder):
def __init__(self, input_size: int = 1, num_classes: int | None = None, decoder_config=None, **kwargs):
super().__init__()
self.config = decoder_config
logger.debug(f" {self.name}")
self.input_size = input_size
self.num_classes = num_classes
def forward(self, inputs, **kwargs):
return inputs
@staticmethod
def get_schema_cls():
return PassthroughDecoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return self.input_shape
@DeveloperAPI
@register_decoder("regressor", [BINARY, NUMBER])
class Regressor(Decoder):
def __init__(
self,
input_size,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
logger.debug(f" {self.name}")
logger.debug(" Dense")
self.dense = Dense(
input_size=input_size,
output_size=1,
use_bias=use_bias,
weights_initializer=weights_initializer,
bias_initializer=bias_initializer,
)
@staticmethod
def get_schema_cls():
return RegressorConfig
@property
def input_shape(self):
return self.dense.input_shape
def forward(self, inputs, **kwargs):
return self.dense(inputs)
@DeveloperAPI
@register_decoder("projector", [VECTOR, TIMESERIES])
class Projector(Decoder):
def __init__(
self,
input_size,
output_size,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
activation=None,
multiplier=1.0,
clip=None,
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
logger.debug(f" {self.name}")
logger.debug(" Dense")
self.dense = Dense(
input_size=input_size,
output_size=output_size,
use_bias=use_bias,
weights_initializer=weights_initializer,
bias_initializer=bias_initializer,
)
self.activation = get_activation(activation)
self.multiplier = multiplier
if clip is not None:
if isinstance(clip, (list, tuple)) and len(clip) == 2:
self.clip = partial(torch.clip, min=clip[0], max=clip[1])
else:
raise ValueError(
f"The clip parameter of {self.name} is {clip}. "
"It must be a list or a tuple of length 2."
)
else:
self.clip = None
@staticmethod
def get_schema_cls():
return ProjectorConfig
@property
def input_shape(self):
return self.dense.input_shape
def forward(self, inputs, **kwargs):
values = self.activation(self.dense(inputs)) * self.multiplier
if self.clip:
values = self.clip(values)
return values
@DeveloperAPI
@register_decoder("classifier", [CATEGORY, CATEGORY_DISTRIBUTION, SET])
class Classifier(Decoder):
def __init__(
self,
input_size,
num_classes,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
logger.debug(f" {self.name}")
logger.debug(" Dense")
self.num_classes = num_classes
self.dense = Dense(
input_size=input_size,
output_size=num_classes,
use_bias=use_bias,
weights_initializer=weights_initializer,
bias_initializer=bias_initializer,
)
self.sampled_loss = False
if LOSS in kwargs and TYPE in kwargs[LOSS] and kwargs[LOSS][TYPE] is not None:
self.sampled_loss = kwargs[LOSS][TYPE].startswith("sampled")
@staticmethod
def get_schema_cls():
return ClassifierConfig
@property
def input_shape(self):
return self.dense.input_shape
def forward(self, inputs, **kwargs):
return self.dense(inputs)
================================================
FILE: ludwig/decoders/image_decoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Aizen Corp.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT_STATE, HIDDEN, IMAGE, LOGITS, PREDICTIONS
from ludwig.decoders.base import Decoder
from ludwig.decoders.registry import register_decoder
from ludwig.modules.convolutional_modules import UNetUpStack
from ludwig.schema.decoders.image_decoders import ImageDecoderConfig, UNetDecoderConfig
logger = logging.getLogger(__name__)
@DeveloperAPI
@register_decoder("unet", IMAGE)
class UNetDecoder(Decoder):
def __init__(
self,
input_size: int,
height: int,
width: int,
num_channels: int = 1,
num_classes: int = 2,
conv_norm: str | None = None,
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
self.num_classes = num_classes
logger.debug(f" {self.name}")
if num_classes < 2:
raise ValueError(f"Invalid `num_classes` {num_classes} for unet decoder")
if height % 16 or width % 16:
raise ValueError(f"Invalid `height` {height} or `width` {width} for unet decoder")
self.unet = UNetUpStack(
img_height=height,
img_width=width,
out_channels=num_classes,
norm=conv_norm,
)
self.input_reshape = list(self.unet.input_shape)
self.input_reshape.insert(0, -1)
self._output_shape = (height, width)
def forward(self, combiner_outputs: dict[str, torch.Tensor], target: torch.Tensor):
hidden = combiner_outputs[HIDDEN]
skips = combiner_outputs[ENCODER_OUTPUT_STATE]
# unflatten combiner outputs
hidden = hidden.reshape(self.input_reshape)
logits = self.unet(hidden, skips)
predictions = logits.argmax(dim=1).squeeze(1).byte()
return {LOGITS: logits, PREDICTIONS: predictions}
def get_prediction_set(self):
return {LOGITS, PREDICTIONS}
@staticmethod
def get_schema_cls() -> type[ImageDecoderConfig]:
return UNetDecoderConfig
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return self.unet.input_shape
================================================
FILE: ludwig/decoders/llm_decoders.py
================================================
import logging
import re
from typing import Any
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import CATEGORY, LOGITS, PREDICTIONS, PROBABILITIES, TEXT
from ludwig.decoders.base import Decoder
from ludwig.decoders.registry import register_decoder
from ludwig.decoders.utils import extract_generated_tokens
from ludwig.schema.decoders.llm_decoders import CategoryExtractorDecoderConfig, TextExtractorDecoderConfig
from ludwig.utils.strings_utils import get_tokenizer
logger = logging.getLogger(__name__)
# TODO(Arnav): Refactor to split into strategies like splitters
class Matcher:
def __init__(self, match: dict[str, dict[str, Any]]):
self.match = match
def contains(self, decoded_input: str, value: str) -> bool:
return value in decoded_input
def regex(self, decoded_input: str, regex_pattern: str) -> bool:
"""Perform a regex match on a given text using a specified regex pattern.
Parameters:
decoded_input (str): The text to perform the match on.
regex_pattern (str): The regex pattern to use for the match.
Returns:
True if the pattern matches at least once, False otherwise
(including when the pattern cannot be compiled).
"""
# Compile the regex pattern
matches = []
try:
regex = re.compile(regex_pattern)
# Perform the match
matches = regex.findall(decoded_input)
except Exception:
logger.warning(f"Regex pattern {regex_pattern} could not be compiled.")
# If there is a match, matches is a non-empty list, so we can use this
# to infer if there was a match or not and return a bool
return len(matches) > 0
def __call__(self, decoded_input: str) -> str | None:
# Greedy match on first label that matches the input
for label, label_def in self.match.items():
label_def_type = label_def["type"]
label_def_value = label_def["value"]
if label_def_type == "contains":
is_match = self.contains(decoded_input, label_def_value)
elif label_def_type == "regex":
is_match = self.regex(decoded_input, label_def_value)
else:
raise ValueError(
f"{label_def_type} is not a valid match `type`. Ludwig "
"currently supports `contains` and `regex` match types."
)
if is_match:
return label
return None
@DeveloperAPI
@register_decoder("text_extractor", [TEXT])
class TextExtractorDecoder(Decoder):
def __init__(
self,
input_size: int,
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
self.input_size = input_size
# Tokenizer
self.tokenizer_type = self.config.tokenizer
self.pretrained_model_name_or_path = self.config.pretrained_model_name_or_path
self.vocab_file = self.config.vocab_file
# Load tokenizer required for decoding the output from the generate
# function of the text input feature for LLMs.
self.tokenizer = get_tokenizer(self.tokenizer_type, self.vocab_file, self.pretrained_model_name_or_path)
if hasattr(self.tokenizer, "tokenizer"):
# Transformer Tokenizers
self.tokenizer_vocab_size = self.tokenizer.tokenizer.vocab_size
else:
# TorchText Tokenizers
self.tokenizer_vocab_size = len(self.tokenizer.vocab)
# Maximum number of new tokens that will be generated
# TODO(geoffrey): figure out where self.max_sequence_length is used; if it is unused, consider removing it.
# It's confusing to have both this and `max_new_tokens` as a mandatory param in the `forward` function.
self.max_sequence_length = self.config.max_new_tokens
@staticmethod
def get_schema_cls():
return TextExtractorDecoderConfig
@property
def input_shape(self):
return self.input_size
def get_prediction_set(self):
return {LOGITS, PREDICTIONS, PROBABILITIES}
def forward(self, inputs: list[torch.Tensor], input_lengths: list[int], max_new_tokens: int):
# Extract the sequences tensor from the LLMs forward pass
generated_outputs = extract_generated_tokens(
raw_generated_output_sequences=inputs,
input_lengths=input_lengths,
max_new_tokens=max_new_tokens,
pad_sequence=True,
)
# Stack the predictions for each example in the batch. The padding should ensure they are all the same shape.
for output in generated_outputs:
if output.shape[0] > max_new_tokens:
raise ValueError(
f"Output {output} is longer than the max_new_tokens {max_new_tokens} during decoding. "
"This should never happen; please file an issue on GitHub."
)
generated_outputs = torch.stack(generated_outputs, dim=0)
outputs_device = generated_outputs.device
return {
PREDICTIONS: generated_outputs,
# TODO(Arnav): Add support for probabilities and logits
PROBABILITIES: torch.zeros((len(generated_outputs), max_new_tokens, self.tokenizer_vocab_size)).to(
outputs_device
),
LOGITS: torch.zeros((len(generated_outputs), max_new_tokens, self.tokenizer_vocab_size)).to(outputs_device),
}
@DeveloperAPI
@register_decoder("category_extractor", [CATEGORY])
class CategoryExtractorDecoder(Decoder):
def __init__(
self,
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
self.input_size = self.config.input_size
self.fallback_label = self.config.fallback_label
self.str2idx = self.config.str2idx
self.vocab_size = len(self.config.str2idx)
# Create Matcher object to perform matching on the decoded output
self.matcher = Matcher(self.config.match)
# Tokenizer
self.tokenizer_type = self.config.tokenizer
self.pretrained_model_name_or_path = self.config.pretrained_model_name_or_path
self.vocab_file = self.config.vocab_file
# Load tokenizer required for decoding the output from the generate
# function of the text input feature for LLMs.
self.tokenizer = get_tokenizer(self.tokenizer_type, self.vocab_file, self.pretrained_model_name_or_path)
@staticmethod
def get_schema_cls():
return CategoryExtractorDecoderConfig
@property
def input_shape(self):
return self.input_size
def get_prediction_set(self):
return {LOGITS, PREDICTIONS, PROBABILITIES}
def forward(self, inputs: list[torch.Tensor], input_lengths: list[int], max_new_tokens: int):
# Extract the sequences tensor from the LLMs forward pass
generated_outputs = extract_generated_tokens(
raw_generated_output_sequences=inputs,
input_lengths=input_lengths,
max_new_tokens=max_new_tokens,
pad_sequence=False,
)
outputs_device = generated_outputs[0].device
# Decode generated outputs from the LLM's generate function.
decoded_outputs = self.tokenizer.tokenizer.batch_decode(generated_outputs, skip_special_tokens=True)
# Parse labels based on matching criteria and return probability vectors
matched_labels = []
probabilities = []
logits = []
for output in decoded_outputs:
output = output.lower() # Convert to lowercase for matching
matched_label = self.matcher(output)
idx = self.str2idx[matched_label] if matched_label in self.str2idx else self.str2idx[self.fallback_label]
# Append the index of the matched label
matched_labels.append(idx)
# Append the probability vector for the matched label
probability_vec = [0] * self.vocab_size
probability_vec[idx] = 1
probabilities.append(probability_vec)
# TODO(Arnav): Figure out how to compute logits. For now, we return
# a tensor of zeros.
logits.append([0] * self.vocab_size)
return {
PREDICTIONS: torch.tensor(matched_labels, device=outputs_device),
PROBABILITIES: torch.tensor(probabilities, dtype=torch.float32, device=outputs_device),
LOGITS: torch.tensor(logits, dtype=torch.float32, device=outputs_device),
}
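The label-extraction step in `CategoryExtractorDecoder.forward` can be illustrated without a tokenizer or an LLM. The sketch below is a torch-free approximation, not Ludwig's implementation: `contains_matcher` is a hypothetical stand-in for `Matcher` (whose real matching rules come from `config.match` and are not shown here), and `extract_category` mirrors the fallback and one-hot logic above using plain lists.

```python
from typing import Callable, Optional


def contains_matcher(rules: dict[str, str]) -> Callable[[str], Optional[str]]:
    """Hypothetical stand-in for Matcher: maps each label to a substring to search for."""

    def match(text: str) -> Optional[str]:
        for label, needle in rules.items():
            if needle in text:
                return label
        return None

    return match


def extract_category(decoded, matcher, str2idx, fallback_label):
    """Returns (index, one-hot probability vector) for one decoded LLM output."""
    label = matcher(decoded.lower())  # lowercase before matching, as in forward()
    idx = str2idx.get(label, str2idx[fallback_label])  # unknown label -> fallback
    probs = [0.0] * len(str2idx)
    probs[idx] = 1.0
    return idx, probs


str2idx = {"positive": 0, "negative": 1, "neutral": 2}
matcher = contains_matcher({"positive": "positive", "negative": "negative"})
hit = extract_category("The review is NEGATIVE.", matcher, str2idx, "neutral")
miss = extract_category("I cannot tell.", matcher, str2idx, "neutral")
```

Because the probability vector is one-hot by construction, it carries no confidence information, which is consistent with the TODO about logits in the decoder above.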
================================================
FILE: ludwig/decoders/registry.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.decoders.base import Decoder
from ludwig.utils.registry import Registry

_decoder_registry = Registry()


@DeveloperAPI
def get_decoder_registry() -> Registry:
    return _decoder_registry


@DeveloperAPI
def register_decoder(name: str, features: str | list[str]):
    if isinstance(features, str):
        features = [features]

    def wrap(cls):
        for feature in features:
            feature_registry = get_decoder_registry().get(feature, {})
            feature_registry[name] = cls
            get_decoder_registry()[feature] = feature_registry
        return cls

    return wrap


@DeveloperAPI
def get_decoder_cls(feature: str, name: str) -> type[Decoder]:
    return get_decoder_registry()[feature][name]


@DeveloperAPI
def get_decoder_classes(feature: str) -> dict[str, type[Decoder]]:
    return get_decoder_registry()[feature]
================================================
FILE: ludwig/decoders/sequence_decoder_utils.py
================================================
"""Utility functions related to sequence decoders."""
import torch
from ludwig.constants import ENCODER_OUTPUT_STATE, HIDDEN
from ludwig.modules.reduction_modules import SequenceReducer
def repeat_2D_tensor(tensor, k):
    """Repeats a 2D-tensor k times over the first dimension.

    For example:
    Input: Tensor of [batch_size, state_size], k=2
    Output: Tensor of [k, batch_size, state_size]
    """
    if len(tensor.size()) > 2:
        raise ValueError("Cannot repeat a non-2D tensor with this method.")
    return tensor.repeat(k, 1, 1)
def get_rnn_init_state(
    combiner_outputs: dict[str, torch.Tensor], sequence_reducer: SequenceReducer, num_layers: int
) -> torch.Tensor:
    """Computes the hidden state that the RNN decoder should start with.

    Args:
        combiner_outputs: Dictionary of tensors from the outputs of the combiner and other output features.
        sequence_reducer: SequenceReducer to reduce rank-3 to rank-2.
        num_layers: Number of layers the decoder uses.

    Returns:
        Tensor of [num_layers, batch_size, hidden_size].
    """
    if ENCODER_OUTPUT_STATE not in combiner_outputs:
        # Use the combiner's hidden state.
        encoder_output_state = combiner_outputs[HIDDEN]
    else:
        # Use the encoder's output state.
        encoder_output_state = combiner_outputs[ENCODER_OUTPUT_STATE]
        if isinstance(encoder_output_state, tuple):
            if len(encoder_output_state) == 2:
                # LSTM encoder. Use the hidden state and ignore the cell state.
                encoder_output_state = encoder_output_state[0]
            elif len(encoder_output_state) == 4:
                # Bi-directional LSTM encoder. Use the average of hidden states and ignore cell state.
                # torch.mean does not accept a list, so stack the two states first.
                encoder_output_state = torch.stack([encoder_output_state[0], encoder_output_state[2]]).mean(dim=0)
            else:
                # The state is a tuple here, so report its length rather than calling .size() on it.
                raise ValueError(
                    f"Invalid sequence decoder inputs with keys: {combiner_outputs.keys()} with extracted encoder "
                    + f"state tuple of length {len(encoder_output_state)} that was invalid. Please double check "
                    + "the compatibility of your encoder and decoder."
                )

    if len(encoder_output_state.size()) > 3:
        raise ValueError("Init state for RNN decoders only works for tensors of rank 3 or lower (encoder_output).")

    if len(encoder_output_state.size()) == 3:
        # Reduce to [batch_size, hidden_size].
        encoder_output_state = sequence_reducer(encoder_output_state)

    return repeat_2D_tensor(encoder_output_state, num_layers)
def get_lstm_init_state(
    combiner_outputs: dict[str, torch.Tensor], sequence_reducer: SequenceReducer, num_layers: int
) -> tuple[torch.Tensor, torch.Tensor]:
    """Returns the states that the LSTM decoder should start with.

    Args:
        combiner_outputs: Dictionary of tensors from the outputs of the combiner and other output features.
        sequence_reducer: SequenceReducer to reduce rank-3 to rank-2.
        num_layers: Number of layers the decoder uses.

    Returns:
        Tuple of 2 tensors (decoder hidden state, decoder cell state), each [num_layers, batch_size, hidden_size].
    """
    if ENCODER_OUTPUT_STATE not in combiner_outputs:
        # Use the combiner's hidden state.
        decoder_hidden_state = combiner_outputs[HIDDEN]
        decoder_cell_state = torch.clone(decoder_hidden_state)
    else:
        # Use the encoder's output state.
        encoder_output_state = combiner_outputs[ENCODER_OUTPUT_STATE]
        if not isinstance(encoder_output_state, tuple):
            decoder_hidden_state = encoder_output_state
            decoder_cell_state = decoder_hidden_state
        else:
            if len(encoder_output_state) == 2:
                # The encoder was probably an LSTM.
                decoder_hidden_state, decoder_cell_state = encoder_output_state
            elif len(encoder_output_state) == 4:
                # The encoder was probably a bi-LSTM.
                # Use the average of the encoder's hidden states for hidden state.
                # Use the average of the encoder's cell states for cell state.
                # torch.mean does not accept a list, so stack each pair of states first.
                decoder_hidden_state = torch.stack([encoder_output_state[0], encoder_output_state[2]]).mean(dim=0)
                decoder_cell_state = torch.stack([encoder_output_state[1], encoder_output_state[3]]).mean(dim=0)
            else:
                raise ValueError(
                    f"Invalid sequence decoder inputs with keys: {combiner_outputs.keys()} with extracted encoder "
                    + f"state: {encoder_output_state} that was invalid. Please double check the compatibility of your "
                    + "encoder and decoder."
                )

    # Check rank and reduce if necessary.
    if len(decoder_hidden_state.size()) > 3 or len(decoder_cell_state.size()) > 3:
        raise ValueError(
            f"Invalid sequence decoder inputs with keys: {combiner_outputs.keys()} with extracted encoder "
            + f"state: {decoder_hidden_state.size()} that was invalid. Please double check the compatibility "
            + "of your encoder and decoder."
        )
    if len(decoder_hidden_state.size()) == 3:
        decoder_hidden_state = sequence_reducer(decoder_hidden_state)
    if len(decoder_cell_state.size()) == 3:
        decoder_cell_state = sequence_reducer(decoder_cell_state)

    # Repeat over the number of layers.
    return repeat_2D_tensor(decoder_hidden_state, num_layers), repeat_2D_tensor(decoder_cell_state, num_layers)
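The way the initial state is chosen from the encoder output can be sketched without torch. This example uses nested Python lists in place of tensors and hypothetical helper names (`mean_states`, `pick_hidden_state`, `repeat_over_layers`). The 4-tuple layout is assumed to alternate hidden and cell states, as the indexing above implies; labelling the two pairs as forward and backward directions is an assumption for illustration.

```python
def mean_states(a, b):
    """Elementwise mean of two equally shaped [batch_size, hidden_size] nested lists."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def pick_hidden_state(encoder_output_state):
    """Selects a hidden state from a plain state, an (h, c) pair, or an (h, c, h, c) 4-tuple."""
    if not isinstance(encoder_output_state, tuple):
        return encoder_output_state
    if len(encoder_output_state) == 2:
        # LSTM-style pair: keep the hidden state, drop the cell state.
        return encoder_output_state[0]
    if len(encoder_output_state) == 4:
        # Bi-LSTM-style tuple: average the two hidden states (indices 0 and 2).
        return mean_states(encoder_output_state[0], encoder_output_state[2])
    raise ValueError(f"Unexpected state tuple of length {len(encoder_output_state)}")


def repeat_over_layers(state, num_layers):
    """Analogue of repeat_2D_tensor: one copy of the state per decoder layer."""
    return [state for _ in range(num_layers)]


h_fw, c_fw = [[1.0, 2.0]], [[9.0, 9.0]]
h_bw, c_bw = [[3.0, 6.0]], [[7.0, 7.0]]
hidden = pick_hidden_state((h_fw, c_fw, h_bw, c_bw))
init_state = repeat_over_layers(hidden, num_layers=2)
```

With real tensors, the averaging step corresponds to stacking the two states and taking the mean over the stacked dimension.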
================================================
FILE: ludwig/decoders/sequence_decoders.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
import torch.nn as nn
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import LOGITS, PREDICTIONS, PROBABILITIES, SEQUENCE, TEXT
from ludwig.decoders.base import Decoder
from ludwig.decoders.registry import register_decoder
from ludwig.decoders.sequence_decoder_utils import get_lstm_init_state, get_rnn_init_state
from ludwig.modules.reduction_modules import SequenceReducer
from ludwig.schema.decoders.sequence_decoders import SequenceGeneratorDecoderConfig
from ludwig.utils import strings_utils
logger = logging.getLogger(__name__)
@DeveloperAPI
class RNNDecoder(nn.Module):
"""GRU or RNN-based decoder."""
def __init__(self, hidden_size: int, vocab_size: int, cell_type: str, num_layers: int = 1):
super().__init__()
self.hidden_size = hidden_size
self.vocab_size = vocab_size
self.embedding = nn.Embedding(vocab_size, hidden_size)
if cell_type == "gru":
self.rnn = nn.GRU(hidden_size, hidden_size, num_layers=num_layers, batch_first=True)
else:
self.rnn = nn.RNN(hidden_size, hidden_size, num_layers=num_layers, batch_first=True)
self.out = nn.Linear(hidden_size, vocab_size)
# Have the embedding and projection share weights.
# This is a trick used by the Transformer, and seems to attain better loss.
# See section 3.4 of https://arxiv.org/pdf/1706.03762.pdf.
self.out.weight = self.embedding.weight
def forward(self, input: torch.Tensor, hidden: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
"""Runs a single decoding time step.
Modeled off of https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
Args:
input: [batch_size] tensor with the previous step's predicted symbol.
hidden: [num_layers, batch_size, hidden_size] tensor with the previous step's hidden state.
Returns:
Tuple of two tensors:
- output: [batch_size, 1, vocab_size] tensor with the logits.
- hidden: [num_layers, batch_size, hidden_size] tensor with the hidden state for the next time step.
"""
# Unsqueeze predicted tokens.
input = input.unsqueeze(1).to(torch.int)
output = self.embedding(input)
output, hidden = self.rnn(output, hidden)
output_logits = self.out(output)
return output_logits, hidden
@DeveloperAPI
class LSTMDecoder(nn.Module):
"""LSTM-based decoder."""
def __init__(self, hidden_size: int, vocab_size: int, num_layers: int = 1):
super().__init__()
self.hidden_size = hidden_size
self.vocab_size = vocab_size
self.embedding = nn.Embedding(vocab_size, hidden_size)
self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True, num_layers=num_layers)
self.out = nn.Linear(hidden_size, vocab_size)
# Have the embedding and projection share weights.
# This is a trick used by the Transformer, and seems to attain better loss.
# See section 3.4 of https://arxiv.org/pdf/1706.03762.pdf.
self.out.weight = self.embedding.weight
def forward(
self, input: torch.Tensor, hidden_state: torch.Tensor, cell_state: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""Runs a single decoding time step.
Modeled off of https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
Args:
input: [batch_size] tensor with the previous step's predicted symbol.
hidden_state: [num_layers, batch_size, hidden_size] tensor with the previous step's hidden state.
cell_state: [num_layers, batch_size, hidden_size] tensor with the previous step's cell state.
Returns:
Tuple of 3 tensors:
- output: [batch_size, 1, vocab_size] tensor with the logits.
- hidden_state: [num_layers, batch_size, hidden_size] tensor with the hidden state for the next time step.
- cell_state: [num_layers, batch_size, hidden_size] tensor with the cell state for the next time step.
"""
# Unsqueeze predicted tokens.
input = input.unsqueeze(1).to(torch.int)
output = self.embedding(input)
output, (hidden_state, cell_state) = self.lstm(output, (hidden_state, cell_state))
output_logits = self.out(output)
return output_logits, hidden_state, cell_state
@DeveloperAPI
class SequenceRNNDecoder(nn.Module):
"""RNN-based decoder over multiple time steps."""
def __init__(
self,
hidden_size: int,
vocab_size: int,
max_sequence_length: int,
cell_type: str,
num_layers: int = 1,
reduce_input="sum",
):
super().__init__()
self.hidden_size = hidden_size
self.vocab_size = vocab_size
self.rnn_decoder = RNNDecoder(hidden_size, vocab_size, cell_type, num_layers=num_layers)
self.max_sequence_length = max_sequence_length
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_input)
self.num_layers = num_layers
self.register_buffer("logits", torch.zeros([max_sequence_length, vocab_size]))
self.register_buffer("decoder_input", torch.Tensor([strings_utils.SpecialSymbol.START.value]))
def forward(self, combiner_outputs: dict[str, torch.Tensor], target: torch.Tensor):
"""Runs max_sequence_length RNN decoding time steps.
Args:
combiner_outputs: Dictionary of tensors from the outputs of the combiner and other output features.
target: Tensor [batch_size, max_sequence_length] with target symbols.
Returns:
Tensor of logits [batch_size, max_sequence_length, vocab_size].
"""
# Prepare the encoder output state.
decoder_hidden = get_rnn_init_state(combiner_outputs, self.reduce_sequence, self.num_layers)
batch_size = decoder_hidden.size()[1]
# Tensor to store decoder output logits.
logits = self.logits.unsqueeze(0).repeat(batch_size, 1, 1)
# Initialize the decoder with start symbols.
decoder_input = self.decoder_input.repeat(batch_size)
# Unsqueeze to account for extra multilayer dimension.
# decoder_hidden = encoder_output_state.unsqueeze(0)
# Decode until max length.
for di in range(self.max_sequence_length):
decoder_output, decoder_hidden = self.rnn_decoder(decoder_input, decoder_hidden)
# decoder_output: [batch_size, 1, vocab_size]
# Squeeze out the single time-step dimension and save logits.
logits[:, di, :] = decoder_output.squeeze(1)
# Determine inputs for next time step.
# Teacher forcing helps the model converge faster, but the trained network may be unstable when it later
# runs on its own predictions: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf.
# TODO: Use a configurable ratio for how often to use teacher forcing during training.
if target is None:
_, topi = decoder_output.topk(1)
# Squeeze out the time-step and top-k dimensions.
decoder_input = topi.squeeze(1).squeeze(1).detach() # detach from history as input
else:
# Teacher forcing.
decoder_input = target[:, di]
return logits
@DeveloperAPI
class SequenceLSTMDecoder(nn.Module):
"""LSTM-based decoder over multiple time steps."""
def __init__(
self,
hidden_size: int,
vocab_size: int,
max_sequence_length: int,
reduce_input: str = "sum",
num_layers: int = 1,
):
super().__init__()
self.hidden_size = hidden_size
self.vocab_size = vocab_size
self.lstm_decoder = LSTMDecoder(hidden_size, vocab_size, num_layers)
self.max_sequence_length = max_sequence_length
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_input)
self.num_layers = num_layers
self.register_buffer("logits", torch.zeros([max_sequence_length, vocab_size]))
self.register_buffer("decoder_input", torch.Tensor([strings_utils.SpecialSymbol.START.value]))
def forward(self, combiner_outputs: dict[str, torch.Tensor], target: torch.Tensor) -> torch.Tensor:
"""Runs max_sequence_length LSTM decoding time steps.
Args:
combiner_outputs: Dictionary of tensors from the outputs of the combiner and other output features.
target: Tensor [batch_size, max_sequence_length] with target symbols.
Returns:
Tensor of logits [batch_size, max_sequence_length, vocab_size].
"""
# Prepare the decoder initial state.
decoder_hidden, decoder_cell_state = get_lstm_init_state(
combiner_outputs, self.reduce_sequence, self.num_layers
)
batch_size = decoder_hidden.size()[1]
# Initialize the decoder with start symbols.
decoder_input = self.decoder_input.repeat(batch_size)
# Tensor to store decoder output logits.
logits = self.logits.unsqueeze(0).repeat(batch_size, 1, 1)
# Decode until max length.
for di in range(self.max_sequence_length):
decoder_output, decoder_hidden, decoder_cell_state = self.lstm_decoder(
decoder_input, decoder_hidden, decoder_cell_state
)
# decoder_output: [batch_size, 1, vocab_size]
# Squeeze out the single time-step dimension and save logits.
logits[:, di, :] = decoder_output.squeeze(1)
# Determine inputs for next time step.
# Teacher forcing helps the model converge faster, but the trained network may be unstable when it later
# runs on its own predictions: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf.
# TODO: Use a configurable ratio for how often to use teacher forcing during training.
if target is None:
_, topi = decoder_output.topk(1)
# Squeeze out the time-step and top-k dimensions.
decoder_input = topi.squeeze(1).squeeze(1).detach() # detach from history as input
else:
# Teacher forcing.
decoder_input = target[:, di]
return logits
@DeveloperAPI
@register_decoder("generator", [SEQUENCE, TEXT])
class SequenceGeneratorDecoder(Decoder):
"""Dispatcher for different sequence generator decoders."""
def __init__(
self,
vocab_size: int,
max_sequence_length: int,
cell_type: str = "gru",
input_size: int = 256,
reduce_input: str = "sum",
num_layers: int = 1,
decoder_config=None,
**kwargs,
):
"""
Args:
vocab_size: Vocab size.
max_sequence_length: Maximum sequence length.
cell_type: Type of RNN cell to use. 'rnn', 'gru', or 'lstm'.
input_size: Size of incoming combiner output.
reduce_input: Mode with which to reduce incoming combiner output, if needed.
num_layers: Number of layers for the RNN decoders.
"""
super().__init__()
self.config = decoder_config
self.vocab_size = vocab_size
self.input_size = input_size
self.max_sequence_length = max_sequence_length
if cell_type == "lstm":
self.rnn_decoder = SequenceLSTMDecoder(
hidden_size=input_size,
vocab_size=vocab_size,
max_sequence_length=max_sequence_length,
reduce_input=reduce_input,
num_layers=num_layers,
)
else:
self.rnn_decoder = SequenceRNNDecoder(
hidden_size=input_size,
vocab_size=vocab_size,
max_sequence_length=max_sequence_length,
cell_type=cell_type,
reduce_input=reduce_input,
num_layers=num_layers,
)
def forward(
self, combiner_outputs: dict[str, torch.Tensor], target: torch.Tensor = None
) -> dict[str, torch.Tensor]:
"""Decodes combiner_outputs into a sequence.
Args:
combiner_outputs: Dictionary of tensors from the outputs of the combiner and other output features.
target: Tensor [batch_size, max_sequence_length] with target symbols.
Returns:
Dictionary of tensors of logits [batch_size, max_sequence_length, vocab_size].
"""
logits = self.rnn_decoder(combiner_outputs, target)
return {LOGITS: logits}
def get_prediction_set(self):
return {LOGITS, PREDICTIONS, PROBABILITIES}
@staticmethod
def get_schema_cls():
return SequenceGeneratorDecoderConfig
@property
def input_shape(self):
# Dummy implementation.
return torch.Size([1])
@property
def output_shape(self):
return torch.Size([self.max_sequence_length, self.vocab_size])
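The decoding loops in `SequenceRNNDecoder` and `SequenceLSTMDecoder` switch between teacher forcing and feeding back the greedy prediction depending on whether `target` is given. The sketch below isolates that control flow in plain Python. `toy_step` is a hypothetical deterministic "decoder" and `START = 1` is an assumed start-symbol id, so only the loop structure reflects the code above.

```python
START = 1  # assumed start-symbol id, for illustration only


def toy_step(prev_token, vocab_size):
    """Hypothetical one-step decoder: puts all the score on (prev_token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(prev_token + 1) % vocab_size] = 1.0
    return logits


def decode(max_len, vocab_size, target=None):
    """Returns (inputs fed at each step, greedy predictions at each step)."""
    token = START
    fed, predicted = [], []
    for t in range(max_len):
        fed.append(token)
        logits = toy_step(token, vocab_size)
        best = max(range(vocab_size), key=logits.__getitem__)  # greedy top-1
        predicted.append(best)
        # Teacher forcing feeds the ground truth; otherwise feed back the prediction.
        token = target[t] if target is not None else best
    return fed, predicted


free_fed, free_pred = decode(max_len=4, vocab_size=5)
forced_fed, forced_pred = decode(max_len=4, vocab_size=5, target=[0, 0, 3, 3])
```

In the free-running case each input is the previous prediction, while in the teacher-forced case the inputs after the start symbol follow `target` regardless of what was predicted.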
================================================
FILE: ludwig/decoders/sequence_tagger.py
================================================
import logging
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import HIDDEN, LOGITS, PREDICTIONS, PROBABILITIES, SEQUENCE, TEXT
from ludwig.decoders.base import Decoder
from ludwig.decoders.registry import register_decoder
from ludwig.modules.attention_modules import MultiHeadSelfAttention
from ludwig.schema.decoders.sequence_decoders import SequenceTaggerDecoderConfig
from ludwig.utils.torch_utils import Dense
logger = logging.getLogger(__name__)
@DeveloperAPI
@register_decoder("tagger", [SEQUENCE, TEXT])
class SequenceTaggerDecoder(Decoder):
def __init__(
self,
input_size: int,
vocab_size: int,
max_sequence_length: int,
use_attention: bool = False,
use_bias: bool = True,
attention_embedding_size: int = 256,
attention_num_heads: int = 8,
decoder_config=None,
**kwargs,
):
super().__init__()
self.config = decoder_config
self.vocab_size = vocab_size
self.max_sequence_length = max_sequence_length
self.input_size = input_size
self.use_attention = use_attention
if use_attention:
logger.debug(" MultiHeadSelfAttention")
self.self_attention = MultiHeadSelfAttention(
input_size=input_size, hidden_size=attention_embedding_size, num_heads=attention_num_heads
)
# Adjust the input size to the final projection layer.
input_size = self.self_attention.output_shape[0]
self.projection_layer = Dense(input_size=input_size, output_size=vocab_size, use_bias=use_bias)
def forward(self, inputs: dict[str, torch.Tensor], target: torch.Tensor = None) -> dict[str, torch.Tensor]:
"""Decodes the inputs into a sequence.
Args:
inputs: Dictionary of tensors from the outputs of the combiner and other output features.
target: Tensor [batch_size, max_sequence_length] with predictions.
Returns:
Dictionary of tensors with logits [batch_size, max_sequence_length, vocab_size].
"""
hidden = inputs[HIDDEN]
if len(hidden.size()) != 3:
raise ValueError(
f"Decoder inputs rank is {len(hidden.size())}, but should be 3: "
+ "[batch_size x max_sequence_length x hidden_size] when using a tagger sequential decoder. "
+ "Consider setting reduce_output to None if a sequential encoder / combiner is used."
)
if list(hidden.shape[1:]) != [self.max_sequence_length, self.input_size]:
raise ValueError(
"Sequence tagger decoder inputs (hidden) should be [batch_size, self.max_sequence_length, "
+ f"input_size], or [batch_size, {self.max_sequence_length}, {self.input_size}]. However, the "
+ f"inputs (hidden) was instead: {list(hidden.size())}. "
+ "The encoder is not length preserving. Please check its configuration."
)
if self.use_attention:
hidden = self.self_attention(hidden)
logits = self.projection_layer(hidden)
return {LOGITS: logits}
def get_prediction_set(self):
return {LOGITS, PROBABILITIES, PREDICTIONS}
@staticmethod
def get_schema_cls():
return SequenceTaggerDecoderConfig
@property
def input_shape(self):
# Dummy implementation.
return torch.Size([1])
@property
def output_shape(self):
return torch.Size([self.max_sequence_length, self.vocab_size])
================================================
FILE: ludwig/decoders/utils.py
================================================
import torch
from torch import Tensor


def extract_generated_tokens(
    raw_generated_output_sequences: list[Tensor],
    input_lengths: list[int],
    max_new_tokens: int,
    pad_sequence: bool,
) -> list[Tensor]:
    """Extracts the generated tokens from the raw output sequences of the language model.

    Args:
        raw_generated_output_sequences: The raw output sequences of the language model.
            Represented as a list to handle variable length sequences.
        input_lengths: The length of the inputs to the language model.
        max_new_tokens: The maximum number of new tokens that were generated. Used to
            pad the generated sequences to the max_new_tokens.
        pad_sequence: Whether to pad the generated sequences to the max_new_tokens.

    Returns:
        The generated tokens.
    """
    if len(raw_generated_output_sequences) != len(input_lengths):
        raise ValueError(
            f"The number of raw_generated_output_sequences ({len(raw_generated_output_sequences)}) "
            f"must be the same as the number of input_lengths ({len(input_lengths)})."
        )
    generated_outputs = []
    for idx, input_length in enumerate(input_lengths):
        # Remove the input sequence from the generated sequence
        generated_sequence = raw_generated_output_sequences[idx][input_length:]
        # Pad the sequence if it is shorter than the max_new_tokens for downstream metric computation
        if pad_sequence and generated_sequence.size()[0] < max_new_tokens:
            generated_sequence = torch.nn.functional.pad(
                generated_sequence, (0, max_new_tokens - generated_sequence.size()[0]), "constant", 0
            )
        generated_outputs.append(generated_sequence)
    return generated_outputs
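The prompt-stripping and right-padding behaviour of `extract_generated_tokens` is easy to check on small inputs. The following sketch reimplements the same steps on plain Python lists rather than tensors; `strip_prompt_and_pad` and its `pad_id` parameter are hypothetical names, with `pad_id=0` chosen to match the constant used above.

```python
def strip_prompt_and_pad(sequences, input_lengths, max_new_tokens, pad=True, pad_id=0):
    """List-based analogue: drop the prompt tokens, then optionally right-pad."""
    if len(sequences) != len(input_lengths):
        raise ValueError("sequences and input_lengths must have the same length")
    outputs = []
    for seq, n_input in zip(sequences, input_lengths):
        generated = seq[n_input:]  # everything after the prompt
        if pad and len(generated) < max_new_tokens:
            generated = generated + [pad_id] * (max_new_tokens - len(generated))
        outputs.append(generated)
    return outputs


# Two sequences with prompts of length 3 and 2 respectively.
raw = [[11, 12, 13, 101, 102], [21, 22, 201]]
padded = strip_prompt_and_pad(raw, input_lengths=[3, 2], max_new_tokens=3)
unpadded = strip_prompt_and_pad(raw, input_lengths=[3, 2], max_new_tokens=3, pad=False)
```

Padding makes every output the same length so the results can be stacked, which is why the text decoder earlier pads while `CategoryExtractorDecoder` passes `pad_sequence=False` and decodes each sequence on its own.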
================================================
FILE: ludwig/distributed/__init__.py
================================================
from typing import Any

from ludwig.distributed.base import DistributedStrategy, LocalStrategy


def load_ddp():
    from ludwig.distributed.ddp import DDPStrategy

    return DDPStrategy


def load_fsdp():
    from ludwig.distributed.fsdp import FSDPStrategy

    return FSDPStrategy


def load_deepspeed():
    from ludwig.distributed.deepspeed import DeepSpeedStrategy

    return DeepSpeedStrategy


def load_local():
    return LocalStrategy


STRATEGIES = {
    "ddp": load_ddp,
    "fsdp": load_fsdp,
    "deepspeed": load_deepspeed,
    "local": load_local,
}

_current_strategy: DistributedStrategy = None


def init_dist_strategy(strategy: str | dict[str, Any], **kwargs) -> DistributedStrategy:
    global _current_strategy
    if isinstance(strategy, dict):
        dtype = strategy.pop("type", None)
        obj = get_dist_strategy(dtype)(**strategy)
    else:
        obj = get_dist_strategy(strategy)(**kwargs)
    _current_strategy = obj
    return obj


def get_current_dist_strategy() -> DistributedStrategy:
    if _current_strategy is None:
        raise RuntimeError("Distributed strategy not initialized")
    return _current_strategy


def get_dist_strategy(strategy: str | dict[str, Any]) -> type[DistributedStrategy]:
    name = strategy
    if isinstance(strategy, dict):
        name = strategy["type"]
    return STRATEGIES[name]()


def get_default_strategy_name() -> str:
    return "ddp"
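The `STRATEGIES` table stores loader functions rather than classes, so a backend module is imported only when its strategy is requested. The sketch below shows the same lazy-loading pattern with standard-library modules standing in for the strategy modules; `LOADERS`, `get_cls`, and `build` are hypothetical names. Unlike `init_dist_strategy`, which calls `strategy.pop("type", None)` and therefore mutates the caller's dict, this sketch copies the parameters instead; that is a design choice of the example, not of the code above.

```python
import importlib

# Hypothetical registry: each value is a zero-argument loader, so the module
# behind it is imported only when that entry is actually requested.
LOADERS = {
    "json": lambda: importlib.import_module("json").JSONDecoder,
    "csv": lambda: importlib.import_module("csv").DictReader,
}


def get_cls(spec):
    """Accepts either a name or a dict with a "type" key, as get_dist_strategy does."""
    name = spec["type"] if isinstance(spec, dict) else spec
    return LOADERS[name]()


def build(spec, **kwargs):
    """Builds an instance; dict specs supply constructor kwargs besides "type"."""
    if isinstance(spec, dict):
        params = {k: v for k, v in spec.items() if k != "type"}
        return get_cls(spec["type"])(**params)
    return get_cls(spec)(**kwargs)


spec = {"type": "json", "strict": False}
decoder = build(spec)
```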
================================================
FILE: ludwig/distributed/base.py
================================================
from __future__ import annotations
import contextlib
from abc import ABC, abstractmethod
from collections.abc import Callable
from typing import Any, TYPE_CHECKING
import torch
from torch import nn
from torch.optim import Optimizer
from ludwig.modules.optimization_modules import create_optimizer
from ludwig.utils.torch_utils import get_torch_device
if TYPE_CHECKING:
from ray.train.backend import BackendConfig
from ray.train.data_parallel_trainer import DataParallelTrainer
from ludwig.models.base import BaseModel
from ludwig.modules.lr_scheduler import LRScheduler
from ludwig.schema.trainer import ECDTrainerConfig
from ludwig.utils.checkpoint_utils import Checkpoint
class DistributedStrategy(ABC):
"""Interface that wraps a distributed training framework (DDP, FSDP, DeepSpeed).
Distributed strategies modify the model and/or optimizer to coordinate gradient updates among multiple workers
running in parallel. In most cases, these use collective communication libraries to pass messages between
processes.
"""
@abstractmethod
def prepare(
self,
model: nn.Module,
trainer_config: ECDTrainerConfig,
base_learning_rate: float,
) -> tuple[nn.Module, Optimizer]:
"""Modifies the model to support distributed training and creates the optimizer.
Args:
model: The model to wrap for distributed training.
trainer_config: The trainer configuration, which includes optimizer params.
base_learning_rate: The base learning rate to init the optimizer, which may be scaled by the strategy.
Returns:
A tuple of the wrapped model and the optimizer.
"""
def prepare_for_inference(self, model: nn.Module) -> nn.Module:
return model
def to_device(self, model: BaseModel, device: torch.device | None = None) -> nn.Module:
return model.to_device(device if device is not None else get_torch_device())
def backward(self, loss: torch.Tensor, model: nn.Module):
loss.backward()
def step(self, optimizer: Optimizer, *args, **kwargs):
optimizer.step(*args, **kwargs)
def zero_grad(self, optimizer: Optimizer):
optimizer.zero_grad()
def set_batch_size(self, model: nn.Module, batch_size: int):
pass
@abstractmethod
def size(self) -> int:
pass
@abstractmethod
def rank(self) -> int:
pass
@abstractmethod
def local_size(self) -> int:
pass
@abstractmethod
def local_rank(self) -> int:
pass
def is_coordinator(self) -> bool:
return self.rank() == 0
@abstractmethod
def barrier(self):
pass
@abstractmethod
def allreduce(self, t: torch.Tensor) -> torch.Tensor:
pass
@abstractmethod
def broadcast(self, t: torch.Tensor) -> torch.Tensor:
pass
@abstractmethod
def sync_model(self, model: nn.Module):
pass
@abstractmethod
def sync_optimizer(self, optimizer: Optimizer):
pass
@abstractmethod
def broadcast_object(self, v: Any, name: str | None = None) -> Any:
pass
@abstractmethod
def wait_optimizer_synced(self, optimizer: Optimizer):
pass
@abstractmethod
@contextlib.contextmanager
def prepare_model_update(self, model: nn.Module, should_step: bool):
pass
@abstractmethod
@contextlib.contextmanager
def prepare_optimizer_update(self, optimizer: Optimizer):
pass
@classmethod
@abstractmethod
def is_available(cls) -> bool:
pass
@classmethod
@abstractmethod
def gather_all_tensors_fn(cls) -> Callable | None:
pass
@classmethod
@abstractmethod
def get_ray_trainer_backend(cls, **kwargs) -> Any | None:
pass
@classmethod
@abstractmethod
def get_trainer_cls(cls, backend_config: BackendConfig) -> tuple[type[DataParallelTrainer], dict[str, Any]]:
pass
@abstractmethod
def shutdown(self):
pass
def return_first(self, fn: Callable) -> Callable:
"""Wraps function so results are only returned by the first (coordinator) rank.
The purpose of this function is to reduce network overhead.
"""
def wrapped(*args, **kwargs):
res = fn(*args, **kwargs)
return res if self.rank() == 0 else None
return wrapped
def allow_gradient_accumulation(self) -> bool:
return True
def allow_mixed_precision(self) -> bool:
return True
def allow_clip_gradients(self) -> bool:
return True
def prepare_before_load(self) -> bool:
"""True if we need to call `prepare` again before loading a checkpoint."""
return False
@classmethod
def is_model_parallel(cls) -> bool:
return False
def create_checkpoint_handle(
self,
dist_model: nn.Module,
model: nn.Module,
optimizer: Optimizer | None = None,
scheduler: LRScheduler | None = None,
) -> Checkpoint:
from ludwig.utils.checkpoint_utils import MultiNodeCheckpoint
return MultiNodeCheckpoint(self, model, optimizer, scheduler)
@classmethod
def extract_model_for_serialization(cls, model: nn.Module) -> nn.Module | tuple[nn.Module, list[dict]]:
return model
@classmethod
def replace_model_from_serialization(cls, state: nn.Module | tuple[nn.Module, list[dict]]) -> nn.Module:
assert isinstance(state, nn.Module)
return state
class LocalStrategy(DistributedStrategy):
def prepare(
self,
model: nn.Module,
trainer_config: ECDTrainerConfig,
base_learning_rate: float,
) -> tuple[nn.Module, Optimizer]:
return model, create_optimizer(model, trainer_config.optimizer, base_learning_rate)
def size(self) -> int:
return 1
def rank(self) -> int:
return 0
def local_size(self) -> int:
return 0
def local_rank(self) -> int:
return 0
def barrier(self):
pass
def allreduce(self, t: torch.Tensor) -> torch.Tensor:
return t
def broadcast(self, t: torch.Tensor) -> torch.Tensor:
return t
def sync_model(self, model: nn.Module):
pass
def sync_optimizer(self, optimizer: Optimizer):
pass
def broadcast_object(self, v: Any, name: str | None = None) -> Any:
return v
def wait_optimizer_synced(self, optimizer: Optimizer):
pass
@contextlib.contextmanager
def prepare_model_update(self, model: nn.Module, should_step: bool):
yield
@contextlib.contextmanager
def prepare_optimizer_update(self, optimizer: Optimizer):
yield
@classmethod
def is_available(cls) -> bool:
# While this strategy is always an option, it is not "distributed" which is the meaning of availability
# in this context.
return False
@classmethod
def gather_all_tensors_fn(cls) -> Callable | None:
return None
@classmethod
def get_ray_trainer_backend(cls, **kwargs) -> Any | None:
return None
@classmethod
def get_trainer_cls(cls, backend_config: BackendConfig) -> tuple[type[DataParallelTrainer], dict[str, Any]]:
raise ValueError("Cannot construct a trainer from a local strategy.")
def shutdown(self):
pass
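The `return_first` helper and `is_coordinator` depend only on `rank()`, so their behaviour can be shown with a trimmed-down strategy. In the sketch below, `ToyStrategy` and `FixedRank` are hypothetical classes that keep just the rank-related parts of the interface; they are not part of Ludwig.

```python
from abc import ABC, abstractmethod


class ToyStrategy(ABC):
    """Hypothetical, trimmed-down strategy with only rank-related behaviour."""

    @abstractmethod
    def rank(self) -> int: ...

    def is_coordinator(self) -> bool:
        return self.rank() == 0

    def return_first(self, fn):
        def wrapped(*args, **kwargs):
            res = fn(*args, **kwargs)  # every rank still runs fn
            return res if self.rank() == 0 else None  # only rank 0 keeps the result

        return wrapped


class FixedRank(ToyStrategy):
    def __init__(self, rank: int):
        self._rank = rank

    def rank(self) -> int:
        return self._rank


calls = []


def evaluate(x):
    calls.append(x)
    return x * 2


coordinator_result = FixedRank(0).return_first(evaluate)(21)
worker_result = FixedRank(1).return_first(evaluate)(21)
```

Note that the wrapped function still executes on every rank; only the returned value is suppressed on non-coordinator ranks.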
================================================
FILE: ludwig/distributed/ddp.py
================================================
import contextlib
import logging
import os
import socket
from collections.abc import Callable
from typing import Any, Optional, TYPE_CHECKING, Union
import torch
import torch.distributed as dist
from ray.train.backend import BackendConfig
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray.train.torch import TorchTrainer
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import Optimizer
from torchmetrics.utilities.distributed import gather_all_tensors
from ludwig.distributed.base import DistributedStrategy
from ludwig.modules.optimization_modules import create_optimizer
from ludwig.utils.torch_utils import get_torch_device
if TYPE_CHECKING:
from ludwig.models.base import BaseModel
from ludwig.modules.lr_scheduler import LRScheduler
from ludwig.schema.trainer import ECDTrainerConfig
from ludwig.utils.checkpoint_utils import Checkpoint
class DDPStrategy(DistributedStrategy):
def __init__(self):
self._local_rank, self._local_size = local_rank_and_size()
self._log_on_init()
def _log_on_init(self):
logging.info("Using DDP strategy")
def prepare(
self,
model: nn.Module,
trainer_config: "ECDTrainerConfig",
base_learning_rate: float,
) -> tuple[nn.Module, Optimizer]:
return DDP(model), create_optimizer(model, trainer_config.optimizer, base_learning_rate)
def size(self) -> int:
return dist.get_world_size()
def rank(self) -> int:
return dist.get_rank()
def local_size(self) -> int:
return self._local_size
def local_rank(self) -> int:
return self._local_rank
def barrier(self):
return dist.barrier()
def allreduce(self, t: torch.Tensor) -> torch.Tensor:
dist.all_reduce(t)
return t
def broadcast(self, t: torch.Tensor) -> torch.Tensor:
dist.broadcast(t, src=0)  # broadcast from rank 0, the same default source used by broadcast_object_list
return t
def sync_model(self, model: nn.Module):
# TODO(travis): open question if this is needed to ensure all workers using same weights
pass
def sync_optimizer(self, optimizer: Optimizer):
# TODO(travis): open question if this is needed to ensure all workers using same optimizer state
pass
def broadcast_object(self, v: Any, name: str | None = None) -> Any:
output = [v]
dist.broadcast_object_list(output)
return output[0]
def wait_optimizer_synced(self, optimizer: Optimizer):
pass
@contextlib.contextmanager
def prepare_model_update(self, model: nn.Module, should_step: bool):
if should_step:
yield
else:
# Prevents DDP from syncing gradients during accumulation step
with model.no_sync():
yield
@contextlib.contextmanager
def prepare_optimizer_update(self, optimizer: Optimizer):
yield
@classmethod
def is_available(cls) -> bool:
return dist.is_available() and dist.is_initialized()
@classmethod
def gather_all_tensors_fn(cls) -> Callable | None:
return gather_all_tensors
@classmethod
def get_ray_trainer_backend(cls, **kwargs) -> Any | None:
from ray.train.torch import TorchConfig
return TorchConfig()
@classmethod
def get_trainer_cls(cls, backend_config: BackendConfig) -> tuple[type[DataParallelTrainer], dict[str, Any]]:
return TorchTrainer, dict(torch_config=backend_config)
def shutdown(self):
# TODO(travis): currently Ray handles this for us, but is subject to hangs if one of the workers raises an
# exception and the other makes a collective op. We should figure out a way to make this safe to call
# multiple times. It looks like there is a fix we can make use of when we upgrade to Ray 2.1:
# https://discuss.ray.io/t/torchtrainer-hangs-when-only-1-worker-raises-error/7447/11
# dist.destroy_process_group()
pass
def create_checkpoint_handle(
self,
dist_model: nn.Module,
model: nn.Module,
optimizer: Optimizer | None = None,
scheduler: Optional["LRScheduler"] = None,
) -> "Checkpoint":
from ludwig.utils.checkpoint_utils import MultiNodeCheckpoint
return MultiNodeCheckpoint(self, model, optimizer, scheduler)
def to_device(self, model: Union["BaseModel", DDP], device: torch.device | None = None) -> nn.Module:
try:
return model.to_device(device if device is not None else get_torch_device())
except AttributeError:
# Model is already wrapped in DistributedDataParallel, so it has already been moved to device
return model
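The `prepare_model_update` context manager above skips gradient synchronization on accumulation steps and only lets DDP all-reduce on the step that actually updates the weights. The following is a minimal sketch of that pattern, not Ludwig code: `StubDDPModel` is a hypothetical stand-in for a DDP-wrapped module (it only records when a sync would happen), and `accumulation_steps = 4` is an assumed value.

```python
import contextlib


class StubDDPModel:
    """Hypothetical stand-in for a DDP-wrapped module; records which steps would sync."""

    def __init__(self):
        self.sync_enabled = True
        self.synced_steps = []

    @contextlib.contextmanager
    def no_sync(self):
        # Mirrors the shape of DDP's no_sync(): disable syncing inside the block only.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self, step):
        # A real DDP module would all-reduce gradients here when syncing is enabled.
        if self.sync_enabled:
            self.synced_steps.append(step)


@contextlib.contextmanager
def prepare_model_update(model, should_step):
    if should_step:
        yield
    else:
        # Accumulation step: keep gradients local.
        with model.no_sync():
            yield


model = StubDDPModel()
accumulation_steps = 4  # assumed value for illustration
for step in range(8):
    should_step = (step + 1) % accumulation_steps == 0
    with prepare_model_update(model, should_step):
        model.backward(step)

print(model.synced_steps)
```

With these assumed settings, only every fourth backward pass runs with synchronization enabled, which is the communication saving the context manager is there to provide.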
def local_rank_and_size() -> tuple[int, int]:
# DeepSpeed CLI and other tools may set these environment variables for us.
local_rank, local_size = os.environ.get("LOCAL_RANK"), os.environ.get("LOCAL_SIZE")
if local_rank is not None and local_size is not None:
return int(local_rank), int(local_size)
# Gather the rank and hostnames from every worker so we can count up how many belong to the same host, which
# constitutes the local group.
rank = dist.get_rank()
host = socket.gethostname()
output = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(output, (rank, host))
# Every time we find a worker with the same host, we increment the size counter.
# The local rank is determined by the world rank relative to the other workers on the same host, so every time
# we see a worker on our host with a lower rank, we increment the rank counter.
local_size = 0
local_rank = 0
for other_rank, other_host in output:
if other_host == host:
local_size += 1
if other_rank < rank:
local_rank += 1
return local_rank, local_size
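The counting logic in `local_rank_and_size` can be exercised without a process group. This is a minimal sketch that assumes the `(rank, host)` pairs have already been gathered; the helper name `local_rank_and_size_from_pairs` and the host names are hypothetical and not part of Ludwig.

```python
def local_rank_and_size_from_pairs(rank, host, pairs):
    """Return (local_rank, local_size) for the worker identified by (rank, host)."""
    local_rank = 0
    local_size = 0
    for other_rank, other_host in pairs:
        if other_host == host:
            # Every worker on our host counts towards the local group size.
            local_size += 1
            if other_rank < rank:
                # Workers on our host with a lower world rank come before us locally.
                local_rank += 1
    return local_rank, local_size


# Hypothetical cluster: ranks 0 and 2 run on node-a, ranks 1 and 3 on node-b.
pairs = [(0, "node-a"), (1, "node-b"), (2, "node-a"), (3, "node-b")]
print(local_rank_and_size_from_pairs(2, "node-a", pairs))
```

Because the local rank is derived from the ordering of world ranks on the same host, it does not depend on the order in which the pairs were gathered.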
================================================
FILE: ludwig/distributed/deepspeed.py
================================================
import logging
import os
import warnings
from collections.abc import Mapping
from typing import Any, Optional, TYPE_CHECKING
import deepspeed
import deepspeed.comm
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
from packaging import version
from torch import nn
from torch.optim.optimizer import Optimizer
from ludwig.constants import MIN_POSSIBLE_BATCH_SIZE
from ludwig.distributed.ddp import DDPStrategy
from ludwig.modules.optimization_modules import get_optimizer_class_and_kwargs
from ludwig.utils.checkpoint_utils import Checkpoint
from ludwig.utils.model_utils import extract_tensors, replace_tensors
_deepspeed_0101 = version.parse(deepspeed.__version__) >= version.parse("0.10.1")
if TYPE_CHECKING:
from ludwig.modules.lr_scheduler import LRScheduler
from ludwig.schema.trainer import ECDTrainerConfig
DEFAULT_ZERO_OPTIMIZATION = {
"stage": "auto",
"stage3_gather_16bit_weights_on_model_save": "auto",
"offload_optimizer": {"device": "auto"},
"offload_param": {"device": "auto"},
}
# Filter out warnings about DeepSpeed use of deprecated methods. Can remove on upgrade to DeepSpeed 0.9.
warnings.filterwarnings(
action="ignore",
category=UserWarning,
module="torch.distributed.distributed_c10d",
)
class DeepSpeedStrategy(DDPStrategy):
def __init__(
self,
zero_optimization: dict[str, Any] | None = None,
fp16: dict[str, Any] | None = None,
bf16: dict[str, Any] | None = None,
compression_training: dict[str, Any] | None = None,
**kwargs
):
# If we're initializing from a `deepspeed` CLI command, deepspeed will have already been initialized, as
# indicated by the presence of the LOCAL_RANK var. Otherwise, we're initializing from Ray / torchrun, and will
# need to set this var ourselves, then init DeepSpeed here.
local_rank, local_size = os.environ.get("LOCAL_RANK"), os.environ.get("LOCAL_SIZE")
init_deepspeed = local_rank is None or local_size is None
super().__init__(**kwargs)
self.zero_optimization = zero_optimization or DEFAULT_ZERO_OPTIMIZATION
self.fp16 = fp16
self.bf16 = bf16
self.compression_training = compression_training
if init_deepspeed:
os.environ["LOCAL_RANK"] = str(self.local_rank())
os.environ["LOCAL_SIZE"] = str(self.local_size())
os.environ["RANK"] = str(self.rank())
os.environ["WORLD_SIZE"] = str(self.size())
deepspeed.init_distributed()
def _log_on_init(self):
logging.info("Using DeepSpeed strategy")
def prepare(
self,
model: nn.Module,
trainer_config: "ECDTrainerConfig",
base_learning_rate: float,
) -> tuple[nn.Module, Optimizer]:
# If `batch_size=auto`, use MIN_POSSIBLE_BATCH_SIZE as a temporary placeholder until auto-tuning adjusts it.
# Any value works here, because auto-tuning overrides it.
batch_size = (
trainer_config.batch_size if isinstance(trainer_config.batch_size, int) else MIN_POSSIBLE_BATCH_SIZE
)
# DeepSpeed does not support paged or 8-bit optimizers, only optimizers supported by torch.optim.Optimizer.
# See: https://www.deepspeed.ai/docs/config-json/#optimizer-parameters
if trainer_config.optimizer.is_paged or trainer_config.optimizer.is_8bit:
raise ValueError("Cannot use a paged or 8-bit optimizer with DeepSpeed.")
optimizer_cls, optimizer_kwargs = get_optimizer_class_and_kwargs(trainer_config.optimizer, base_learning_rate)
ds_config = {
"amp": {
"enabled": trainer_config.use_mixed_precision,
},
"optimizer": {"type": optimizer_cls.__name__, "params": optimizer_kwargs},
"zero_optimization": self.zero_optimization,
"gradient_clipping": trainer_config.gradient_clipping.clipglobalnorm,
"train_micro_batch_size_per_gpu": batch_size,
"gradient_accumulation_steps": trainer_config.gradient_accumulation_steps,
"steps_per_print": trainer_config.steps_per_checkpoint or 10000,
}
# DeepSpeed doesn't like passing these params as None values
if self.fp16 is not None:
ds_config["fp16"] = self.fp16
if self.bf16 is not None:
ds_config["bf16"] = self.bf16
if self.compression_training is not None:
ds_config["compression_training"] = self.compression_training
model_engine, optimizer, _, _ = deepspeed.initialize(
model=model,
model_parameters=model.parameters(),
lr_scheduler=None, # Don't let DeepSpeed manage the learning rate scheduler
config=ds_config,
dist_init_required=False,
)
if hasattr(optimizer, "optimizer"):
# Zero-3 wraps the optimizer
optimizer = optimizer.optimizer
return model_engine, optimizer
def prepare_for_inference(self, model: nn.Module) -> nn.Module:
ds_config = {}
model_engine = deepspeed.init_inference(model=model, config=ds_config)
return model_engine
def to_device(self, model: nn.Module, device: torch.device | None = None) -> nn.Module:
return model
def backward(self, loss: torch.Tensor, model: nn.Module):
# See: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/deepspeed.py
# runs backpropagation and handles mixed precision
model.backward(loss)
# Deepspeed's `engine.step` performs the following operations:
# - gradient accumulation check
# - gradient clipping
# - optimizer step
# - zero grad
# - checking overflow
# - lr_scheduler step (only if engine.lr_scheduler is not None)
model.step()
# Accelerate's DeepSpeed wrapper (see link above) replaces the separate optimizer step and zero-grad calls with
# no-ops under DeepSpeed while leaving them intact otherwise, so one training loop works across training regimes.
# `step` and `zero_grad` below are no-ops for the same reason.
def step(self, optimizer: Optimizer, *args, **kwargs):
# Handled by `self.backward(loss)`
pass
def zero_grad(self, optimizer: Optimizer):
# Handled by `self.backward(loss)`
pass
def set_batch_size(self, model: nn.Module, batch_size: int):
# Adapted from:
# https://github.com/microsoft/DeepSpeed/blob/7ce371b139521b1ebbf052f0496b1a16397c1d19/deepspeed/runtime/engine.py#L422 # noqa: E501
model._config.micro_batch_size_per_gpu = batch_size
model._config.train_batch_size = batch_size * self.size() * model._config.gradient_accumulation_steps
def barrier(self):
deepspeed.comm.barrier()
def allow_gradient_accumulation(self) -> bool:
"""DeepSpeed handles gradient accumulation internally."""
return False
def allow_mixed_precision(self) -> bool:
"""DeepSpeed handles mixed precision internally."""
return False
def allow_clip_gradients(self) -> bool:
"""DeepSpeed handles gradient clipping internally."""
return False
def prepare_before_load(self) -> bool:
"""DeepSpeed requires the engine to be re-initialized before loading.
https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#loading-training-checkpoints
"""
return True
@classmethod
def is_model_parallel(cls) -> bool:
return True
def create_checkpoint_handle(
self,
dist_model: nn.Module,
model: nn.Module,
optimizer: Optimizer | None = None,
scheduler: Optional["LRScheduler"] = None,
) -> Checkpoint:
return DeepSpeedCheckpoint(self, dist_model, optimizer, scheduler)
@classmethod
def extract_model_for_serialization(cls, model: nn.Module) -> nn.Module | tuple[nn.Module, list[dict]]:
return extract_tensors(model)
@classmethod
def replace_model_from_serialization(cls, state: nn.Module | tuple[nn.Module, list[dict]]) -> nn.Module:
assert isinstance(state, tuple)
model, model_weights = state
replace_tensors(model, model_weights, torch.device("cpu"))
return model
class DeepSpeedCheckpoint(Checkpoint):
def prepare(self, directory: str):
if self.distributed.local_rank() == 0:
# Checkpoints need to be written on every rank, but the directory only needs to be created once per node.
super().prepare(directory)
def load(self, save_path: str, device: torch.device | None = None) -> bool:
"""Load a checkpoint.
For DeepSpeed, we need every worker to independently load back the model weights, as the checkpoints themselves
may be sharded (when using DeepSpeed Zero3).
https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#loading-training-checkpoints
"""
# NOTE(geoffrey): `load_module_strict=False` because this code path is frequently used to load models trained
# using adapter-based fine-tuning, where the checkpoints only contain the adapter weights, and not the full
# model weights. This may lead to silent, unexpected behavior for resuming full model fine-tuning,
# where all the model weights *must* be loaded in.
# TODO(geoffrey): Add a boolean arg to function to control load_module_strict behavior.
_, client_state = self.model.load_checkpoint(
save_path, load_lr_scheduler_states=False, load_module_strict=False
)
self.global_step = self._get_global_step(client_state, save_path)
if self.scheduler is not None and "scheduler_state" in client_state:
self.scheduler.load_state_dict(client_state["scheduler_state"])
return True
def save(self, save_path: str, global_step: int):
client_state = {
"global_step": global_step,
}
if self.scheduler is not None:
client_state["scheduler_state"] = self.scheduler.state_dict()
kwargs = {}
if _deepspeed_0101:
kwargs["exclude_frozen_parameters"] = True
self.model.save_checkpoint(save_path, client_state=client_state, **kwargs)
def get_state_for_inference(self, save_path: str, device: torch.device | None = None) -> Mapping[str, Any]:
if self.model.zero_optimization_stage() == 3:
return get_fp32_state_dict_from_zero_checkpoint(save_path)
self.model.load_checkpoint(
save_path, load_optimizer_states=False, load_lr_scheduler_states=False, load_module_only=True
)
return self.model.module.cpu().state_dict()
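`DeepSpeedStrategy.set_batch_size` in this file keeps the engine's global batch size consistent with the per-GPU micro batch size: the global size is the micro batch size times the number of workers times the gradient accumulation steps. A small worked sketch of that arithmetic follows; the function name `effective_train_batch_size` and the numbers are illustrative assumptions, not part of Ludwig or DeepSpeed.

```python
def effective_train_batch_size(micro_batch_size, world_size, gradient_accumulation_steps):
    """Global number of samples consumed per optimizer step."""
    return micro_batch_size * world_size * gradient_accumulation_steps


# Assumed setup: 8 samples per GPU, 4 workers, 2 accumulation steps.
print(effective_train_batch_size(8, 4, 2))
```

Changing only the micro batch size without recomputing the global value would leave the two settings inconsistent, which is why the method updates both together.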
================================================
FILE: ludwig/distributed/fsdp.py
================================================
import logging
from typing import TYPE_CHECKING
import torch
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.optim import Optimizer
from ludwig.distributed.ddp import DDPStrategy
from ludwig.modules.optimization_modules import create_optimizer
if TYPE_CHECKING:
from ludwig.schema.trainer import ECDTrainerConfig
class FSDPStrategy(DDPStrategy):
def _log_on_init(self):
logging.info("Using FSDP strategy")
def prepare(
self,
model: nn.Module,
trainer_config: "ECDTrainerConfig",
base_learning_rate: float,
) -> tuple[nn.Module, Optimizer]:
return FSDP(model), create_optimizer(model, trainer_config.optimizer, base_learning_rate)
def to_device(self, model: nn.Module, device: torch.device | None = None) -> nn.Module:
return model
@classmethod
def is_model_parallel(cls) -> bool:
return True
================================================
FILE: ludwig/encoders/__init__.py
================================================
# register all encoders
import ludwig.encoders.bag_encoders
import ludwig.encoders.category_encoders
import ludwig.encoders.date_encoders
import ludwig.encoders.generic_encoders
import ludwig.encoders.h3_encoders
import ludwig.encoders.image
import ludwig.encoders.sequence_encoders
import ludwig.encoders.set_encoders
import ludwig.encoders.text_encoders # noqa
================================================
FILE: ludwig/encoders/bag_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from typing import Any
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BAG, ENCODER_OUTPUT
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.embedding_modules import EmbedWeighted
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.schema.encoders.bag_encoders import BagEmbedWeightedConfig
from ludwig.schema.encoders.base import BaseEncoderConfig
logger = logging.getLogger(__name__)
@DeveloperAPI
@register_encoder("embed", BAG)
class BagEmbedWeightedEncoder(Encoder):
def __init__(
self,
vocab: list[str],
embedding_size: int = 50,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embeddings_on_cpu: bool = False,
fc_layers=None,
num_fc_layers: int = 0,
output_size: int = 10,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict[str, Any] | None = None,
activation: str = "relu",
dropout: float = 0.0,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
# Keep the vocabulary so that `input_shape` can report its length.
self.vocab = vocab
logger.debug(f" {self.name}")
logger.debug(" EmbedWeighted")
self.embed_weighted = EmbedWeighted(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.embed_weighted.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return BagEmbedWeightedConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([len(self.vocab)])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x vocab size], type torch.int32
:return: embeddings of shape [batch x embed size], type torch.float32
"""
hidden = self.embed_weighted(inputs)
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
================================================
FILE: ludwig/encoders/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from abc import ABC, abstractmethod
from torch import nn
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.torch_utils import LudwigModule
@DeveloperAPI
class Encoder(LudwigModule, ABC):
@abstractmethod
def forward(self, inputs, training=None, mask=None):
raise NotImplementedError
def get_embedding_layer(self) -> nn.Module:
"""Returns layer that embeds inputs, used for computing explanations.
Captum adds an evaluation hook to this module returned by this function. The hook copies the module's return
with .clone(). The module returned by this function must return a tensor, not a dictionary of tensors.
"""
return next(self.children())
@property
def name(self) -> str:
return self.__class__.__name__
================================================
FILE: ludwig/encoders/category_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from torch import nn
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import CATEGORY, ENCODER_OUTPUT
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.embedding_modules import Embed
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.category_encoders import (
CategoricalEmbedConfig,
CategoricalOneHotEncoderConfig,
CategoricalPassthroughEncoderConfig,
CategoricalSparseConfig,
)
logger = logging.getLogger(__name__)
@DeveloperAPI
@register_encoder("passthrough", [CATEGORY])
class CategoricalPassthroughEncoder(Encoder):
def __init__(self, input_size=1, encoder_config=None, **kwargs):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.input_size = input_size
self.identity = nn.Identity()
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x 1]
"""
return {"encoder_output": self.identity(inputs.float())}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return CategoricalPassthroughEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return self.input_shape
def get_embedding_layer(self) -> nn.Module:
return self.identity
@DeveloperAPI
@register_encoder("dense", CATEGORY)
class CategoricalEmbedEncoder(Encoder):
def __init__(
self,
vocab: list[str],
embedding_size: int = 50,
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
embeddings_on_cpu: bool = False,
dropout: float = 0.0,
embedding_initializer: str | dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
logger.debug(" Embed")
self.embed = Embed(
vocab=vocab,
embedding_size=embedding_size,
representation="dense",
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=embedding_initializer,
)
self.embedding_size = self.embed.embedding_size
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x 1], type torch.int32
:return: embeddings of shape [batch x embed size], type torch.float32
"""
embedded = self.embed(inputs)
return {ENCODER_OUTPUT: embedded}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return CategoricalEmbedConfig
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.embedding_size])
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@DeveloperAPI
@register_encoder("sparse", CATEGORY)
class CategoricalSparseEncoder(Encoder):
def __init__(
self,
vocab: list[str],
embeddings_trainable: bool = False,
pretrained_embeddings: str | None = None,
embeddings_on_cpu: bool = False,
dropout: float = 0.0,
embedding_initializer: str | dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
logger.debug(" Embed")
self.embed = Embed(
vocab=vocab,
embedding_size=len(vocab),
representation="sparse",
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=embedding_initializer,
)
self.embedding_size = self.embed.embedding_size
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x 1], type torch.int32
:return: embeddings of shape [batch x embed size], type torch.float32
"""
embedded = self.embed(inputs)
return {ENCODER_OUTPUT: embedded}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return CategoricalSparseConfig
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.embedding_size])
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@DeveloperAPI
@register_encoder("onehot", [CATEGORY])
class CategoricalOneHotEncoder(Encoder):
def __init__(
self,
vocab: list[str],
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.vocab_size = len(vocab)
self.identity = nn.Identity()
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch, 1] or [batch]
"""
t = inputs.reshape(-1).long()
# the output of this must be a float so that it can be concatenated with other
# encoder outputs and passed to dense layers in the combiner, decoder, etc.
outputs = self.identity(torch.nn.functional.one_hot(t, num_classes=self.vocab_size).float())
return {"encoder_output": outputs}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return CategoricalOneHotEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.vocab_size])
def get_embedding_layer(self) -> nn.Module:
return self.identity
================================================
FILE: ludwig/encoders/date_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DATE, ENCODER_OUTPUT
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.embedding_modules import Embed
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.date_encoders import DateEmbedConfig, DateWaveConfig
from ludwig.utils import torch_utils
logger = logging.getLogger(__name__)
# Year, month, day, weekday, yearday, hour, minute, seconds, second_of_day.
# TODO: Share this constant with date_feature.DATE_VECTOR_SIZE.
DATE_INPUT_SIZE = 9
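The nine slots listed in the comment above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the actual Ludwig preprocessing: the helper name `date_to_vector` is hypothetical, and it assumes month, day and yearday are 1-based while weekday is 0-based (Monday = 0), which is what the index shifts in `DateEmbed.forward` suggest.

```python
from datetime import datetime

DATE_INPUT_SIZE = 9


def date_to_vector(dt):
    """Build [year, month, day, weekday, yearday, hour, minute, second, second_of_day]."""
    second_of_day = dt.hour * 3600 + dt.minute * 60 + dt.second
    return [
        dt.year,
        dt.month,  # 1..12
        dt.day,  # 1..31
        dt.weekday(),  # 0 = Monday .. 6 = Sunday
        dt.timetuple().tm_yday,  # 1..366
        dt.hour,
        dt.minute,
        dt.second,
        second_of_day,  # 0..86399
    ]


vector = date_to_vector(datetime(2024, 3, 1, 13, 5, 9))
assert len(vector) == DATE_INPUT_SIZE
print(vector)
```

Under these assumptions, the 1-based fields need a shift by one before they can index an embedding table, while the 0-based fields can be used directly.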
@DeveloperAPI
@register_encoder("embed", DATE)
class DateEmbed(Encoder):
def __init__(
self,
embedding_size: int = 10,
embeddings_on_cpu: bool = False,
fc_layers: list[dict] | None = None,
num_fc_layers: int = 0,
output_size: int = 10,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict | None = None,
activation: str = "relu",
dropout: float = 0,
encoder_config=None,
**kwargs,
):
"""
:param embedding_size: The maximum embedding size, the actual size
will be `min(vocabulary_size, embedding_size)` for `dense`
representations and exactly `vocabulary_size` for the `sparse`
encoding, where `vocabulary_size` is the number of different
strings appearing in the training set in the column the feature
is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_on_cpu: by default embeddings matrices are stored
on GPU memory if a GPU is used, as it allows for faster access,
but in some cases the embedding matrix may be really big and
this parameter forces the placement of the embedding matrix in
regular memory and the CPU is used to resolve them, slightly
slowing down the process as a result of data transfer between
CPU and GPU memory.
:param fc_layers: list of dictionaries containing the parameters of
all the fully connected layers.
:type fc_layers: List
:param num_fc_layers: Number of stacked fully connected layers.
:type num_fc_layers: Integer
:param output_size: Size of each layer.
:type output_size: Integer
:param use_bias: Whether to use a bias vector.
:type use_bias: bool
:param weights_initializer: Initializer for the weights (aka kernel)
matrix.
:type weights_initializer: string
:param bias_initializer: Initializer for the bias vector.
:type bias_initializer: string
:param norm: Type of normalization to use, either 'batch' or 'layer'.
:type norm: string, default None
:param norm_params: parameters to pass to normalization function.
:type norm_params: dictionary
:param activation: Activation function to use.
:type activation: string
:param dropout: determines if there should be a dropout layer before
returning the encoder output.
:type dropout: float
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
logger.debug(" year FCStack")
self.year_fc = FCStack(
first_layer_input_size=1,
num_layers=1,
default_output_size=1,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=None,
default_norm_params=None,
default_activation=None,
default_dropout=dropout,
)
logger.debug(" month Embed")
self.embed_month = Embed(
[str(i) for i in range(12)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" day Embed")
self.embed_day = Embed(
[str(i) for i in range(31)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" weekday Embed")
self.embed_weekday = Embed(
[str(i) for i in range(7)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" yearday Embed")
self.embed_yearday = Embed(
[str(i) for i in range(366)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" hour Embed")
self.embed_hour = Embed(
[str(i) for i in range(24)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" minute Embed")
self.embed_minute = Embed(
[str(i) for i in range(60)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" second Embed")
self.embed_second = Embed(
[str(i) for i in range(60)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
# Summed sizes of all of the embeddings.
fc_layer_input_size = (
self.year_fc.output_shape[0]
+ self.embed_month.output_shape[0]
+ self.embed_day.output_shape[0]
+ self.embed_weekday.output_shape[0]
+ self.embed_yearday.output_shape[0]
+ self.embed_hour.output_shape[0]
+ self.embed_minute.output_shape[0]
+ self.embed_second.output_shape[0]
+ 1 # for periodic_second_of_day.
)
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=fc_layer_input_size,
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The input vector fed into the encoder.
Shape: [batch x DATE_INPUT_SIZE], type torch.int8
:type inputs: Tensor
"""
# ================ Embeddings ================
input_vector = inputs.type(torch.int)
scaled_year = self.year_fc(input_vector[:, 0:1].type(torch.float))
embedded_month = self.embed_month(input_vector[:, 1:2] - 1)
embedded_day = self.embed_day(input_vector[:, 2:3] - 1)
embedded_weekday = self.embed_weekday(input_vector[:, 3:4])
embedded_yearday = self.embed_yearday(input_vector[:, 4:5] - 1)
embedded_hour = self.embed_hour(input_vector[:, 5:6])
embedded_minute = self.embed_minute(input_vector[:, 6:7])
embedded_second = self.embed_second(input_vector[:, 7:8])
periodic_second_of_day = torch_utils.periodic(input_vector[:, 8:9].type(torch.float), 86400)
hidden = torch.cat(
[
scaled_year,
embedded_month,
embedded_day,
embedded_weekday,
embedded_yearday,
embedded_hour,
embedded_minute,
embedded_second,
periodic_second_of_day,
],
dim=1,
)
# ================ FC Stack ================
# logger.debug(' flatten hidden: {0}'.format(hidden))
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return DateEmbedConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([DATE_INPUT_SIZE])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
@DeveloperAPI
@register_encoder("wave", DATE)
class DateWave(Encoder):
def __init__(
self,
fc_layers: list[dict] | None = None,
num_fc_layers: int = 1,
output_size: int = 10,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict | None = None,
activation: str = "relu",
dropout: float = 0,
encoder_config=None,
**kwargs,
):
"""
:param fc_layers: list of dictionaries containing the parameters of
all the fully connected layers.
:type fc_layers: List
:param num_fc_layers: Number of stacked fully connected layers.
:type num_fc_layers: Integer
:param output_size: Size of each layer.
:type output_size: Integer
:param use_bias: bool determines whether to use a bias vector.
:type use_bias: bool
:param weights_initializer: Initializer for the weights (aka kernel)
matrix.
:type weights_initializer: string
:param bias_initializer: Initializer for the bias vector.
:type bias_initializer: string
:param norm: type of normalization to use, either 'batch' or 'layer'.
:type norm: string, default None
:param norm_params: parameters to pass to normalization function.
:type norm_params: dictionary
:param activation: Activation function to use.
:type activation: string
:param dropout: determines if there should be a dropout layer before
returning the encoder output.
:type dropout: float
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
logger.debug(" year FCStack")
self.year_fc = FCStack(
first_layer_input_size=1,
num_layers=1,
default_output_size=1,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=None,
default_norm_params=None,
default_activation=None,
default_dropout=dropout,
)
# Summed sizes of all of the embeddings.
# Additional 8 for periodic_[month, day, ..., second_of_day].
fc_layer_input_size = self.year_fc.output_shape[0] + 8
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=fc_layer_input_size,
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The input vector fed into the encoder.
Shape: [batch x DATE_INPUT_SIZE], type torch.int8
:type inputs: Tensor
"""
# ================ Embeddings ================
input_vector = inputs.type(torch.float)
scaled_year = self.year_fc(input_vector[:, 0:1])
periodic_month = torch_utils.periodic(input_vector[:, 1:2], 12)
periodic_day = torch_utils.periodic(input_vector[:, 2:3], 31)
periodic_weekday = torch_utils.periodic(input_vector[:, 3:4], 7)
periodic_yearday = torch_utils.periodic(input_vector[:, 4:5], 366)
periodic_hour = torch_utils.periodic(input_vector[:, 5:6], 24)
periodic_minute = torch_utils.periodic(input_vector[:, 6:7], 60)
periodic_second = torch_utils.periodic(input_vector[:, 7:8], 60)
periodic_second_of_day = torch_utils.periodic(input_vector[:, 8:9], 86400)
hidden = torch.cat(
[
scaled_year,
periodic_month,
periodic_day,
periodic_weekday,
periodic_yearday,
periodic_hour,
periodic_minute,
periodic_second,
periodic_second_of_day,
],
dim=1,
)
# ================ FC Stack ================
# logger.debug(' flatten hidden: {0}'.format(hidden))
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return DateWaveConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([DATE_INPUT_SIZE])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
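# Illustrative sketch (not part of the module): how the DateWave inputs map to
# periodic features. `periodic` below is a hypothetical stand-in for
# `torch_utils.periodic`; the cosine form is an assumption, and the year is
# passed through unscaled instead of going through the learned `year_fc` layer.

```python
import math


def periodic(value: float, period: float) -> float:
    """Map a cyclic value onto a cosine wave (assumed form of the helper)."""
    return math.cos(2.0 * math.pi * value / period)


# Periods applied by DateWave.forward to positions 1..8 of the date vector:
# month, day, weekday, yearday, hour, minute, second, second_of_day.
PERIODS = [12, 31, 7, 366, 24, 60, 60, 86400]


def wave_features(date_vector: list[int]) -> list[float]:
    """Return the raw year followed by the eight periodic features."""
    year, *components = date_vector
    return [float(year)] + [periodic(c, p) for c, p in zip(components, PERIODS)]


# Hypothetical date vector: 2024-06-15, weekday 5, yearday 167, 12:30:00.
features = wave_features([2024, 6, 15, 5, 167, 12, 30, 0, 45000])
```

# The result has 1 + 8 entries, matching `fc_layer_input_size` in DateWave.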
================================================
FILE: ludwig/encoders/generic_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, ENCODER_OUTPUT, NUMBER, TEXT, TIMESERIES, VECTOR
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.schema.encoders.base import BaseEncoderConfig, DenseEncoderConfig, PassthroughEncoderConfig
logger = logging.getLogger(__name__)
@DeveloperAPI
@register_encoder("passthrough", [BINARY, NUMBER, TEXT, VECTOR])
class PassthroughEncoder(Encoder):
def __init__(self, input_size=1, encoder_config=None, **kwargs):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.input_size = input_size
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x input_size], type torch.float32
"""
return {ENCODER_OUTPUT: inputs}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return PassthroughEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return self.input_shape
@DeveloperAPI
@register_encoder("dense", [BINARY, NUMBER, VECTOR, TIMESERIES])
class DenseEncoder(Encoder):
def __init__(
self,
input_size,
fc_layers=None,
num_layers=1,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
activation="relu",
dropout=0,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.input_size = input_size
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=input_size,
layers=fc_layers,
num_layers=num_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x input_size], type torch.float32
"""
return {ENCODER_OUTPUT: self.fc_stack(inputs)}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return DenseEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.fc_stack.layers[-1]["output_size"]])
================================================
FILE: ludwig/encoders/h3_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, ENCODER_OUTPUT_STATE, H3
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.embedding_modules import Embed, EmbedSequence
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.modules.initializer_modules import get_initializer
from ludwig.modules.recurrent_modules import RecurrentStack
from ludwig.modules.reduction_modules import SequenceReducer
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.h3_encoders import H3EmbedConfig, H3RNNConfig, H3WeightedSumConfig
from ludwig.utils import torch_utils
logger = logging.getLogger(__name__)
# TODO: Share this with h3_feature.H3_VECTOR_LENGTH
H3_INPUT_SIZE = 19
@DeveloperAPI
@register_encoder("embed", H3)
class H3Embed(Encoder):
def __init__(
self,
embedding_size: int = 10,
embeddings_on_cpu: bool = False,
fc_layers: list | None = None,
num_fc_layers: int = 0,
output_size: int = 10,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict | None = None,
activation: str = "relu",
dropout: float = 0,
reduce_output: str = "sum",
encoder_config=None,
**kwargs,
):
"""
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for
`<UNK>`).
:type embedding_size: Integer
:param embeddings_on_cpu: by default embeddings matrices are stored
on GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param dropout: determines if there should be a dropout layer before
returning the encoder output.
:type dropout: float
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.embedding_size = embedding_size
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
logger.debug(" mode Embed")
self.embed_mode = Embed(
[str(i) for i in range(3)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
force_embedding_size=True,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" edge Embed")
self.embed_edge = Embed(
[str(i) for i in range(7)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
force_embedding_size=True,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" resolution Embed")
self.embed_resolution = Embed(
[str(i) for i in range(16)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
force_embedding_size=True,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" base cell Embed")
self.embed_base_cell = Embed(
[str(i) for i in range(122)],
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
force_embedding_size=True,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" cells Embed")
self.embed_cells = EmbedSequence(
[str(i) for i in range(8)],
embedding_size,
max_sequence_length=(H3_INPUT_SIZE - 4),
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
force_embedding_size=True,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=embedding_size,
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The input vector fed into the encoder.
Shape: [batch x H3_INPUT_SIZE], type torch.int8
:type inputs: Tensor
"""
input_vector = inputs.int()
# ================ Embeddings ================
embedded_mode = self.embed_mode(input_vector[:, 0:1]).unsqueeze(1)
embedded_edge = self.embed_edge(input_vector[:, 1:2]).unsqueeze(1)
embedded_resolution = self.embed_resolution(input_vector[:, 2:3]).unsqueeze(1)
embedded_base_cell = self.embed_base_cell(input_vector[:, 3:4]).unsqueeze(1)
embedded_cells = self.embed_cells(input_vector[:, 4:])
# ================ Masking ================
# Mask out cells beyond the resolution of interest.
resolution = input_vector[:, 2]
mask = torch.unsqueeze(torch_utils.sequence_mask(resolution, 15), dim=-1).float()
# Batch size X 15(max resolution) X embedding size
masked_embedded_cells = embedded_cells * mask
# ================ Reduce ================
# Batch size X H3_INPUT_SIZE X embedding size
concatenated = torch.cat(
[embedded_mode, embedded_edge, embedded_resolution, embedded_base_cell, masked_embedded_cells], dim=1
)
hidden = self.reduce_sequence(concatenated)
# ================ FC Stack ================
# logger.debug(' flatten hidden: {0}'.format(hidden))
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return H3EmbedConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([H3_INPUT_SIZE])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
@DeveloperAPI
@register_encoder("weighted_sum", H3)
class H3WeightedSum(Encoder):
def __init__(
self,
embedding_size: int = 10,
embeddings_on_cpu: bool = False,
should_softmax: bool = False,
fc_layers: list | None = None,
num_fc_layers: int = 0,
output_size: int = 10,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict | None = None,
activation: str = "relu",
dropout: float = 0,
encoder_config=None,
**kwargs,
):
"""
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for
`<UNK>`).
:type embedding_size: Integer
:param embeddings_on_cpu: by default embeddings matrices are stored
on GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param dropout: determines if there should be a dropout layer before
returning the encoder output.
:type dropout: float
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.should_softmax = should_softmax
self.sum_sequence_reducer = SequenceReducer(reduce_mode="sum")
self.h3_embed = H3Embed(
embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
weights_initializer=weights_initializer,
bias_initializer=bias_initializer,
reduce_output="None",
)
self.register_buffer(
"aggregation_weights", torch.Tensor(get_initializer(weights_initializer)([H3_INPUT_SIZE, 1]))
)
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=self.h3_embed.output_shape[0],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The input vector fed into the encoder.
Shape: [batch x H3_INPUT_SIZE], type torch.int8
:type inputs: Tensor
"""
# ================ Embeddings ================
input_vector = inputs
embedded_h3 = self.h3_embed(input_vector)
# ================ Weighted Sum ================
if self.should_softmax:
# Normalize across the H3_INPUT_SIZE positions of the [H3_INPUT_SIZE, 1] buffer.
weights = torch.softmax(self.aggregation_weights, dim=0)
else:
weights = self.aggregation_weights
hidden = self.sum_sequence_reducer(embedded_h3[ENCODER_OUTPUT] * weights)
# ================ FC Stack ================
# logger.debug(' flatten hidden: {0}'.format(hidden))
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return H3WeightedSumConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([H3_INPUT_SIZE])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
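# Illustrative sketch (not part of the module): why the softmax axis matters for
# a column of aggregation weights. The 3-row column is a hypothetical stand-in
# for the [H3_INPUT_SIZE, 1] buffer; which axis the encoder intends is an
# assumption here.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax over a flat list."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# A [3, 1] column of weights.
column = [[0.5], [1.5], [-1.0]]

# Normalizing across rows (axis 0) yields one distribution over positions.
over_rows = softmax([row[0] for row in column])

# Normalizing within each row (axis 1) sees a single element per row,
# so every weight collapses to 1.0 and the weighting has no effect.
within_rows = [softmax(row)[0] for row in column]
```

# Only the axis-0 variant produces weights that sum to 1 across positions.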
@DeveloperAPI
@register_encoder("rnn", H3)
class H3RNN(Encoder):
def __init__(
self,
embedding_size: int = 10,
embeddings_on_cpu: bool = False,
num_layers: int = 1,
hidden_size: int = 10,
cell_type: str = "rnn",
bidirectional: bool = False,
activation: str = "tanh",
recurrent_activation: str = "sigmoid",
use_bias: bool = True,
unit_forget_bias: bool = True,
weights_initializer: str = "xavier_uniform",
recurrent_initializer: str = "orthogonal",
bias_initializer: str = "zeros",
dropout: float = 0.0,
recurrent_dropout: float = 0.0,
reduce_output: str = "last",
encoder_config=None,
**kwargs,
):
"""
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for
`<UNK>`).
:type embedding_size: Integer
:param embeddings_on_cpu: by default embeddings matrices are stored
on GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param num_layers: the number of stacked recurrent layers.
:type num_layers: Integer
:param cell_type: the type of recurrent cell to use.
Available values are: `rnn`, `lstm`, `lstm_block`, `lstm`,
`ln`, `lstm_cudnn`, `gru`, `gru_block`, `gru_cudnn`.
For reference about the differences between the cells please
refer to PyTorch's documentation. We suggest using the
`block` variants on CPU and the `cudnn` variants on GPU
because of their increased speed.
:type cell_type: str
:param hidden_size: the size of the state of the rnn.
:type hidden_size: Integer
:param bidirectional: if `True` two recurrent networks will perform
encoding in the forward and backward direction and
their outputs will be concatenated.
:type bidirectional: Boolean
:param activation: Activation function to use.
:type activation: string
:param recurrent_activation: Activation function to use for the
recurrent step.
:type recurrent_activation: string
:param use_bias: bool determines whether to use a bias vector
:type use_bias: bool
:param unit_forget_bias: if True, add 1 to the bias of the forget gate at
initialization.
:type unit_forget_bias: bool
:param weights_initializer: Initializer for the weights (aka kernel)
matrix
:type weights_initializer: string
:param recurrent_initializer: Initializer for the recurrent weights
matrix
:type recurrent_initializer: string
:param bias_initializer: Initializer for the bias vector
:type bias_initializer: string
:param dropout: determines if there should be a dropout layer before
returning the encoder output.
:type dropout: float
:param recurrent_dropout: Dropout rate for the RNN encoder of the H3 embeddings.
:type recurrent_dropout: float
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.embedding_size = embedding_size
self.h3_embed = H3Embed(
embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
weights_initializer=weights_initializer,
bias_initializer=bias_initializer,
reduce_output="None",
)
logger.debug(" RecurrentStack")
self.recurrent_stack = RecurrentStack(
input_size=self.h3_embed.output_shape[0],
max_sequence_length=H3_INPUT_SIZE,
hidden_size=hidden_size,
cell_type=cell_type,
num_layers=num_layers,
bidirectional=bidirectional,
use_bias=use_bias,
dropout=recurrent_dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The input vector fed into the encoder.
Shape: [batch x H3_INPUT_SIZE], type torch.int8
:type inputs: Tensor
"""
# ================ Embeddings ================
embedded_h3 = self.h3_embed(inputs)
# ================ RNN ================
hidden, final_state = self.recurrent_stack(embedded_h3[ENCODER_OUTPUT])
return {ENCODER_OUTPUT: hidden, ENCODER_OUTPUT_STATE: final_state}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return H3RNNConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([H3_INPUT_SIZE])
@property
def output_shape(self) -> torch.Size:
return self.recurrent_stack.output_shape
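# Illustrative sketch (not part of the module): the masking step in
# H3Embed.forward, which zeroes cell embeddings beyond the index resolution
# before reduction. `sequence_mask` is a hypothetical stand-in for
# `torch_utils.sequence_mask`; keeping position j when j < length is an assumption.

```python
def sequence_mask(lengths: list[int], maxlen: int) -> list[list[float]]:
    """Return one row per length with 1.0 for kept positions and 0.0 otherwise."""
    return [[1.0 if j < n else 0.0 for j in range(maxlen)] for n in lengths]


def masked_sum(cells: list[list[float]], resolution: int, maxlen: int = 15) -> list[float]:
    """Sum the cell embeddings of a single example up to `resolution`."""
    mask = sequence_mask([resolution], maxlen)[0]
    size = len(cells[0])
    return [sum(m * cell[k] for m, cell in zip(mask, cells)) for k in range(size)]


# 15 cell positions, each with a hypothetical 2-dimensional embedding.
cells = [[1.0, 2.0] for _ in range(15)]
reduced = masked_sum(cells, resolution=4)
```

# With resolution 4, only the first four cell embeddings contribute to the sum.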
================================================
FILE: ludwig/encoders/image/__init__.py
================================================
import ludwig.encoders.image.base
import ludwig.encoders.image.timm # noqa
import ludwig.encoders.image.torchvision # noqa
================================================
FILE: ludwig/encoders/image/base.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from typing import Any
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, ENCODER_OUTPUT_STATE, IMAGE
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.convolutional_modules import Conv2DStack, ResNet, UNetDownStack
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.modules.mlp_mixer_modules import MLPMixer
from ludwig.schema.encoders.image.base import (
ImageEncoderConfig,
MLPMixerConfig,
ResNetConfig,
Stacked2DCNNConfig,
UNetEncoderConfig,
ViTConfig,
)
from ludwig.utils.torch_utils import FreezeModule
logger = logging.getLogger(__name__)
@DeveloperAPI
class ImageEncoder(Encoder):
pass
@DeveloperAPI
@register_encoder("stacked_cnn", IMAGE)
class Stacked2DCNN(ImageEncoder):
def __init__(
self,
height: int,
width: int,
conv_layers: list[dict] | None = None,
num_conv_layers: int | None = None,
num_channels: int | None = None,
out_channels: int = 32,
kernel_size: int | tuple[int] = 3,
stride: int | tuple[int] = 1,
padding: int | tuple[int] | str = "valid",
dilation: int | tuple[int] = 1,
conv_use_bias: bool = True,
padding_mode: str = "zeros",
conv_norm: str | None = None,
conv_norm_params: dict[str, Any] | None = None,
conv_activation: str = "relu",
conv_dropout: float = 0,
pool_function: str = "max",
pool_kernel_size: int | tuple[int] = 2,
pool_stride: int | tuple[int] | None = None,
pool_padding: int | tuple[int] = 0,
pool_dilation: int | tuple[int] = 1,
groups: int = 1,
fc_layers: list[dict] | None = None,
num_fc_layers: int | None = 1,
output_size: int = 128,
fc_use_bias: bool = True,
fc_weights_initializer: str = "xavier_uniform",
fc_bias_initializer: str = "zeros",
fc_norm: str | None = None,
fc_norm_params: dict[str, Any] | None = None,
fc_activation: str = "relu",
fc_dropout: float = 0,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
# map parameter input feature config names to internal names
img_height = height
img_width = width
first_in_channels = num_channels
# Validate before recording the input shape; the user-facing parameter is num_channels.
if first_in_channels is None:
raise ValueError("num_channels must not be None.")
self._input_shape = (first_in_channels, img_height, img_width)
logger.debug(" Conv2DStack")
self.conv_stack_2d = Conv2DStack(
img_height=img_height,
img_width=img_width,
layers=conv_layers,
num_layers=num_conv_layers,
first_in_channels=first_in_channels,
default_out_channels=out_channels,
default_kernel_size=kernel_size,
default_stride=stride,
default_padding=padding,
default_dilation=dilation,
default_groups=groups,
default_use_bias=conv_use_bias,
default_padding_mode=padding_mode,
default_norm=conv_norm,
default_norm_params=conv_norm_params,
default_activation=conv_activation,
default_dropout=conv_dropout,
default_pool_function=pool_function,
default_pool_kernel_size=pool_kernel_size,
default_pool_stride=pool_stride,
default_pool_padding=pool_padding,
default_pool_dilation=pool_dilation,
)
out_channels, img_height, img_width = self.conv_stack_2d.output_shape
first_fc_layer_input_size = out_channels * img_height * img_width
self.flatten = torch.nn.Flatten()
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=first_fc_layer_input_size,
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=fc_use_bias,
default_weights_initializer=fc_weights_initializer,
default_bias_initializer=fc_bias_initializer,
default_norm=fc_norm,
default_norm_params=fc_norm_params,
default_activation=fc_activation,
default_dropout=fc_dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
:param inputs: The inputs fed into the encoder.
Shape: [batch x channels x height x width], type torch.uint8
"""
hidden = self.conv_stack_2d(inputs)
hidden = self.flatten(hidden)
outputs = self.fc_stack(hidden)
return {ENCODER_OUTPUT: outputs}
@staticmethod
def get_schema_cls() -> type[ImageEncoderConfig]:
return Stacked2DCNNConfig
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
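# Illustrative sketch (not part of the module): how `first_fc_layer_input_size`
# in Stacked2DCNN follows from the image size. It assumes a single conv + pool
# layer that follows the standard PyTorch output-size formula, with "valid"
# padding (0) and a pool stride equal to the pool kernel size; the real
# Conv2DStack may stack more layers.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0, dilation: int = 1) -> int:
    """Output length of one spatial dimension after a conv or pooling window."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def flattened_size(height: int, width: int, out_channels: int = 32,
                   kernel: int = 3, pool_kernel: int = 2) -> int:
    """Size of the flattened feature map fed to the first FC layer."""
    h = conv_out(conv_out(height, kernel), pool_kernel, stride=pool_kernel)
    w = conv_out(conv_out(width, kernel), pool_kernel, stride=pool_kernel)
    return out_channels * h * w


# A hypothetical 28x28 input: conv gives 26x26, pooling gives 13x13.
fc_input = flattened_size(28, 28)
```

# Here fc_input is 32 * 13 * 13, the product `out_channels * img_height * img_width`.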
@DeveloperAPI
@register_encoder("_resnet_legacy", IMAGE)
class ResNetEncoder(ImageEncoder):
def __init__(
self,
height: int,
width: int,
resnet_size: int = 50,
num_channels: int = 3,
out_channels: int = 16,
kernel_size: int | tuple[int] = 3,
conv_stride: int | tuple[int] = 1,
first_pool_kernel_size: int | tuple[int] | None = None,
first_pool_stride: int | tuple[int] | None = None,
batch_norm_momentum: float = 0.1,
batch_norm_epsilon: float = 0.001,
fc_layers: list[dict] | None = None,
num_fc_layers: int | None = 1,
output_size: int = 256,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict[str, Any] | None = None,
activation: str = "relu",
dropout: float = 0,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
# map parameter input feature config names to internal names
img_height = height
img_width = width
first_in_channels = num_channels
self._input_shape = (first_in_channels, img_height, img_width)
logger.debug(" ResNet")
self.resnet = ResNet(
img_height=img_height,
img_width=img_width,
first_in_channels=first_in_channels,
out_channels=out_channels,
resnet_size=resnet_size,
kernel_size=kernel_size,
conv_stride=conv_stride,
first_pool_kernel_size=first_pool_kernel_size,
first_pool_stride=first_pool_stride,
batch_norm_momentum=batch_norm_momentum,
batch_norm_epsilon=batch_norm_epsilon,
)
first_fc_layer_input_size = self.resnet.output_shape[0]
logger.debug(" FCStack")
self.fc_stack = FCStack(
first_layer_input_size=first_fc_layer_input_size,
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
hidden = self.resnet(inputs)
axes = [2, 3]
hidden = torch.mean(hidden, axes)
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[ImageEncoderConfig]:
return ResNetConfig
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
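# Illustrative sketch (not part of Ludwig): ResNetEncoder.forward averages the ResNet feature map over
# axes 2 and 3 (height and width) before the fully connected stack, which turns an [N, C, H, W] tensor
# into an [N, C] one. The sketch below reproduces that reduction on plain nested lists so it runs without
# torch; `global_average_pool` is a name invented for this example.

```python
def global_average_pool(batch: list) -> list:
    """Average each channel over its height and width.

    `batch` is a nested list shaped [N][C][H][W]; the result is shaped [N][C].
    """
    pooled = []
    for sample in batch:
        channel_means = []
        for channel in sample:
            total = sum(sum(row) for row in channel)
            count = len(channel) * len(channel[0])
            channel_means.append(total / count)
        pooled.append(channel_means)
    return pooled


# One sample, two channels, 2x2 spatial grid.
feature_map = [[[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]]
print(global_average_pool(feature_map))  # [[2.5, 2.0]]
```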
@DeveloperAPI
@register_encoder("mlp_mixer", IMAGE)
class MLPMixerEncoder(ImageEncoder):
def __init__(
self,
height: int,
width: int,
num_channels: int | None = None,
patch_size: int = 16,
embed_size: int = 512,
token_size: int = 2048,
channel_dim: int = 256,
num_layers: int = 8,
dropout: float = 0.0,
avg_pool: bool = True,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
# map parameter input feature config names to internal names
img_height = height
img_width = width
in_channels = num_channels
if num_channels is None:
raise RuntimeError("num_channels must not be None")
self._input_shape = (in_channels, img_height, img_width)
logger.debug(" MLPMixer")
self.mlp_mixer = MLPMixer(
img_height=img_height,
img_width=img_width,
in_channels=in_channels,
patch_size=patch_size,
embed_size=embed_size,
token_size=token_size,
channel_dim=channel_dim,
num_layers=num_layers,
dropout=dropout,
avg_pool=avg_pool,
)
self._output_shape = self.mlp_mixer.output_shape
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
hidden = self.mlp_mixer(inputs)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[ImageEncoderConfig]:
return MLPMixerConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
@property
def output_shape(self) -> torch.Size:
return self._output_shape
@DeveloperAPI
@register_encoder("_vit_legacy", IMAGE)
class ViTEncoder(ImageEncoder):
def __init__(
self,
height: int,
width: int,
num_channels: int = 3,
use_pretrained: bool = True,
pretrained_model: str = "google/vit-base-patch16-224",
saved_weights_in_checkpoint: bool = False,
hidden_size: int = 768,
num_hidden_layers: int = 12,
num_attention_heads: int = 12,
intermediate_size: int = 3072,
hidden_act: str = "gelu",
hidden_dropout_prob: float = 0.1,
attention_probs_dropout_prob: float = 0.1,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-12,
gradient_checkpointing: bool = False,
patch_size: int = 16,
trainable: bool = True,
encoder_config=None,
**kwargs,
):
"""Creates a ViT encoder using transformers.ViTModel.
use_pretrained: If True, loads the pretrained transformer named by the
pretrained_model argument.
pretrained_model: Path to a pretrained model or the id of a model on
huggingface.co. When pretrained weights are loaded, the architecture
arguments (hidden_size, num_hidden_layers, ...) are ignored.
"""
super().__init__()
self.config = encoder_config
try:
from transformers import ViTConfig, ViTModel
except ModuleNotFoundError:
raise RuntimeError(
" transformers is not installed. "
"In order to install all image feature dependencies run "
"pip install ludwig[image]"
)
# map parameter input feature config names to internal names
img_height = height
img_width = width
in_channels = num_channels
img_width = img_width or img_height
if img_width != img_height:
raise ValueError("img_height and img_width should be identical.")
self._input_shape = (in_channels, img_height, img_width)
config_dict: dict
if use_pretrained and not saved_weights_in_checkpoint:
config_dict = {
"pretrained_model_name_or_path": pretrained_model,
}
transformer = ViTModel.from_pretrained(**config_dict)
else:
config_dict = {
"image_size": img_height,
"num_channels": in_channels,
"patch_size": patch_size,
"hidden_size": hidden_size,
"num_hidden_layers": num_hidden_layers,
"num_attention_heads": num_attention_heads,
"intermediate_size": intermediate_size,
"hidden_act": hidden_act,
"hidden_dropout_prob": hidden_dropout_prob,
"attention_probs_dropout_prob": attention_probs_dropout_prob,
"initializer_range": initializer_range,
"layer_norm_eps": layer_norm_eps,
"gradient_checkpointing": gradient_checkpointing,
}
config = ViTConfig(**config_dict)
transformer = ViTModel(config)
self.transformer = FreezeModule(transformer, frozen=not trainable)
self._output_shape = (transformer.config.hidden_size,)
def forward(self, inputs: torch.Tensor, head_mask: torch.Tensor | None = None) -> EncoderOutputDict:
output = self.transformer.module(inputs, head_mask=head_mask)
return {ENCODER_OUTPUT: output.pooler_output}
@staticmethod
def get_schema_cls() -> type[ImageEncoderConfig]:
return ViTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@DeveloperAPI
@register_encoder("unet", IMAGE)
class UNetEncoder(ImageEncoder):
def __init__(
self,
height: int,
width: int,
num_channels: int = 3,
conv_norm: str | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
if height % 16 or width % 16:
raise ValueError(f"Invalid `height` {height} or `width` {width} for unet encoder")
self.unet = UNetDownStack(
img_height=height,
img_width=width,
in_channels=num_channels,
norm=conv_norm,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
hidden, skips = self.unet(inputs)
return {ENCODER_OUTPUT: hidden, ENCODER_OUTPUT_STATE: skips}
@staticmethod
def get_schema_cls() -> type[ImageEncoderConfig]:
return UNetEncoderConfig
@property
def output_shape(self) -> torch.Size:
return self.unet.output_shape
@property
def input_shape(self) -> torch.Size:
return self.unet.input_shape
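# Illustrative sketch (not part of Ludwig): one way to read the `height % 16 or width % 16` check in
# UNetEncoder. It assumes, hypothetically, that the down stack halves the spatial size four times, so
# both sides must be divisible by 2 ** 4 = 16. `unet_bottleneck_size` and `num_downsamples` are names
# invented for this example.

```python
def unet_bottleneck_size(height: int, width: int, num_downsamples: int = 4) -> tuple:
    """Return the spatial size after `num_downsamples` exact halvings."""
    factor = 2 ** num_downsamples  # 16 for the assumed four levels
    if height % factor or width % factor:
        raise ValueError(f"Invalid `height` {height} or `width` {width} for unet encoder")
    return height // factor, width // factor


print(unet_bottleneck_size(128, 256))  # (8, 16)
```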
================================================
FILE: ludwig/encoders/image/timm.py
================================================
import logging
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, IMAGE
from ludwig.encoders.image.base import ImageEncoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.image.timm import (
TimmCAFormerEncoderConfig,
TimmConvFormerEncoderConfig,
TimmEncoderConfig,
TimmPoolFormerEncoderConfig,
)
logger = logging.getLogger(__name__)
def _get_timm():
try:
import timm
except ImportError:
raise ImportError("timm is required for this encoder. Install it with: pip install timm")
return timm
@DeveloperAPI
@register_encoder("timm", IMAGE)
class TimmEncoder(ImageEncoder):
"""Wraps any model from the timm (pytorch-image-models) library as a Ludwig image encoder.
This provides access to hundreds of pretrained vision models including MetaFormer variants
(CAFormer, ConvFormer, PoolFormer), ConvNeXt V2, EfficientFormer, and many more.
Usage in Ludwig config:
encoder:
type: timm
model_name: caformer_s18.sail_in22k_ft_in1k
use_pretrained: true
trainable: true
"""
def __init__(
self,
model_name: str = "caformer_s18",
use_pretrained: bool = True,
trainable: bool = True,
saved_weights_in_checkpoint: bool = False,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
timm = _get_timm()
pretrained = use_pretrained and not saved_weights_in_checkpoint
if pretrained:
logger.info(f"Instantiating timm image encoder '{model_name}' with pretrained weights.")
else:
logger.info(f"Instantiating timm image encoder '{model_name}' without pretrained weights.")
# num_classes=0 removes the classification head, returning pooled features
self.model = timm.create_model(model_name, pretrained=pretrained, num_classes=0)
# Get the model's expected input config for input_shape
data_config = timm.data.resolve_model_data_config(self.model)
self._input_size = data_config["input_size"] # (C, H, W)
# Compute output dim by running a dummy forward
with torch.no_grad():
dummy = torch.zeros(1, *self._input_size)
out = self.model(dummy)
self._output_dim = out.shape[-1]
for p in self.model.parameters():
p.requires_grad_(trainable)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
return {ENCODER_OUTPUT: self.model(inputs)}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TimmEncoderConfig
@property
def output_shape(self) -> torch.Size:
return torch.Size([self._output_dim])
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_size)
@DeveloperAPI
@register_encoder("caformer", IMAGE)
class TimmCAFormerEncoder(TimmEncoder):
"""CAFormer encoder — hybrid Conv+Attention MetaFormer achieving SOTA accuracy on ImageNet.
Variants: s18 (26M, 83.6%), s36 (39M, 84.5%), m36 (56M, 85.2%), b36 (99M, 85.5%).
"""
def __init__(self, model_name: str = "caformer_s18", **kwargs):
super().__init__(model_name=model_name, **kwargs)
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TimmCAFormerEncoderConfig
@DeveloperAPI
@register_encoder("convformer", IMAGE)
class TimmConvFormerEncoder(TimmEncoder):
"""ConvFormer encoder — pure CNN MetaFormer that outperforms ConvNeXt."""
def __init__(self, model_name: str = "convformer_s18", **kwargs):
super().__init__(model_name=model_name, **kwargs)
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TimmConvFormerEncoderConfig
@DeveloperAPI
@register_encoder("poolformer", IMAGE)
class TimmPoolFormerEncoder(TimmEncoder):
"""PoolFormer encoder — MetaFormer using simple average pooling as token mixer."""
def __init__(self, model_name: str = "poolformerv2_s12", **kwargs):
super().__init__(model_name=model_name, **kwargs)
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TimmPoolFormerEncoderConfig
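# Illustrative sketch (not part of Ludwig): TimmEncoder only asks timm for pretrained weights when
# `use_pretrained` is set and `saved_weights_in_checkpoint` is not. A plausible reading is that a
# checkpoint which already stores the weights makes the download redundant. The helper name
# `should_load_pretrained` is invented for this example.

```python
def should_load_pretrained(use_pretrained: bool, saved_weights_in_checkpoint: bool) -> bool:
    # Mirrors `pretrained = use_pretrained and not saved_weights_in_checkpoint`.
    return use_pretrained and not saved_weights_in_checkpoint


# Print the full truth table of the two flags.
for use_pretrained in (True, False):
    for saved in (True, False):
        print(use_pretrained, saved, should_load_pretrained(use_pretrained, saved))
```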
================================================
FILE: ludwig/encoders/image/torchvision.py
================================================
import logging
import os
from abc import abstractmethod
import torch
import torchvision.models as tvm
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, IMAGE
from ludwig.encoders.image.base import ImageEncoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.image.torchvision import (
TVAlexNetEncoderConfig,
TVConvNeXtEncoderConfig,
TVDenseNetEncoderConfig,
TVEfficientNetEncoderConfig,
TVGoogLeNetEncoderConfig,
TVInceptionV3EncoderConfig,
TVMaxVitEncoderConfig,
TVMNASNetEncoderConfig,
TVMobileNetV2EncoderConfig,
TVMobileNetV3EncoderConfig,
TVRegNetEncoderConfig,
TVResNetEncoderConfig,
TVResNeXtEncoderConfig,
TVShuffleNetV2EncoderConfig,
TVSqueezeNetEncoderConfig,
TVSwinTransformerEncoderConfig,
TVVGGEncoderConfig,
TVViTEncoderConfig,
TVWideResNetEncoderConfig,
)
from ludwig.utils.image_utils import register_torchvision_model_variants, torchvision_model_registry, TVModelVariant
logger = logging.getLogger(__name__)
@DeveloperAPI
class TVBaseEncoder(ImageEncoder):
def __init__(
self,
model_variant: str | int | None = None,
use_pretrained: bool = True,
saved_weights_in_checkpoint: bool = False,
model_cache_dir: str | None = None,
trainable: bool = True,
**kwargs,
):
super().__init__()
logger.debug(f" {self.name}")
# map parameter input feature config names to internal names
self.model_variant = model_variant
self.use_pretrained = use_pretrained
self.model_cache_dir = model_cache_dir
# remove any Ludwig specific keyword parameters
kwargs.pop("encoder_config", None)
kwargs.pop("type", None)
kwargs.pop("skip", None)
# cache pre-trained models if requested
# based on https://github.com/pytorch/vision/issues/616#issuecomment-428637564
if self.model_cache_dir is not None:
os.environ["TORCH_HOME"] = self.model_cache_dir
# retrieve function to create requested model
self.create_model = torchvision_model_registry[self.torchvision_model_type][
self.model_variant
].create_model_function
# get weight specification
if use_pretrained and not saved_weights_in_checkpoint:
weights_specification = torchvision_model_registry[self.torchvision_model_type][
self.model_variant
].model_weights.DEFAULT
logger.info(
f"Instantiating torchvision image encoder '{self.torchvision_model_type}' with pretrained weights: "
f"{weights_specification}."
)
else:
weights_specification = None
if saved_weights_in_checkpoint:
logger.info(
f"Instantiating torchvision image encoder: '{self.torchvision_model_type}' "
"with weights saved in the checkpoint."
)
else:
logger.info(
f"Instantiating torchvision image encoder: '{self.torchvision_model_type}' "
"with no pretrained weights."
)
# get torchvision transforms object
transforms_obj = torchvision_model_registry[self.torchvision_model_type][
self.model_variant
].model_weights.DEFAULT.transforms()
# capture key attributes from torchvision transform for later use
self.num_channels = len(transforms_obj.mean)
self.normalize_mean = transforms_obj.mean
self.normalize_std = transforms_obj.std
self.crop_size = transforms_obj.crop_size
logger.debug(f" {self.torchvision_model_type}")
# create the model with pretrained weights, or with weights=None for an untrained model
self.model = self.create_model(weights=weights_specification, **kwargs)
# remove final classification layer
self._remove_softmax_layer()
# freeze parameters if requested
for p in self.model.parameters():
p.requires_grad_(trainable)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
return {ENCODER_OUTPUT: self.model(inputs)}
@abstractmethod
def _remove_softmax_layer(self):
"""Model specific method that allows the final softmax layer to be implemented in the Ludwig Decoder
component. The model specific implementation should change the final softmax layer in the torchvision
model architecture to torch.nn.Identity(). This allows the output tensor from the preceding layer to be
passed to the Ludwig Combiner and then to the Decoder.
Returns: None
"""
raise NotImplementedError()
@property
def output_shape(self) -> torch.Size:
# create synthetic image and run through forward method
inputs = torch.randn([1, *self.input_shape])
output = self.model(inputs)
return torch.Size(output.shape[1:])
@property
def input_shape(self) -> torch.Size:
# expected shape after all pre-processing
# len(transforms_obj.mean) determines the number of channels
# transforms_obj.crop_size determines the height and width of image
# [num_channels, height, width]
return torch.Size([self.num_channels, *(2 * self.crop_size)])
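# Illustrative sketch (not part of Ludwig): how `[self.num_channels, *(2 * self.crop_size)]` builds a
# [channels, height, width] shape. It assumes `crop_size` is a one-element list such as [224], in which
# case `2 * crop_size` repeats the list rather than doubling the number. `input_shape_from_crop` is a
# name invented for this example.

```python
def input_shape_from_crop(num_channels: int, crop_size: list) -> list:
    # List repetition: 2 * [224] == [224, 224], giving a square height and width.
    return [num_channels, *(2 * crop_size)]


print(input_shape_from_crop(3, [224]))  # [3, 224, 224]
```

# Under that assumption the result is always square. A two-element `crop_size` would produce four
# spatial entries, so the expression relies on the one-element form.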
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant(variant_id="base", create_model_function=tvm.alexnet, model_weights=tvm.AlexNet_Weights),
]
)
@register_encoder("alexnet", IMAGE)
class TVAlexNetEncoder(TVBaseEncoder):
# specify base model type
torchvision_model_type: str = "alexnet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
# TODO: discussion w/ justin
# @property
# def get_torchvision_model_type(self):
# return "alexnet"
def _remove_softmax_layer(self):
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVAlexNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant(
variant_id="tiny", create_model_function=tvm.convnext_tiny, model_weights=tvm.ConvNeXt_Tiny_Weights
),
TVModelVariant(
variant_id="small", create_model_function=tvm.convnext_small, model_weights=tvm.ConvNeXt_Small_Weights
),
TVModelVariant(
variant_id="base", create_model_function=tvm.convnext_base, model_weights=tvm.ConvNeXt_Base_Weights
),
TVModelVariant(
variant_id="large", create_model_function=tvm.convnext_large, model_weights=tvm.ConvNeXt_Large_Weights
),
]
)
@register_encoder("convnext", IMAGE)
class TVConvNeXtEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "convnext"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVConvNeXtEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant(121, tvm.densenet121, tvm.DenseNet121_Weights),
TVModelVariant(161, tvm.densenet161, tvm.DenseNet161_Weights),
TVModelVariant(169, tvm.densenet169, tvm.DenseNet169_Weights),
TVModelVariant(201, tvm.densenet201, tvm.DenseNet201_Weights),
]
)
@register_encoder("densenet", IMAGE)
class TVDenseNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "densenet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVDenseNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("b0", tvm.efficientnet_b0, tvm.EfficientNet_B0_Weights),
TVModelVariant("b1", tvm.efficientnet_b1, tvm.EfficientNet_B1_Weights),
TVModelVariant("b2", tvm.efficientnet_b2, tvm.EfficientNet_B2_Weights),
TVModelVariant("b3", tvm.efficientnet_b3, tvm.EfficientNet_B3_Weights),
TVModelVariant("b4", tvm.efficientnet_b4, tvm.EfficientNet_B4_Weights),
TVModelVariant("b5", tvm.efficientnet_b5, tvm.EfficientNet_B5_Weights),
TVModelVariant("b6", tvm.efficientnet_b6, tvm.EfficientNet_B6_Weights),
TVModelVariant("b7", tvm.efficientnet_b7, tvm.EfficientNet_B7_Weights),
TVModelVariant("v2_s", tvm.efficientnet_v2_s, tvm.EfficientNet_V2_S_Weights),
TVModelVariant("v2_m", tvm.efficientnet_v2_m, tvm.EfficientNet_V2_M_Weights),
TVModelVariant("v2_l", tvm.efficientnet_v2_l, tvm.EfficientNet_V2_L_Weights),
]
)
@register_encoder("efficientnet", IMAGE)
class TVEfficientNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "efficientnet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVEfficientNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("base", tvm.googlenet, tvm.GoogLeNet_Weights),
]
)
@register_encoder("googlenet", IMAGE)
class TVGoogLeNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "googlenet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
# if auxiliary network exists, eliminate auxiliary network
# to resolve issue when loading a saved model which does not
# contain the auxiliary network
if self.model.aux_logits:
self.model.aux_logits = False
self.model.aux1 = None
self.model.aux2 = None
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVGoogLeNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("base", tvm.inception_v3, tvm.Inception_V3_Weights),
]
)
@register_encoder("inceptionv3", IMAGE)
class TVInceptionV3Encoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "inceptionv3"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
# if auxiliary network exists, eliminate auxiliary network
# to resolve issue when loading a saved model which does not
# contain the auxiliary network
if self.model.aux_logits:
self.model.aux_logits = False
self.model.AuxLogits = None
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVInceptionV3EncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("t", tvm.maxvit_t, tvm.MaxVit_T_Weights),
]
)
@register_encoder("maxvit", IMAGE)
class TVMaxVitEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "maxvit"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVMaxVitEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("0_5", tvm.mnasnet0_5, tvm.mnasnet.MNASNet0_5_Weights),
TVModelVariant("0_75", tvm.mnasnet0_75, tvm.mnasnet.MNASNet0_75_Weights),
TVModelVariant("1_0", tvm.mnasnet1_0, tvm.mnasnet.MNASNet1_0_Weights),
TVModelVariant("1_3", tvm.mnasnet1_3, tvm.mnasnet.MNASNet1_3_Weights),
]
)
@register_encoder("mnasnet", IMAGE)
class TVMNASNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "mnasnet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVMNASNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("base", tvm.mobilenet_v2, tvm.MobileNet_V2_Weights),
]
)
@register_encoder("mobilenetv2", IMAGE)
class TVMobileNetV2Encoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "mobilenetv2"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVMobileNetV2EncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("small", tvm.mobilenet_v3_small, tvm.MobileNet_V3_Small_Weights),
TVModelVariant("large", tvm.mobilenet_v3_large, tvm.MobileNet_V3_Large_Weights),
]
)
@register_encoder("mobilenetv3", IMAGE)
class TVMobileNetV3Encoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "mobilenetv3"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVMobileNetV3EncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("x_16gf", tvm.regnet_x_16gf, tvm.RegNet_X_16GF_Weights),
TVModelVariant("x_1_6gf", tvm.regnet_x_1_6gf, tvm.RegNet_X_1_6GF_Weights),
TVModelVariant("x_32gf", tvm.regnet_x_32gf, tvm.RegNet_X_32GF_Weights),
TVModelVariant("x_3_2gf", tvm.regnet_x_3_2gf, tvm.RegNet_X_3_2GF_Weights),
TVModelVariant("x_400mf", tvm.regnet_x_400mf, tvm.RegNet_X_400MF_Weights),
TVModelVariant("x_800mf", tvm.regnet_x_800mf, tvm.RegNet_X_800MF_Weights),
TVModelVariant("x_8gf", tvm.regnet_x_8gf, tvm.RegNet_X_8GF_Weights),
TVModelVariant("y_128gf", tvm.regnet_y_128gf, tvm.RegNet_Y_128GF_Weights),
TVModelVariant("y_16gf", tvm.regnet_y_16gf, tvm.RegNet_Y_16GF_Weights),
TVModelVariant("y_1_6gf", tvm.regnet_y_1_6gf, tvm.RegNet_Y_1_6GF_Weights),
TVModelVariant("y_32gf", tvm.regnet_y_32gf, tvm.RegNet_Y_32GF_Weights),
TVModelVariant("y_3_2gf", tvm.regnet_y_3_2gf, tvm.RegNet_Y_3_2GF_Weights),
TVModelVariant("y_400mf", tvm.regnet_y_400mf, tvm.RegNet_Y_400MF_Weights),
TVModelVariant("y_800mf", tvm.regnet_y_800mf, tvm.RegNet_Y_800MF_Weights),
TVModelVariant("y_8gf", tvm.regnet_y_8gf, tvm.RegNet_Y_8GF_Weights),
]
)
@register_encoder("regnet", IMAGE)
class TVRegNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "regnet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVRegNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant(18, tvm.resnet18, tvm.ResNet18_Weights),
TVModelVariant(34, tvm.resnet34, tvm.ResNet34_Weights),
TVModelVariant(50, tvm.resnet50, tvm.ResNet50_Weights),
TVModelVariant(101, tvm.resnet101, tvm.ResNet101_Weights),
TVModelVariant(152, tvm.resnet152, tvm.ResNet152_Weights),
]
)
@register_encoder("resnet", IMAGE)
class TVResNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "resnet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVResNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("50_32x4d", tvm.resnext50_32x4d, tvm.ResNeXt50_32X4D_Weights),
TVModelVariant("101_32x8d", tvm.resnext101_32x8d, tvm.ResNeXt101_32X8D_Weights),
TVModelVariant("101_64x4d", tvm.resnext101_64x4d, tvm.ResNeXt101_64X4D_Weights),
]
)
@register_encoder("resnext", IMAGE)
class TVResNeXtEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "resnext"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVResNeXtEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("x0_5", tvm.shufflenet_v2_x0_5, tvm.ShuffleNet_V2_X0_5_Weights),
TVModelVariant("x1_0", tvm.shufflenet_v2_x1_0, tvm.ShuffleNet_V2_X1_0_Weights),
TVModelVariant("x1_5", tvm.shufflenet_v2_x1_5, tvm.ShuffleNet_V2_X1_5_Weights),
TVModelVariant("x2_0", tvm.shufflenet_v2_x2_0, tvm.ShuffleNet_V2_X2_0_Weights),
]
)
@register_encoder("shufflenet_v2", IMAGE)
class TVShuffleNetV2Encoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "shufflenet_v2"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVShuffleNetV2EncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("1_0", tvm.squeezenet1_0, tvm.SqueezeNet1_0_Weights),
TVModelVariant("1_1", tvm.squeezenet1_1, tvm.SqueezeNet1_1_Weights),
]
)
@register_encoder("squeezenet", IMAGE)
class TVSqueezeNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "squeezenet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
# SqueezeNet does not have a final nn.Linear() layer
# Use flatten output from last AdaptiveAvgPool2d layer
# as encoder output.
pass
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVSqueezeNetEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("t", tvm.swin_t, tvm.Swin_T_Weights),
TVModelVariant("s", tvm.swin_s, tvm.Swin_S_Weights),
TVModelVariant("b", tvm.swin_b, tvm.Swin_B_Weights),
]
)
@register_encoder("swin_transformer", IMAGE)
class TVSwinTransformerEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "swin_transformer"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.head = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVSwinTransformerEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant(11, tvm.vgg11, tvm.VGG11_Weights),
TVModelVariant("11_bn", tvm.vgg11_bn, tvm.VGG11_BN_Weights),
TVModelVariant(13, tvm.vgg13, tvm.VGG13_Weights),
TVModelVariant("13_bn", tvm.vgg13_bn, tvm.VGG13_BN_Weights),
TVModelVariant(16, tvm.vgg16, tvm.VGG16_Weights),
TVModelVariant("16_bn", tvm.vgg16_bn, tvm.VGG16_BN_Weights),
TVModelVariant(19, tvm.vgg19, tvm.VGG19_Weights),
TVModelVariant("19_bn", tvm.vgg19_bn, tvm.VGG19_BN_Weights),
]
)
@register_encoder("vgg", IMAGE)
class TVVGGEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "vgg"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.classifier[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVVGGEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("b_16", tvm.vit_b_16, tvm.ViT_B_16_Weights),
TVModelVariant("b_32", tvm.vit_b_32, tvm.ViT_B_32_Weights),
TVModelVariant("l_16", tvm.vit_l_16, tvm.ViT_L_16_Weights),
TVModelVariant("l_32", tvm.vit_l_32, tvm.ViT_L_32_Weights),
TVModelVariant("h_14", tvm.vit_h_14, tvm.ViT_H_14_Weights),
]
)
@register_encoder("vit", IMAGE)
class TVViTEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "vit"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
# The expected image size depends on the model variant and weight specification.
# When pretrained weights are not used, look up the expected size at run time and
# pass it to the model constructor through kwargs["image_size"]. When pretrained
# weights are used, the correct image size is already set.
if not kwargs.get("use_pretrained", True):
weights_specification = torchvision_model_registry[self.torchvision_model_type][
kwargs["model_variant"]
].model_weights.DEFAULT
kwargs["image_size"] = weights_specification.transforms.keywords["crop_size"]
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.heads[-1] = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVViTEncoderConfig
@DeveloperAPI
@register_torchvision_model_variants(
[
TVModelVariant("50_2", tvm.wide_resnet50_2, tvm.Wide_ResNet50_2_Weights),
TVModelVariant("101_2", tvm.wide_resnet101_2, tvm.Wide_ResNet101_2_Weights),
]
)
@register_encoder("wide_resnet", IMAGE)
class TVWideResNetEncoder(TVBaseEncoder):
# specify base torchvision model
torchvision_model_type: str = "wide_resnet"
def __init__(
self,
**kwargs,
):
logger.debug(f" {self.name}")
super().__init__(**kwargs)
def _remove_softmax_layer(self) -> None:
self.model.fc = torch.nn.Identity()
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TVWideResNetEncoderConfig
================================================
FILE: ludwig/encoders/registry.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.encoders.base import Encoder
from ludwig.utils.registry import Registry
_encoder_registry = Registry()
_sequence_encoder_registry = Registry()
@DeveloperAPI
def get_encoder_registry() -> Registry:
return _encoder_registry
@DeveloperAPI
def get_sequence_encoder_registry() -> Registry:
return _sequence_encoder_registry
def register_sequence_encoder(name: str):
def wrap(cls):
get_sequence_encoder_registry()[name] = cls
return cls
return wrap
def register_encoder(name: str, features: str | list[str]):
if isinstance(features, str):
features = [features]
def update_registry(registry_getter_fn, cls, feature):
feature_registry = registry_getter_fn().get(feature, {})
feature_registry[name] = cls
registry_getter_fn()[feature] = feature_registry
def wrap(cls):
for feature in features:
update_registry(get_encoder_registry, cls, feature)
return cls
return wrap
def get_encoder_cls(feature: str, name: str) -> type[Encoder]:
return get_encoder_registry()[feature][name]
def get_encoder_classes(feature: str) -> dict[str, type[Encoder]]:
return get_encoder_registry()[feature]
================================================
FILE: ludwig/encoders/sequence_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from torch import nn
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUDIO, ENCODER_OUTPUT, ENCODER_OUTPUT_STATE, SEQUENCE, TEXT, TIMESERIES
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder, register_sequence_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.attention_modules import TransformerStack
from ludwig.modules.convolutional_modules import Conv1DStack, ParallelConv1D, ParallelConv1DStack
from ludwig.modules.embedding_modules import EmbedSequence, TokenAndPositionEmbedding
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.modules.recurrent_modules import RecurrentStack
from ludwig.modules.reduction_modules import SequenceReducer
from ludwig.schema.encoders.sequence_encoders import (
ParallelCNNConfig,
SequenceEmbedConfig,
SequenceEncoderConfig,
SequencePassthroughConfig,
StackedCNNConfig,
StackedCNNRNNConfig,
StackedParallelCNNConfig,
StackedRNNConfig,
StackedTransformerConfig,
)
logger = logging.getLogger(__name__)
class SequenceEncoder(Encoder):
pass
@DeveloperAPI
@register_encoder("passthrough", [SEQUENCE, TEXT, TIMESERIES])
class SequencePassthroughEncoder(SequenceEncoder):
def __init__(
self,
reduce_output: str | None = None,
max_sequence_length: int = 256,
encoding_size: int | None = None,
encoder_config=None,
**kwargs,
):
"""
:param reduce_output: defines how to reduce the output tensor along
the `s` sequence length dimension if the rank of the tensor
is greater than 2. Available values are: `sum`,
`mean` or `avg`, `max`, `concat` (concatenates along
the first dimension), `last` (returns the last vector of the
first dimension) and `None` or `null` (which does not reduce
and returns the full tensor).
:param max_sequence_length: The maximum sequence length.
:param encoding_size: The size of the encoding vector, or None if sequence elements are scalars.
"""
super().__init__()
self.config = encoder_config
self.max_sequence_length = max_sequence_length
logger.debug(f" {self.name}")
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output, max_sequence_length=max_sequence_length, encoding_size=encoding_size
)
if self.reduce_output is None:
self.supports_masking = True
def forward(self, input_sequence: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param input_sequence: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32 or
[batch x sequence length x encoding size], type torch.float32
:type input_sequence: Tensor
:param mask: Sequence mask (not yet implemented).
Shape: [batch x sequence length]
:type mask: Tensor
"""
input_sequence = input_sequence.type(torch.float32)
while len(input_sequence.shape) < 3:
input_sequence = input_sequence.unsqueeze(-1)
hidden = self.reduce_sequence(input_sequence)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return SequencePassthroughConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return self.input_shape
@DeveloperAPI
@register_encoder("embed", [SEQUENCE, TEXT])
class SequenceEmbedEncoder(SequenceEncoder):
def __init__(
self,
vocab,
max_sequence_length,
representation="dense",
embedding_size=256,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
weights_initializer=None,
dropout=0,
reduce_output="sum",
encoder_config=None,
**kwargs,
):
"""
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param max_sequence_length: The maximum sequence length.
:type max_sequence_length: int
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
This is useful when loading pretrained embeddings, to avoid
fine-tuning them. This parameter has effect only when
`representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter lets you specify
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept;
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embedding matrices are stored
in GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be too large to fit. This parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve the embedding lookups, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:type embeddings_on_cpu: Boolean
:param weights_initializer: the initializer to use. If `None`, the default
initializer of each variable is used (`xavier_uniform`
in most cases). Options are: `constant`, `identity`, `zeros`,
`ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`,
`he_normal`, `he_uniform`, `lecun_normal`, `lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters, e.g.
`{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer to
PyTorch's documentation.
:type weights_initializer: str
:param dropout: The dropout probability.
:type dropout: float
:param reduce_output: defines how to reduce the output tensor along
the `s` sequence length dimension if the rank of the tensor
is greater than 2. Available values are: `sum`,
`mean` or `avg`, `max`, `concat` (concatenates along
the first dimension), `last` (returns the last vector of the
first dimension) and `None` or `null` (which does not reduce
and returns the full tensor).
:type reduce_output: str
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.embedding_size = embedding_size
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output is None:
self.supports_masking = True
logger.debug(" EmbedSequence")
self.embed_sequence = EmbedSequence(
vocab,
embedding_size,
max_sequence_length=max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=max_sequence_length,
encoding_size=self.embed_sequence.output_shape[-1],
)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented in EmbedSequence)
"""
embedded_sequence = self.embed_sequence(inputs, mask=mask)
hidden = self.reduce_sequence(embedded_sequence)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return SequenceEmbedConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return self.reduce_sequence.output_shape
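

# --- Illustrative sketch, not part of Ludwig's API ---------------------------
# The `embedding_size` docstring above says the effective embedding width is
# min(vocabulary_size, embedding_size) for `dense` representations and exactly
# vocabulary_size for `sparse` ones. The hypothetical helper below restates
# that rule; it assumes `vocab_size` already counts any special symbols.
def _demo_effective_embedding_size(vocab_size, embedding_size, representation="dense"):
    if representation == "dense":
        # Dense embeddings are capped by the vocabulary size.
        return min(vocab_size, embedding_size)
    if representation == "sparse":
        # One-hot encodings need one slot per vocabulary entry.
        return vocab_size
    raise ValueError(f"Unknown representation: {representation!r}")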
@DeveloperAPI
@register_sequence_encoder("parallel_cnn")
@register_encoder("parallel_cnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
class ParallelCNN(SequenceEncoder):
def __init__(
self,
should_embed=True,
vocab=None,
representation="dense",
embedding_size=256,
max_sequence_length=None,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
conv_layers=None,
num_conv_layers=None,
filter_size=3,
num_filters=256,
pool_function="max",
pool_size=None,
fc_layers=None,
num_fc_layers=None,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
activation="relu",
dropout=0,
reduce_output="max",
encoder_config=None,
**kwargs,
):
# todo: revise docstring
"""
:param should_embed: If True the input sequence is expected
to be made of integers and will be mapped into embeddings
:type should_embed: Boolean
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: Str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
This is useful when loading pretrained embeddings, to avoid
fine-tuning them. This parameter has effect only when
`representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter lets you specify
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept;
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embedding matrices are stored
in GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be too large to fit. This parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve the embedding lookups, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:type embeddings_on_cpu: Boolean
:param conv_layers: it is a list of dictionaries containing
the parameters of all the convolutional layers. The length
of the list determines the number of parallel convolutional
layers and the content of each dictionary determines
the parameters for a specific layer. The available parameters
for each layer are: `filter_size`, `num_filters`, `pool`,
`norm`, and `activation`. If any of those values
is missing from the dictionary, the default one specified
as a parameter of the encoder will be used instead. If both
`conv_layers` and `num_conv_layers` are `None`, a default
list will be assigned to `conv_layers` with the value
`[{filter_size: 2}, {filter_size: 3}, {filter_size: 4},
{filter_size: 5}]`.
:type conv_layers: List
:param num_conv_layers: if `conv_layers` is `None`, this is
the number of parallel convolutional layers.
:type num_conv_layers: Integer
:param filter_size: if a `filter_size` is not already specified in
`conv_layers` this is the default `filter_size` that
will be used for each layer. It indicates how wide is
the 1d convolutional filter.
:type filter_size: Integer
:param num_filters: if a `num_filters` is not already specified in
`conv_layers` this is the default `num_filters` that
will be used for each layer. It indicates the number
of filters, and by consequence the output channels of
the 1d convolution.
:type num_filters: Integer
:param pool_size: if a `pool_size` is not already specified
in `conv_layers` this is the default `pool_size` that
will be used for each layer. It indicates the size of
the max pooling that will be performed along the `s` sequence
dimension after the convolution operation.
:type pool_size: Integer
:param fc_layers: it is a list of dictionaries containing
the parameters of all the fully connected layers. The length
of the list determines the number of stacked fully connected
layers and the content of each dictionary determines
the parameters for a specific layer. The available parameters
for each layer are: `output_size`, `norm` and `activation`.
If any of those values is missing from
the dictionary, the default one specified as a parameter of
the encoder will be used instead. If both `fc_layers` and
`num_fc_layers` are `None`, a default list will be assigned
to `fc_layers` with the value
`[{output_size: 512}, {output_size: 256}]`
(only applies if `reduce_output` is not `None`).
:type fc_layers: List
:param num_fc_layers: if `fc_layers` is `None`, this is the number
of stacked fully connected layers (only applies if
`reduce_output` is not `None`).
:type num_fc_layers: Integer
:param output_size: if a `output_size` is not already specified in
`fc_layers` this is the default `output_size` that will be used
for each layer. It indicates the size of the output
of a fully connected layer.
:type output_size: Integer
:param norm: if a `norm` is not already specified in `conv_layers`
or `fc_layers` this is the default `norm` that will be used
for each layer. It indicates the norm of the output.
:type norm: str
:param activation: Default activation function to use
:type activation: Str
:param dropout: default dropout rate passed to the embedding,
convolutional and fully connected layers.
:type dropout: float
:param weights_initializer: the initializer to use. If `None` it uses
`xavier_uniform`. Options are: `constant`, `identity`,
`zeros`, `ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`,
`he_normal`, `he_uniform`, `lecun_normal`, `lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters,
e.g. `{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer
to PyTorch's documentation.
:type weights_initializer: str
:param reduce_output: defines how to reduce the output tensor of
the convolutional layers along the `s` sequence length
dimension if the rank of the tensor is greater than 2.
Available values are: `sum`, `mean` or `avg`, `max`, `concat`
(concatenates along the first dimension), `last` (returns
the last vector of the first dimension) and `None` or `null`
(which does not reduce and returns the full tensor).
:type reduce_output: str
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.max_sequence_length = max_sequence_length
if conv_layers is not None and num_conv_layers is None:
# use custom-defined layers
self.conv_layers = conv_layers
self.num_conv_layers = len(conv_layers)
elif conv_layers is None and num_conv_layers is not None:
# generate num_conv_layers with default parameters
self.conv_layers = None
self.num_conv_layers = num_conv_layers
elif conv_layers is None and num_conv_layers is None:
# use default layers with varying filter sizes
self.conv_layers = [{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}]
self.num_conv_layers = 4
else:
raise ValueError("Invalid layer parametrization, use either conv_layers or num_conv_layers")
# The user is expected to provide fc_layers or num_fc_layers
# The following logic handles the case where the user either provides
# both or neither.
if fc_layers is None and num_fc_layers is None:
# use default fully connected layers with decreasing output sizes
fc_layers = [{"output_size": 512}, {"output_size": 256}]
num_fc_layers = 2
elif fc_layers is not None and num_fc_layers is not None:
raise ValueError("Invalid layer parametrization, use either fc_layers or num_fc_layers only. Not both.")
self.should_embed = should_embed
self.embed_sequence = None
if self.should_embed:
logger.debug(" EmbedSequence")
self.embed_sequence = EmbedSequence(
vocab,
embedding_size,
max_sequence_length=max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" ParallelConv1D")
in_channels = self.embed_sequence.output_shape[-1] if self.should_embed else embedding_size
self.parallel_conv1d = ParallelConv1D(
in_channels=in_channels,
max_sequence_length=self.max_sequence_length,
layers=self.conv_layers,
default_num_filters=num_filters,
default_filter_size=filter_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
default_pool_function=pool_function,
default_pool_size=pool_size,
default_pool_padding="same",
)
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=max_sequence_length,
encoding_size=self.parallel_conv1d.output_shape[-1],
)
if self.reduce_output is not None:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.reduce_sequence.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented)
"""
# ================ Embeddings ================
if self.should_embed:
embedded_sequence = self.embed_sequence(inputs, mask=mask)
else:
embedded_sequence = inputs
while len(embedded_sequence.shape) < 3:
embedded_sequence = embedded_sequence.unsqueeze(-1)
embedded_sequence = embedded_sequence.to(dtype=torch.float)
# shape=(?, sequence_length, embedding_size)
hidden = embedded_sequence
# ================ Conv Layers ================
hidden = self.parallel_conv1d(hidden, mask=mask)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden, mask=mask)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return ParallelCNNConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is not None:
return self.fc_stack.output_shape
return self.parallel_conv1d.output_shape
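

# --- Illustrative sketch, not part of Ludwig's API ---------------------------
# The ParallelCNN constructor above accepts either an explicit `conv_layers`
# list or a `num_conv_layers` count, falls back to a default list when both
# are None, and rejects the case where both are given. The hypothetical
# helper below isolates that decision so it can be read and tried on its own.
def _demo_resolve_conv_layers(conv_layers=None, num_conv_layers=None):
    if conv_layers is not None and num_conv_layers is None:
        # Custom layers: the count follows from the list.
        return conv_layers, len(conv_layers)
    if conv_layers is None and num_conv_layers is not None:
        # Only a count: layers are built later from the encoder defaults.
        return None, num_conv_layers
    if conv_layers is None and num_conv_layers is None:
        # Neither given: four parallel layers with filter sizes 2 to 5.
        default_layers = [{"filter_size": size} for size in (2, 3, 4, 5)]
        return default_layers, len(default_layers)
    raise ValueError("Invalid layer parametrization, use either conv_layers or num_conv_layers")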
@DeveloperAPI
@register_sequence_encoder("stacked_cnn")
@register_encoder("stacked_cnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
class StackedCNN(SequenceEncoder):
def __init__(
self,
should_embed=True,
vocab=None,
representation="dense",
embedding_size=256,
max_sequence_length=None,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
conv_layers=None,
num_conv_layers=None,
num_filters=256,
filter_size=5,
strides=1,
# todo: assess how to specify padding for equivalent to 'same'
padding="same",
dilation_rate=1,
pool_function="max",
pool_size=None,
pool_strides=None,
# todo: determine how to pool_padding equivalent of 'same'
pool_padding="same",
fc_layers=None,
num_fc_layers=None,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
activation="relu",
dropout=0,
reduce_output="max",
encoder_config=None,
**kwargs,
):
# todo: fixup docstring
"""
:param should_embed: If True the input sequence is expected
to be made of integers and will be mapped into embeddings
:type should_embed: Boolean
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: Str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
This is useful when loading pretrained embeddings, to avoid
fine-tuning them. This parameter has effect only when
`representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter lets you specify
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept;
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embedding matrices are stored
in GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be too large to fit. This parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve the embedding lookups, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param conv_layers: it is a list of dictionaries containing
the parameters of all the convolutional layers. The length
of the list determines the number of stacked convolutional
layers and the content of each dictionary determines
the parameters for a specific layer. The available parameters
for each layer are: `filter_size`, `num_filters`, `pool_size`,
`norm` and `activation`. If any of those values
is missing from the dictionary, the default one specified
as a parameter of the encoder will be used instead. If both
`conv_layers` and `num_conv_layers` are `None`, a default
list of six layers will be assigned to `conv_layers`: two with
`filter_size: 7` and `pool_size: 3`, three with `filter_size: 3`
and no pooling, and one with `filter_size: 3` and `pool_size: 3`.
:type conv_layers: List
:param num_conv_layers: if `conv_layers` is `None`, this is
the number of stacked convolutional layers.
:type num_conv_layers: Integer
:param filter_size: if a `filter_size` is not already specified in
`conv_layers` this is the default `filter_size` that
will be used for each layer. It indicates how wide is
the 1d convolutional filter.
:type filter_size: Integer
:param num_filters: if a `num_filters` is not already specified in
`conv_layers` this is the default `num_filters` that
will be used for each layer. It indicates the number
of filters, and by consequence the output channels of
the 1d convolution.
:type num_filters: Integer
:param pool_size: if a `pool_size` is not already specified
in `conv_layers` this is the default `pool_size` that
will be used for each layer. It indicates the size of
the max pooling that will be performed along the `s` sequence
dimension after the convolution operation.
:type pool_size: Integer
:param fc_layers: it is a list of dictionaries containing
the parameters of all the fully connected layers. The length
of the list determines the number of stacked fully connected
layers and the content of each dictionary determines
the parameters for a specific layer. The available parameters
for each layer are: `output_size`, `norm` and `activation`.
If any of those values is missing from
the dictionary, the default one specified as a parameter of
the encoder will be used instead. If both `fc_layers` and
`num_fc_layers` are `None`, a default list will be assigned
to `fc_layers` with the value
`[{output_size: 512}, {output_size: 256}]`
(only applies if `reduce_output` is not `None`).
:type fc_layers: List
:param num_fc_layers: if `fc_layers` is `None`, this is the number
of stacked fully connected layers (only applies if
`reduce_output` is not `None`).
:type num_fc_layers: Integer
:param output_size: if a `output_size` is not already specified in
`fc_layers` this is the default `output_size` that will be used
for each layer. It indicates the size of the output
of a fully connected layer.
:type output_size: Integer
:param norm: if a `norm` is not already specified in `conv_layers`
or `fc_layers` this is the default `norm` that will be used
for each layer. It indicates the norm of the output.
:type norm: str
:param activation: Default activation function to use
:type activation: Str
:param dropout: default dropout rate passed to the embedding,
convolutional and fully connected layers.
:type dropout: float
:param weights_initializer: the initializer to use. If `None` it uses
`xavier_uniform`. Options are: `constant`, `identity`,
`zeros`, `ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`,
`he_normal`, `he_uniform`, `lecun_normal`, `lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters,
e.g. `{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer
to PyTorch's documentation.
:type weights_initializer: str
:param reduce_output: defines how to reduce the output tensor of
the convolutional layers along the `s` sequence length
dimension if the rank of the tensor is greater than 2.
Available values are: `sum`, `mean` or `avg`, `max`, `concat`
(concatenates along the first dimension), `last` (returns
the last vector of the first dimension) and `None` or `null`
(which does not reduce and returns the full tensor).
:type reduce_output: str
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
if conv_layers is not None and num_conv_layers is None:
# use custom-defined layers
self.conv_layers = conv_layers
self.num_conv_layers = len(conv_layers)
elif conv_layers is None and num_conv_layers is not None:
# generate num_conv_layers with default parameters
self.conv_layers = None
self.num_conv_layers = num_conv_layers
elif conv_layers is None and num_conv_layers is None:
# use default layers with varying filter sizes
self.conv_layers = [
{
"filter_size": 7,
"pool_size": 3,
},
{
"filter_size": 7,
"pool_size": 3,
},
{
"filter_size": 3,
"pool_size": None,
},
{
"filter_size": 3,
"pool_size": None,
},
{
"filter_size": 3,
"pool_size": None,
},
{
"filter_size": 3,
"pool_size": 3,
},
]
self.num_conv_layers = 6
else:
raise ValueError("Invalid layer parametrization, use either conv_layers or num_conv_layers")
# The user is expected to provide fc_layers or num_fc_layers
# The following logic handles the case where the user either provides
# both or neither.
if fc_layers is None and num_fc_layers is None:
# use default fully connected layers with decreasing output sizes
fc_layers = [{"output_size": 512}, {"output_size": 256}]
num_fc_layers = 2
elif fc_layers is not None and num_fc_layers is not None:
raise ValueError("Invalid layer parametrization, use either fc_layers or num_fc_layers only. Not both.")
self.max_sequence_length = max_sequence_length
self.num_filters = num_filters
self.should_embed = should_embed
self.embed_sequence = None
if self.should_embed:
logger.debug(" EmbedSequence")
self.embed_sequence = EmbedSequence(
vocab,
embedding_size,
max_sequence_length=self.max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" Conv1DStack")
in_channels = self.embed_sequence.output_shape[-1] if self.should_embed else embedding_size
self.conv1d_stack = Conv1DStack(
in_channels=in_channels,
max_sequence_length=max_sequence_length,
layers=self.conv_layers,
default_num_filters=num_filters,
default_filter_size=filter_size,
default_strides=strides,
default_padding=padding,
default_dilation_rate=dilation_rate,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
default_pool_function=pool_function,
default_pool_size=pool_size,
default_pool_strides=pool_strides,
default_pool_padding=pool_padding,
)
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=self.conv1d_stack.output_shape[-2],
encoding_size=self.conv1d_stack.output_shape[-1],
)
if self.reduce_output is not None:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.reduce_sequence.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return StackedCNNConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return self.conv1d_stack.output_shape
return self.fc_stack.output_shape
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented)
"""
# ================ Embeddings ================
if self.should_embed:
embedded_sequence = self.embed_sequence(inputs, mask=mask)
else:
embedded_sequence = inputs
while len(embedded_sequence.shape) < 3:
embedded_sequence = embedded_sequence.unsqueeze(-1)
# shape=(?, sequence_length, embedding_size)
hidden = embedded_sequence
# ================ Conv Layers ================
hidden = self.conv1d_stack(hidden, mask=mask)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden, mask=mask)
# no reduction: hidden [batch_size, seq_size, num_filters]
# with reduction: hidden [batch_size, output_size]
return {ENCODER_OUTPUT: hidden}
@DeveloperAPI
@register_sequence_encoder("stacked_parallel_cnn")
@register_encoder("stacked_parallel_cnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
class StackedParallelCNN(SequenceEncoder):
def __init__(
self,
should_embed=True,
vocab=None,
representation="dense",
embedding_size=256,
max_sequence_length=None,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
stacked_layers=None,
num_stacked_layers=None,
filter_size=3,
num_filters=256,
pool_function="max",
pool_size=None,
fc_layers=None,
num_fc_layers=None,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
activation="relu",
dropout=0,
reduce_output="max",
encoder_config=None,
**kwargs,
):
# todo: review docstring
"""
:param should_embed: If True the input sequence is expected
to be made of integers and will be mapped into embeddings
:type should_embed: Boolean
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: Str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
It may be useful when loading pretrained embeddings
for avoiding finetuning them. This parameter has an effect only
when `representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter lets you specify
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept,
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has an effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embedding matrices are stored
on GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param stacked_layers: it is a list of lists of dictionaries
containing the parameters of the stack of
parallel convolutional layers. The length of the list
determines the number of stacked parallel
convolutional layers, length of the sub-lists determines
the number of parallel conv layers and the content
of each dictionary determines the parameters for
a specific layer. The available parameters for each layer are:
`filter_size`, `num_filters`, `pool_size`, `norm` and
`activation`. If any of those values
is missing from the dictionary, the default one specified
as a parameter of the encoder will be used instead. If both
`stacked_layers` and `num_stacked_layers` are `None`,
a default list will be assigned to `stacked_layers` with
the value `[[{filter_size: 2}, {filter_size: 3},
{filter_size: 4}, {filter_size: 5}], [{filter_size: 2},
{filter_size: 3}, {filter_size: 4}, {filter_size: 5}],
[{filter_size: 2}, {filter_size: 3}, {filter_size: 4},
{filter_size: 5}]]`.
:type stacked_layers: List
:param num_stacked_layers: if `stacked_layers` is `None`, this is
the number of elements in the stack of
parallel convolutional layers.
:type num_stacked_layers: Integer
:param filter_size: if a `filter_size` is not already specified in
`stacked_layers` this is the default `filter_size` that
will be used for each layer. It indicates how wide
the 1d convolutional filter is.
:type filter_size: Integer
:param num_filters: if a `num_filters` is not already specified in
`stacked_layers` this is the default `num_filters` that
will be used for each layer. It indicates the number
of filters, and by consequence the output channels of
the 1d convolution.
:type num_filters: Integer
:param pool_size: if a `pool_size` is not already specified
in `stacked_layers` this is the default `pool_size` that
will be used for each layer. It indicates the size of
the pooling that will be performed along the `s` sequence
dimension after the convolution operation.
:type pool_size: Integer
:param fc_layers: it is a list of dictionaries containing
the parameters of all the fully connected layers. The length
of the list determines the number of stacked fully connected
layers and the content of each dictionary determines
the parameters for a specific layer. The available parameters
for each layer are: `output_size`, `norm` and `activation`.
If any of those values is missing from
the dictionary, the default one specified as a parameter of
the encoder will be used instead. If both `fc_layers` and
`num_fc_layers` are `None`, a default list will be assigned
to `fc_layers` with the value
`[{output_size: 512}, {output_size: 256}]`
(only applies if `reduce_output` is not `None`).
:type fc_layers: List
:param num_fc_layers: if `fc_layers` is `None`, this is the number
of stacked fully connected layers (only applies if
`reduce_output` is not `None`).
:type num_fc_layers: Integer
:param output_size: if a `output_size` is not already specified in
`fc_layers` this is the default `output_size` that will be used
for each layer. It indicates the size of the output
of a fully connected layer.
:type output_size: Integer
:param norm: if a `norm` is not already specified in `stacked_layers`
or `fc_layers` this is the default `norm` that will be used
for each layer. It indicates the normalization applied to the output.
:type norm: str
:param activation: Default activation function to use
:type activation: Str
:param dropout: default dropout rate applied to the embeddings and
to the convolutional and fully connected layers.
:type dropout: float
:param weights_initializer: the initializer to use for the weights.
The default is `xavier_uniform`. Options are: `constant`, `identity`,
`zeros`, `ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`, `he_normal`, `he_uniform`, `lecun_normal`,
`lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters,
e.g. `{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer
to PyTorch's documentation.
:type weights_initializer: str
:param reduce_output: defines how to reduce the output tensor of
the convolutional layers along the `s` sequence length
dimension if the rank of the tensor is greater than 2.
Available values are: `sum`, `mean` or `avg`, `max`, `concat`
(concatenates along the first dimension), `last` (returns
the last vector of the first dimension) and `None` or `null`
(which does not reduce and returns the full tensor).
:type reduce_output: str
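Note on layer parametrization: `stacked_layers` and `num_stacked_layers`
are mutually exclusive. The sketch below restates the rule described
above, assuming only the branches in this constructor;
`resolve_stacked_layers` is a hypothetical helper used for illustration
and is not part of Ludwig.

```python
def resolve_stacked_layers(stacked_layers=None, num_stacked_layers=None):
    # Hypothetical helper: returns the value kept as `stacked_layers`.
    if stacked_layers is not None and num_stacked_layers is not None:
        raise ValueError("use either stacked_layers or num_stacked_layers")
    if stacked_layers is not None:
        return stacked_layers  # custom layers are used as given
    if num_stacked_layers is not None:
        return None  # kept as None; only the count is stored
    # Neither given: three stacked groups of four parallel layers.
    return [[{"filter_size": size} for size in (2, 3, 4, 5)] for _ in range(3)]


default_layers = resolve_stacked_layers()
print(len(default_layers), len(default_layers[0]))  # 3 4
```

Passing both arguments raises `ValueError`, matching the constructor.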
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.max_sequence_length = max_sequence_length
self.embedding_size = embedding_size
if stacked_layers is not None and num_stacked_layers is None:
# use custom-defined layers
self.stacked_layers = stacked_layers
self.num_stacked_layers = len(stacked_layers)
elif stacked_layers is None and num_stacked_layers is not None:
# generate num_stacked_layers with default parameters
self.stacked_layers = None
self.num_stacked_layers = num_stacked_layers
elif stacked_layers is None and num_stacked_layers is None:
# use default layers with varying filter sizes
self.stacked_layers = [
[{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}],
[{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}],
[{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}],
]
self.num_stacked_layers = len(self.stacked_layers)
else:
raise ValueError("Invalid layer parametrization, use either stacked_layers or num_stacked_layers")
# The user is expected to provide fc_layers or num_fc_layers
# The following logic handles the case where the user either provides
# both or neither.
if fc_layers is None and num_fc_layers is None:
# use the default fully connected layers
fc_layers = [{"output_size": 512}, {"output_size": 256}]
num_fc_layers = 2
elif fc_layers is not None and num_fc_layers is not None:
raise ValueError("Invalid layer parametrization, use either fc_layers or num_fc_layers only. Not both.")
self.should_embed = should_embed
self.embed_sequence = None
if self.should_embed:
logger.debug(" EmbedSequence")
self.embed_sequence = EmbedSequence(
vocab,
embedding_size,
max_sequence_length=self.max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
in_channels = self.embed_sequence.output_shape[-1] if self.should_embed else embedding_size
logger.debug(" ParallelConv1DStack")
self.parallel_conv1d_stack = ParallelConv1DStack(
in_channels=in_channels,
stacked_layers=self.stacked_layers,
max_sequence_length=max_sequence_length,
default_num_filters=num_filters,
default_filter_size=filter_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
default_pool_function=pool_function,
default_pool_size=pool_size,
)
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=self.parallel_conv1d_stack.output_shape[-2],
encoding_size=self.parallel_conv1d_stack.output_shape[-1],
)
if self.reduce_output is not None:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.reduce_sequence.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return StackedParallelCNNConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is not None:
return self.fc_stack.output_shape
return self.parallel_conv1d_stack.output_shape
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented)
"""
# ================ Embeddings ================
if self.should_embed:
embedded_sequence = self.embed_sequence(inputs, mask=mask)
else:
embedded_sequence = inputs
while len(embedded_sequence.shape) < 3:
embedded_sequence = embedded_sequence.unsqueeze(-1)
# shape=(?, sequence_length, embedding_size)
hidden = embedded_sequence
# ================ Conv Layers ================
hidden = self.parallel_conv1d_stack(hidden, mask=mask)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden, mask=mask)
# no reduction: hidden [batch_size, seq_size, num_filter]
# with reduction: hidden [batch_size, output_size]
return {ENCODER_OUTPUT: hidden}
@DeveloperAPI
@register_sequence_encoder("rnn")
@register_encoder("rnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
class StackedRNN(SequenceEncoder):
def __init__(
self,
should_embed=True,
vocab=None,
representation="dense",
embedding_size=256,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
num_layers=1,
max_sequence_length=None,
state_size=256,
cell_type="rnn",
bidirectional=False,
activation="tanh",
recurrent_activation="sigmoid",
unit_forget_bias=True,
recurrent_initializer="orthogonal",
dropout=0.0,
recurrent_dropout=0.0,
fc_layers=None,
num_fc_layers=0,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
fc_activation="relu",
fc_dropout=0,
reduce_output="last",
encoder_config=None,
**kwargs,
):
# todo: fix up docstring
"""
:param should_embed: If True the input sequence is expected
to be made of integers and will be mapped into embeddings
:type should_embed: Boolean
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: Str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
It may be useful when loading pretrained embeddings
for avoiding finetuning them. This parameter has an effect only
when `representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter lets you specify
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept,
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has an effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embeddings matrices are stored
on GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param num_layers: the number of stacked recurrent layers.
:type num_layers: Integer
:param cell_type: the type of recurrent cell to use.
Available values are: `rnn`, `lstm`, `gru`.
For reference about the differences between the cells please
refer to PyTorch's documentation.
:type cell_type: str
:param state_size: the size of the state of the rnn.
:type state_size: Integer
:param bidirectional: if `True` two recurrent networks will perform
encoding in the forward and backward direction and
their outputs will be concatenated.
:type bidirectional: Boolean
:param dropout: dropout rate applied to the embeddings.
:type dropout: float
:param recurrent_dropout: Dropout rate for the recurrent stack.
:type recurrent_dropout: float
:param weights_initializer: the initializer to use for the weights.
The default is `xavier_uniform`. Options are: `constant`, `identity`,
`zeros`, `ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`, `he_normal`, `he_uniform`, `lecun_normal`,
`lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters,
e.g. `{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer
to PyTorch's documentation.
:type weights_initializer: str
:param reduce_output: defines how to reduce the output tensor of
the recurrent layers along the `s` sequence length
dimension if the rank of the tensor is greater than 2.
Available values are: `sum`, `mean` or `avg`, `max`, `concat`
(concatenates along the first dimension), `last` (returns
the last vector of the first dimension) and `None` or `null`
(which does not reduce and returns the full tensor).
:type reduce_output: str
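Note on `reduce_output`: the modes listed above can be illustrated on a
single sequence stored as nested lists. This is a simplified sketch, not
the `SequenceReducer` implementation: the real reducer works on batched
tensors and may treat padded positions differently. `reduce_sequence` is
a hypothetical helper used only for illustration.

```python
def reduce_sequence(sequence, mode):
    # `sequence` is a list of timestep vectors: [sequence_length][hidden_size].
    if mode is None:
        return sequence  # no reduction, the full sequence is returned
    if mode == "last":
        return sequence[-1]
    if mode == "concat":
        return [value for step in sequence for value in step]
    columns = list(zip(*sequence))  # one tuple per hidden unit
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode in ("mean", "avg"):
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown reduce mode: {mode}")


steps = [[1.0, 4.0], [3.0, 2.0]]
print(reduce_sequence(steps, "sum"))   # [4.0, 6.0]
print(reduce_sequence(steps, "max"))   # [3.0, 4.0]
print(reduce_sequence(steps, "last"))  # [3.0, 2.0]
```

Every mode except `None` and `concat` returns one vector of length
`hidden_size` per sequence.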
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.max_sequence_length = max_sequence_length
self.hidden_size = state_size
self.embedding_size = embedding_size
self.should_embed = should_embed
self.embed_sequence = None
if self.should_embed:
logger.debug(" EmbedSequence")
self.embed_sequence = EmbedSequence(
vocab,
embedding_size,
max_sequence_length=self.max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" RecurrentStack")
input_size = self.embed_sequence.output_shape[-1] if self.should_embed else embedding_size
self.recurrent_stack = RecurrentStack(
input_size=input_size,
hidden_size=state_size,
cell_type=cell_type,
max_sequence_length=max_sequence_length,
num_layers=num_layers,
bidirectional=bidirectional,
activation=activation,
recurrent_activation=recurrent_activation,
use_bias=use_bias,
unit_forget_bias=unit_forget_bias,
weights_initializer=weights_initializer,
recurrent_initializer=recurrent_initializer,
bias_initializer=bias_initializer,
dropout=recurrent_dropout,
)
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=self.recurrent_stack.output_shape[-2],
encoding_size=self.recurrent_stack.output_shape[-1], # state_size
)
if self.reduce_output is None:
self.supports_masking = True
else:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.reduce_sequence.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=fc_activation,
default_dropout=fc_dropout,
)
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return StackedRNNConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is not None:
return self.fc_stack.output_shape
return self.recurrent_stack.output_shape
def input_dtype(self):
return torch.int32
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented)
"""
# ================ Embeddings ================
if self.should_embed:
embedded_sequence = self.embed_sequence(inputs, mask=mask)
else:
embedded_sequence = inputs
while len(embedded_sequence.shape) < 3:
embedded_sequence = embedded_sequence.unsqueeze(-1)
# shape=(?, sequence_length, embedding_size)
hidden = embedded_sequence
# ================ Recurrent Layers ================
hidden, final_state = self.recurrent_stack(hidden, mask=mask)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden, mask=mask)
return {ENCODER_OUTPUT: hidden, ENCODER_OUTPUT_STATE: final_state}
@DeveloperAPI
@register_sequence_encoder("cnnrnn")
@register_encoder("cnnrnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
class StackedCNNRNN(SequenceEncoder):
def __init__(
self,
should_embed=True,
vocab=None,
max_sequence_length=None,
representation="dense",
embedding_size=256,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
conv_layers=None,
num_conv_layers=None,
num_filters=256,
filter_size=5,
strides=1,
padding="same",
dilation_rate=1,
conv_activation="relu",
conv_dropout=0.0,
pool_function="max",
pool_size=2,
pool_strides=None,
pool_padding="same",
num_rec_layers=1,
state_size=256,
cell_type="rnn",
bidirectional=False,
activation="tanh",
recurrent_activation="sigmoid",
unit_forget_bias=True,
recurrent_initializer="orthogonal",
dropout=0.0,
recurrent_dropout=0.0,
fc_layers=None,
num_fc_layers=0,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
fc_activation="relu",
fc_dropout=0,
reduce_output="last",
encoder_config=None,
**kwargs,
):
# todo: fix up docstring
"""
:param should_embed: If True the input sequence is expected
to be made of integers and will be mapped into embeddings
:type should_embed: Boolean
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: Str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
It may be useful when loading pretrained embeddings
for avoiding finetuning them. This parameter has an effect only
when `representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter lets you specify
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept,
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has an effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embeddings matrices are stored
on GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
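:param conv_layers: it is a list of dictionaries containing
the parameters of all the convolutional layers. If both
`conv_layers` and `num_conv_layers` are `None`, a default
list will be assigned to `conv_layers` with the value
`[{pool_size: 3}, {pool_size: null}]`.
:type conv_layers: List
:param num_conv_layers: if `conv_layers` is `None`, this is
the number of stacked convolutional layers.
:type num_conv_layers: Integer
:param num_rec_layers: the number of stacked recurrent layers.
:type num_rec_layers: Integer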
:param cell_type: the type of recurrent cell to use.
Available values are: `rnn`, `lstm`, `gru`.
For reference about the differences between the cells please
refer to PyTorch's documentation.
:type cell_type: str
:param state_size: the size of the state of the rnn.
:type state_size: Integer
:param bidirectional: if `True` two recurrent networks will perform
encoding in the forward and backward direction and
their outputs will be concatenated.
:type bidirectional: Boolean
:param dropout: dropout rate applied to the embeddings.
:type dropout: float
:param recurrent_dropout: Dropout rate for the recurrent stack.
:type recurrent_dropout: float
:param weights_initializer: the initializer to use for the weights.
The default is `xavier_uniform`. Options are: `constant`, `identity`,
`zeros`, `ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`, `he_normal`, `he_uniform`, `lecun_normal`,
`lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters,
e.g. `{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer
to PyTorch's documentation.
:type weights_initializer: str
:param reduce_output: defines how to reduce the output tensor of
the recurrent layers along the `s` sequence length
dimension if the rank of the tensor is greater than 2.
Available values are: `sum`, `mean` or `avg`, `max`, `concat`
(concatenates along the first dimension), `last` (returns
the last vector of the first dimension) and `None` or `null`
(which does not reduce and returns the full tensor).
:type reduce_output: str
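Note on `embedding_size`: the rule stated above for the actual embedding
width can be written as a small function. This is a sketch of the
documented rule only, not the `EmbedSequence` implementation;
`effective_embedding_size` is a hypothetical helper used for illustration.

```python
def effective_embedding_size(vocabulary_size, embedding_size, representation="dense"):
    # `dense`: the requested size is capped by the vocabulary size.
    if representation == "dense":
        return min(vocabulary_size, embedding_size)
    # `sparse`: one-hot encodings are exactly as wide as the vocabulary.
    if representation == "sparse":
        return vocabulary_size
    raise ValueError(f"unknown representation: {representation}")


print(effective_embedding_size(100, 256))            # 100
print(effective_embedding_size(50000, 256))          # 256
print(effective_embedding_size(100, 256, "sparse"))  # 100
```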
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
if conv_layers is not None and num_conv_layers is None:
# use custom-defined layers
self.conv_layers = conv_layers
self.num_conv_layers = len(conv_layers)
elif conv_layers is None and num_conv_layers is not None:
# generate num_conv_layers with default parameters
self.conv_layers = None
self.num_conv_layers = num_conv_layers
elif conv_layers is None and num_conv_layers is None:
# use default layers with different pool sizes
self.conv_layers = [{"pool_size": 3}, {"pool_size": None}]
self.num_conv_layers = 2
else:
raise ValueError("Invalid layer parametrization, use either conv_layers or num_conv_layers")
self.max_sequence_length = max_sequence_length
self.should_embed = should_embed
self.embed_sequence = None
if self.should_embed:
logger.debug(" EmbedSequence")
self.embed_sequence = EmbedSequence(
vocab,
embedding_size,
max_sequence_length=self.max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" Conv1DStack")
in_channels = self.embed_sequence.output_shape[-1] if self.should_embed else embedding_size
self.conv1d_stack = Conv1DStack(
in_channels=in_channels,
max_sequence_length=max_sequence_length,
layers=self.conv_layers,
default_num_filters=num_filters,
default_filter_size=filter_size,
default_strides=strides,
default_padding=padding,
default_dilation_rate=dilation_rate,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=conv_activation,
default_dropout=conv_dropout,
default_pool_function=pool_function,
default_pool_size=pool_size,
default_pool_strides=pool_strides,
default_pool_padding=pool_padding,
)
logger.debug(" RecurrentStack")
self.recurrent_stack = RecurrentStack(
input_size=self.conv1d_stack.output_shape[1],
hidden_size=state_size,
max_sequence_length=self.conv1d_stack.output_shape[0],
cell_type=cell_type,
num_layers=num_rec_layers,
bidirectional=bidirectional,
activation=activation,
recurrent_activation=recurrent_activation,
use_bias=use_bias,
unit_forget_bias=unit_forget_bias,
weights_initializer=weights_initializer,
recurrent_initializer=recurrent_initializer,
bias_initializer=bias_initializer,
dropout=recurrent_dropout,
)
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=self.recurrent_stack.output_shape[-2],
encoding_size=self.recurrent_stack.output_shape[-1], # State size
)
if self.reduce_output is not None:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.reduce_sequence.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=fc_activation,
default_dropout=fc_dropout,
)
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return StackedCNNRNNConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is not None:
return self.fc_stack.output_shape
return self.recurrent_stack.output_shape
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented)
"""
# ================ Embeddings ================
if self.should_embed:
embedded_sequence = self.embed_sequence(inputs, mask=mask)
else:
embedded_sequence = inputs
while len(embedded_sequence.shape) < 3:
embedded_sequence = embedded_sequence.unsqueeze(-1)
# shape=(?, sequence_length, embedding_size)
hidden = embedded_sequence
# ================ Conv Layers ================
hidden = self.conv1d_stack(hidden, mask=mask)
# ================ Recurrent Layers ================
hidden, final_state = self.recurrent_stack(hidden)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden, mask=mask)
# no reduction: hidden [batch_size, seq_size, state_size]
# with reduction: hidden [batch_size, output_size]
# final_state: if rnn/gru [batch_size, state_size]
# lstm ([batch_size, state_size], [batch_size, state_size])
return {ENCODER_OUTPUT: hidden, ENCODER_OUTPUT_STATE: final_state}
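# Illustrative sketch (not part of Ludwig): the `reduce_output` modes accepted by the
# sequence encoders in this file can be pictured on a single sequence of per-step vectors.
# `reduce_sequence_sketch` is a hypothetical pure-Python helper. The real `SequenceReducer`
# operates on batched tensors and may treat padding differently (for example in `last`),
# so this is only a rough picture of what each mode returns.

```python
def reduce_sequence_sketch(hidden, mode):
    """Reduce a [seq_len][hidden_size] list of lists along the sequence axis."""
    if mode is None:
        return hidden  # no reduction: keep the full [seq_len][hidden_size] sequence
    if mode == "last":
        return hidden[-1]  # last step only (padding is ignored in this sketch)
    if mode == "concat":
        return [x for step in hidden for x in step]  # flatten to seq_len * hidden_size
    columns = list(zip(*hidden))  # transpose to [hidden_size][seq_len]
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode in ("mean", "avg"):
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown reduce mode: {mode}")


steps = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(reduce_sequence_sketch(steps, "sum"))     # [9.0, 6.0]
print(reduce_sequence_sketch(steps, "last"))    # [5.0, 0.0]
print(reduce_sequence_sketch(steps, "concat"))  # [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
```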
@DeveloperAPI
@register_sequence_encoder("transformer")
@register_encoder("transformer", [SEQUENCE, TEXT, TIMESERIES])
class StackedTransformer(SequenceEncoder):
def __init__(
self,
max_sequence_length,
should_embed=True,
vocab=None,
representation="dense",
embedding_size=256,
embeddings_trainable=True,
pretrained_embeddings=None,
embeddings_on_cpu=False,
num_layers=1,
hidden_size=256,
num_heads=8,
transformer_output_size=256,
dropout=0.1,
fc_layers=None,
num_fc_layers=0,
output_size=256,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
fc_activation="relu",
fc_dropout=0,
reduce_output="last",
encoder_config=None,
**kwargs,
):
# todo: update docstring as needed
"""
:param should_embed: If True the input sequence is expected
to be made of integers and will be mapped into embeddings
:type should_embed: Boolean
:param vocab: Vocabulary of the input feature to encode
:type vocab: List
:param representation: the possible values are `dense` and `sparse`.
`dense` means the embeddings are initialized randomly,
`sparse` means they are initialized to be one-hot encodings.
:type representation: Str (one of 'dense' or 'sparse')
:param embedding_size: it is the maximum embedding size, the actual
size will be `min(vocabulary_size, embedding_size)`
for `dense` representations and exactly `vocabulary_size`
for the `sparse` encoding, where `vocabulary_size` is
the number of different strings appearing in the training set
in the column the feature is named after (plus 1 for `<UNK>`).
:type embedding_size: Integer
:param embeddings_trainable: If `True` embeddings are trained during
the training process, if `False` embeddings are fixed.
It may be useful when loading pretrained embeddings
to avoid finetuning them. This parameter has an effect only
when `representation` is `dense`, as `sparse` one-hot encodings
are not trainable.
:type embeddings_trainable: Boolean
:param pretrained_embeddings: by default `dense` embeddings
are initialized randomly, but this parameter allows specifying
a path to a file containing embeddings in the GloVe format.
When the file containing the embeddings is loaded, only the
embeddings with labels present in the vocabulary are kept,
the others are discarded. If the vocabulary contains strings
that have no match in the embeddings file, their embeddings
are initialized with the average of all other embeddings plus
some random noise to make them different from each other.
This parameter has an effect only if `representation` is `dense`.
:type pretrained_embeddings: str (filepath)
:param embeddings_on_cpu: by default embedding matrices are stored
in GPU memory if a GPU is used, as it allows
for faster access, but in some cases the embedding matrix
may be really big and this parameter forces the placement
of the embedding matrix in regular memory and the CPU is used
to resolve them, slightly slowing down the process
as a result of data transfer between CPU and GPU memory.
:param num_layers: the number of transformer layers in the stack.
:type num_layers: Integer
:param hidden_size: the size of the hidden representation inside
the transformer stack. If the embedding size differs from it,
the embedded sequence is linearly projected to `hidden_size` first.
:type hidden_size: Integer
:param num_heads: the number of attention heads in each
transformer layer.
:type num_heads: Integer
:param transformer_output_size: the output size passed to the
transformer stack (the `output_size` of `TransformerStack`).
:type transformer_output_size: Integer
:param dropout: the dropout rate applied in the embedding layer
and in the transformer stack.
:type dropout: Float
:param weights_initializer: the initializer to use for the weights.
If `None` it uses `xavier_uniform`. Options are: `constant`, `identity`,
`zeros`, `ones`, `orthogonal`, `normal`, `uniform`,
`truncated_normal`, `variance_scaling`, `xavier_normal`,
`xavier_uniform`,
`he_normal`, `he_uniform`, `lecun_normal`, `lecun_uniform`.
Alternatively it is possible to specify a dictionary with
a key `type` that identifies the type of initializer and
other keys for its parameters,
e.g. `{type: normal, mean: 0, stddev: 0}`.
To know the parameters of each initializer, please refer
to PyTorch's documentation.
:type weights_initializer: str
:param reduce_output: defines how to reduce the output tensor of
the transformer stack along the `s` sequence length
dimension if the rank of the tensor is greater than 2.
Available values are: `sum`, `mean` or `avg`, `max`, `concat`
(concatenates along the first dimension), `last` (returns
the last vector of the first dimension) and `None` or `null`
(which does not reduce and returns the full tensor).
:type reduce_output: str
"""
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.max_sequence_length = max_sequence_length
self.should_embed = should_embed
self.should_project = False
self.embed_sequence = None
if self.should_embed:
logger.debug(" EmbedSequence")
self.embed_sequence = TokenAndPositionEmbedding(
max_sequence_length=max_sequence_length,
vocab=vocab,
embedding_size=embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
# If vocab size is smaller than embedding size, embedding layer will use len(vocab) as embedding_size.
used_embedding_size = self.embed_sequence.output_shape[-1]
if used_embedding_size != hidden_size:
logger.debug(" project_to_embed_size")
self.project_to_hidden_size = nn.Linear(self.embed_sequence.output_shape[-1], hidden_size)
self.should_project = True
else:
logger.debug(" project_to_embed_size")
self.project_to_hidden_size = nn.Linear(embedding_size, hidden_size)
self.should_project = True
logger.debug(" TransformerStack")
self.transformer_stack = TransformerStack(
input_size=hidden_size,
max_sequence_length=max_sequence_length,
hidden_size=hidden_size,
num_heads=num_heads,
output_size=transformer_output_size,
num_layers=num_layers,
dropout=dropout,
)
self.reduce_output = reduce_output
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output,
max_sequence_length=self.transformer_stack.output_shape[-2],
encoding_size=self.transformer_stack.output_shape[-1], # hidden_size
)
if self.reduce_output is None:
self.supports_masking = True
else:
logger.debug(" FCStack")
self.fc_stack = FCStack(
self.reduce_sequence.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=fc_activation,
default_dropout=fc_dropout,
)
@staticmethod
def get_schema_cls() -> type[SequenceEncoderConfig]:
return StackedTransformerConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is not None:
return self.fc_stack.output_shape
return self.transformer_stack.output_shape
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
"""
:param inputs: The input sequence fed into the encoder.
Shape: [batch x sequence length], type torch.int32
:param mask: Input mask (unused, not yet implemented)
"""
# ================ Embeddings ================
if self.should_embed:
embedded_sequence = self.embed_sequence(inputs, mask=mask)
else:
embedded_sequence = inputs
while len(embedded_sequence.shape) < 3:
embedded_sequence = embedded_sequence.unsqueeze(-1)
# shape=(?, sequence_length, embedding_size)
if self.should_project:
hidden = self.project_to_hidden_size(embedded_sequence)
else:
hidden = embedded_sequence
# shape=(?, sequence_length, hidden)
# ================ Transformer Layers ================
hidden = self.transformer_stack(hidden, mask=mask)
# ================ Sequence Reduction ================
if self.reduce_output is not None:
hidden = self.reduce_sequence(hidden)
# ================ FC Layers ================
hidden = self.fc_stack(hidden, mask=mask)
return {ENCODER_OUTPUT: hidden}
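# Illustrative sketch (not part of Ludwig): how `StackedTransformer.__init__` above decides
# whether to add the `project_to_hidden_size` linear layer. `projection_plan` is a
# hypothetical helper. The real code reads the embedding width from
# `self.embed_sequence.output_shape[-1]`; the `min(vocab_size, embedding_size)` rule used
# here is taken from the `embedding_size` docstring and is an assumption about that value.

```python
def projection_plan(should_embed, vocab_size, embedding_size, hidden_size, representation="dense"):
    """Return (needs_projection, projection_in_features)."""
    if should_embed:
        if representation == "dense":
            used_embedding_size = min(vocab_size, embedding_size)
        else:  # "sparse": one-hot encodings are as wide as the vocabulary
            used_embedding_size = vocab_size
        # Project only when the embedding width differs from the transformer width.
        return used_embedding_size != hidden_size, used_embedding_size
    # Without an embedding layer the encoder always projects the raw inputs.
    return True, embedding_size


print(projection_plan(True, vocab_size=1000, embedding_size=256, hidden_size=256))  # (False, 256)
print(projection_plan(True, vocab_size=100, embedding_size=256, hidden_size=256))   # (True, 100)
print(projection_plan(False, vocab_size=0, embedding_size=64, hidden_size=256))     # (True, 64)
```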
================================================
FILE: ludwig/encoders/set_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from typing import Any
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, SET
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.embedding_modules import EmbedSet
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.set_encoders import SetSparseEncoderConfig
logger = logging.getLogger(__name__)
@DeveloperAPI
@register_encoder("embed", SET)
class SetSparseEncoder(Encoder):
def __init__(
self,
vocab: list[str],
representation: str = "dense",
embedding_size: int = 50,
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
embeddings_on_cpu: bool = False,
fc_layers=None,
num_fc_layers: int = 0,
output_size: int = 10,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict[str, Any] | None = None,
activation: str = "relu",
dropout: float = 0.0,
encoder_config=None,
**kwargs,
):
super().__init__()
self.config = encoder_config
logger.debug(f" {self.name}")
self.vocab_size = len(vocab)
logger.debug(" Embed")
self.embed = EmbedSet(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=weights_initializer,
)
logger.debug(" FCStack")
# TODO(shreya): Make sure this is updated when FCStack is updated
self.fc_stack = FCStack(
first_layer_input_size=self.embed.output_shape[-1],
layers=fc_layers,
num_layers=num_fc_layers,
default_output_size=output_size,
default_use_bias=use_bias,
default_weights_initializer=weights_initializer,
default_bias_initializer=bias_initializer,
default_norm=norm,
default_norm_params=norm_params,
default_activation=activation,
default_dropout=dropout,
)
def forward(self, inputs: torch.Tensor) -> EncoderOutputDict:
"""
Params:
inputs: The inputs fed into the encoder.
Shape: [batch x vocab_size], type torch.int32.
Returns:
Embeddings of shape [batch x vocab_size x embed size], type float32.
"""
hidden = self.embed(inputs)
hidden = self.fc_stack(hidden)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return SetSparseEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.vocab_size])
@property
def output_shape(self) -> torch.Size:
return self.fc_stack.output_shape
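# Illustrative sketch (not part of Ludwig): `SetSparseEncoder.forward` expects inputs of
# shape [batch x vocab_size]. One way to picture such a row is a multi-hot vector with a 1
# at the index of every vocabulary item present in the set. `multi_hot` is a hypothetical
# helper, and both the multi-hot layout and the handling of unknown tokens (silently
# skipped here) are assumptions, not a description of Ludwig's preprocessing.

```python
def multi_hot(tokens, vocab):
    """Encode a set of tokens as a 0/1 vector of length len(vocab)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    row = [0] * len(vocab)
    for tok in tokens:
        if tok in index:
            row[index[tok]] = 1
    return row


vocab = ["<UNK>", "red", "green", "blue"]
batch = [multi_hot({"red", "blue"}, vocab), multi_hot({"green"}, vocab)]
print(batch)  # [[0, 1, 0, 1], [0, 0, 1, 0]]
```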
================================================
FILE: ludwig/encoders/text_encoders.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import inspect
import logging
from collections.abc import Callable
from typing import Any, TYPE_CHECKING, TypeVar
import numpy as np
import torch
from torch import nn
from transformers import AutoConfig
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, TEXT
from ludwig.encoders.base import Encoder
from ludwig.encoders.registry import register_encoder
from ludwig.encoders.types import EncoderOutputDict
from ludwig.modules.reduction_modules import SequenceReducer
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.sequence_encoders import SequenceEncoderConfig
from ludwig.schema.encoders.text_encoders import (
ALBERTConfig,
AutoTransformerConfig,
BERTConfig,
CamemBERTConfig,
CTRLConfig,
DebertaV2Config,
DistilBERTConfig,
ELECTRAConfig,
FlauBERTConfig,
GPT2Config,
GPTConfig,
LLMEncoderConfig,
LongformerConfig,
MT5Config,
RoBERTaConfig,
T5Config,
TfIdfEncoderConfig,
TransformerXLConfig,
XLMConfig,
XLMRoBERTaConfig,
XLNetConfig,
)
from ludwig.schema.llms.peft import adapter_registry, BaseAdapterConfig
from ludwig.utils.data_utils import clear_data_cache
from ludwig.utils.hf_utils import load_pretrained_hf_model_with_hub_fallback
from ludwig.utils.llm_utils import get_context_len, initialize_adapter, load_pretrained_from_config
from ludwig.utils.tokenizers import HFTokenizer
from ludwig.utils.torch_utils import FreezeModule
if TYPE_CHECKING:
from transformers import PretrainedConfig, PreTrainedModel
from ludwig.schema.encoders.text_encoders import HFEncoderConfig
logger = logging.getLogger(__name__)
def _cls_pooled_error_message(encoder: str):
# TODO(Arnav): Remove this once we have reduce_output options set for
# each encoder type in the schema
raise ValueError(f"reduce_output cannot be cls_pooled for {encoder}")
class HFTextEncoder(Encoder):
def _init_config(self, transformer, schema_keys: list[str], encoder_config: SequenceEncoderConfig):
"""Creates a config object for the encoder using the transformer model and the passed-in encoder config.
The transformer's config is only known after it is instantiated, so we must update the
encoder config with the values from the transformer config.
Args:
transformer: The transformer model.
schema_keys: The keys in the encoder config schema. We only want to update the encoder config
with the values from the transformer config that are in the schema.
encoder_config: The existing encoder config containing defaults and user-specified values.
If the values in this config differ from the transformer's config, the transformer's config
values will override this config's values.
Returns:
A new encoder config object with the updated values from the transformer config.
"""
transformer_config = transformer.config.to_dict()
final_hf_config_params = {k: v for k, v in transformer_config.items() if k in schema_keys}
encoder_config_dict = encoder_config.to_dict()
encoder_config_dict.update(final_hf_config_params)
return self.get_schema_cls().from_dict(encoder_config_dict)
def _init_transformer_from_scratch(
self, hf_model_cls: type, hf_config_cls: type, hf_config_params: dict[str, Any], vocab_size: int
):
"""Initializes the transformer model from scratch. This is in contrast to loading a pre-trained model.
Args:
hf_model_cls: The HuggingFace model class.
hf_config_cls: The HuggingFace config class.
hf_config_params: The HuggingFace config parameters exposed through the Ludwig schema.
vocab_size: The vocab size of the dataset. Because we are training from scratch, we can resize the
token embeddings table freely.
Returns:
The transformer model.
"""
config = hf_config_cls(**hf_config_params)
transformer = hf_model_cls(config)
self._maybe_resize_token_embeddings(transformer, vocab_size)
return transformer
def _maybe_resize_token_embeddings(self, transformer, vocab_size: int) -> None:
"""Resizes the token embeddings if the vocab size is different from the transformer's vocab size.
This should only happen if we are instantiating a model from scratch (i.e. not loading from a pretrained model
or checkpoint). Pretrained models update the vocab size stored in the config. This means if we are loading a
pretrained model from a checkpoint, the config vocab size should match the model's vocab size.
It is important that pretrained models update the vocab size stored in the config because sometimes the
pretrained models will have an embeddings table that is a different size than the vocab size. Examples:
CamemBERT: https://github.com/huggingface/tokenizers/issues/900#issue-1122256698
T5: https://github.com/huggingface/transformers/issues/4875#issue-635471552
Args:
transformer: The transformer model.
vocab_size: The vocab size of the dataset.
"""
if vocab_size != transformer.config.vocab_size:
transformer.resize_token_embeddings(vocab_size)
def _wrap_transformer(
self, transformer: nn.Module, adapter: BaseAdapterConfig | dict | None, trainable: bool
) -> nn.Module:
if adapter is not None:
from peft import get_peft_model
if isinstance(adapter, dict):
adapter_cls = adapter_registry[adapter["type"]]
adapter = adapter_cls.model_validate(adapter)
peft_config = adapter.to_config()
transformer = get_peft_model(transformer, peft_config)
logger.info("==================================================")
logger.info("Trainable Parameter Summary For Fine-Tuning:")
transformer.print_trainable_parameters()
logger.info("==================================================")
return FreezeModule(transformer, frozen=not trainable)
def get_embedding_layer(self) -> nn.Module:
return next(self.transformer.module.children())
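# Illustrative sketch (not part of Ludwig): the override rule described in the
# `_init_config` docstring above, shown with plain dictionaries. `merge_hf_config` and the
# sample values are hypothetical; the real method converts config objects with `to_dict()`
# and rebuilds the schema class with `from_dict()`.

```python
def merge_hf_config(encoder_config, transformer_config, schema_keys):
    """Overlay transformer config values onto the encoder config for schema keys only."""
    overrides = {k: v for k, v in transformer_config.items() if k in schema_keys}
    merged = dict(encoder_config)  # copy so the caller's dict is left untouched
    merged.update(overrides)       # the transformer's values win on conflicts
    return merged


encoder_cfg = {"type": "bert", "hidden_size": 768, "vocab_size": 30522, "trainable": False}
transformer_cfg = {"hidden_size": 1024, "vocab_size": 30522, "model_type": "bert"}
merged = merge_hf_config(encoder_cfg, transformer_cfg, {"hidden_size", "vocab_size"})
print(merged)
```

# Keys outside `schema_keys` (here `model_type`) never reach the encoder config, and
# encoder-only keys (here `trainable`) are preserved.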
HFModelT = TypeVar("HFModelT", bound="PreTrainedModel")
HFConfigT = TypeVar("HFConfigT", bound="PretrainedConfig")
ConfigT = TypeVar("ConfigT", bound="HFEncoderConfig")
class HFTextEncoderImpl(HFTextEncoder):
def __init__(
self,
model_cls: type[HFModelT],
config_cls: type[HFConfigT],
schema_cls: type[ConfigT],
max_sequence_length: int,
use_pretrained: bool,
pretrained_model_name_or_path: str,
saved_weights_in_checkpoint: bool,
reduce_output: str,
trainable: bool,
adapter: BaseAdapterConfig | None,
pretrained_kwargs: dict,
encoder_config: ConfigT | None,
**kwargs,
):
super().__init__()
# TODO(travis): get_hf_config_param_names should be implemented as abstract in HFEncoderConfig
vocab_size = kwargs["vocab_size"]
hf_config_params = {k: v for k, v in kwargs.items() if k in schema_cls.get_hf_config_param_names()}
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
model_cls, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(model_cls, config_cls, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs["pooler_output"]
else:
hidden = transformer_outputs["last_hidden_state"][:, 1:-1, :] # bos + [sent] + sep
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length - 2, self.transformer.module.config.hidden_size])
if self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
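# Illustrative sketch (not part of Ludwig): the `output_shape` rule shared by the
# HuggingFace-backed encoders in this file, written with plain tuples instead of
# `torch.Size`. `hf_output_shape` is a hypothetical helper; the `- 2` mirrors the
# `[:, 1:-1, :]` slice in `forward`, which drops the first and last positions.

```python
def hf_output_shape(reduce_output, max_sequence_length, hidden_size):
    """Return the per-example output shape for a given reduction mode."""
    usable_length = max_sequence_length - 2  # first and last positions are dropped
    if reduce_output is None:
        return (usable_length, hidden_size)   # full sequence of hidden states
    if reduce_output == "concat":
        return (usable_length * hidden_size,)  # all steps flattened into one vector
    return (hidden_size,)                      # sum, mean, max, last, cls_pooled, ...


print(hf_output_shape(None, 128, 768))          # (126, 768)
print(hf_output_shape("concat", 128, 768))      # (96768,)
print(hf_output_shape("cls_pooled", 128, 768))  # (768,)
```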
@DeveloperAPI
@register_encoder("albert", TEXT)
class ALBERTEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "albert-base-v2"
def __init__(
self,
max_sequence_length,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
reduce_output: str = "cls_pooled",
vocab_size: int = 30000,
embedding_size: int = 128,
hidden_size: int = 4096,
num_hidden_layers: int = 12,
num_hidden_groups: int = 1,
num_attention_heads: int = 64,
intermediate_size: int = 16384,
inner_group_num: int = 1,
hidden_act: str = "gelu_new",
hidden_dropout_prob: float = 0,
attention_probs_dropout_prob: float = 0,
max_position_embeddings: int = 512,
type_vocab_size: int = 2,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-12,
classifier_dropout_prob: float = 0.1,
position_embedding_type: str = "absolute",
pad_token_id: int = 0,
bos_token_id: int = 2,
eos_token_id: int = 3,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import AlbertConfig, AlbertModel
hf_config_params = dict(
vocab_size=vocab_size,
embedding_size=embedding_size,
hidden_size=hidden_size,
num_hidden_layers=num_hidden_layers,
num_hidden_groups=num_hidden_groups,
num_attention_heads=num_attention_heads,
intermediate_size=intermediate_size,
inner_group_num=inner_group_num,
hidden_act=hidden_act,
hidden_dropout_prob=hidden_dropout_prob,
attention_probs_dropout_prob=attention_probs_dropout_prob,
max_position_embeddings=max_position_embeddings,
type_vocab_size=type_vocab_size,
initializer_range=initializer_range,
layer_norm_eps=layer_norm_eps,
classifier_dropout_prob=classifier_dropout_prob,
position_embedding_type=position_embedding_type,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
AlbertModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(AlbertModel, AlbertConfig, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs[1]
else:
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return ALBERTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 to remove the CLS and SEP tokens added by the ALBERT tokenizer.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("mt5", TEXT)
class MT5Encoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "google/mt5-base"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
reduce_output: str = "sum",
vocab_size: int = 250112,
d_model: int = 512,
d_kv: int = 64,
d_ff: int = 1024,
num_layers: int = 8,
num_decoder_layers: int | None = None,
num_heads: int = 6,
relative_attention_num_buckets: int = 32,
dropout_rate: float = 0.1,
layer_norm_epsilon: float = 1e-06,
initializer_factor: float = 1.0,
feed_forward_proj: str = "gated-gelu",
is_encoder_decoder: bool = True,
use_cache: bool = True,
tokenizer_class: str = "T5Tokenizer",
tie_word_embeddings: bool = False,
pad_token_id: int = 0,
eos_token_id: int = 1,
decoder_start_token_id: int = 0,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import MT5Config, MT5EncoderModel
hf_config_params = dict(
vocab_size=vocab_size,
d_model=d_model,
d_kv=d_kv,
d_ff=d_ff,
num_layers=num_layers,
num_decoder_layers=num_decoder_layers,
num_heads=num_heads,
relative_attention_num_buckets=relative_attention_num_buckets,
dropout_rate=dropout_rate,
layer_norm_epsilon=layer_norm_epsilon,
initializer_factor=initializer_factor,
feed_forward_proj=feed_forward_proj,
is_encoder_decoder=is_encoder_decoder,
use_cache=use_cache,
tokenizer_class=tokenizer_class,
tie_word_embeddings=tie_word_embeddings,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
decoder_start_token_id=decoder_start_token_id,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
MT5EncoderModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(MT5EncoderModel, MT5Config, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
)
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return MT5Config
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 for the first and last positions dropped in forward().
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("xlmroberta", TEXT)
class XLMRoBERTaEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "xlm-roberta-base"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "cls_pooled",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int | None = None,
pad_token_id: int = 1,
bos_token_id: int = 0,
eos_token_id: int = 2,
max_position_embeddings: int = 514,
type_vocab_size: int = 1,
add_pooling_layer: bool = True,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import XLMRobertaConfig, XLMRobertaModel
hf_config_params = dict(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
max_position_embeddings=max_position_embeddings,
type_vocab_size=type_vocab_size,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
XLMRobertaModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(
XLMRobertaModel, XLMRobertaConfig, hf_config_params, vocab_size
)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs[1]
else:
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return XLMRoBERTaConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 for the first and last positions (<s> and </s>) that forward() slices off.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
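# Sketch, not part of Ludwig: the output_shape properties in this module all
# follow the same arithmetic. The hypothetical helper below restates it with
# plain tuples instead of torch.Size. `strips_special_tokens` is an assumed flag
# that stands for whether forward() applies the [:, 1:-1, :] slice.

```python
def expected_output_shape(max_sequence_length, hidden_size, reduce_output, strips_special_tokens):
    """Return the per-example output shape as a tuple (no batch dimension)."""
    # Encoders that slice hidden[:, 1:-1, :] lose two sequence positions.
    seq_len = max_sequence_length - 2 if strips_special_tokens else max_sequence_length
    if reduce_output is None:
        # No reduction: one hidden vector per remaining position.
        return (seq_len, hidden_size)
    if reduce_output == "concat":
        # Concatenation flattens positions and features into one vector.
        return (seq_len * hidden_size,)
    # Any other reduction (sum, mean, cls_pooled, ...) yields one hidden vector.
    return (hidden_size,)


print(expected_output_shape(128, 768, None, True))
print(expected_output_shape(128, 768, "concat", True))
print(expected_output_shape(128, 768, "cls_pooled", True))
```

# Comparing this helper against an encoder's forward() output is a quick way to
# spot a shape property that subtracts 2 where no slice is applied.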
@DeveloperAPI
@register_encoder("bert", TEXT)
class BERTEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "bert-base-uncased"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
reduce_output: str = "cls_pooled",
vocab_size: int = 30522,
hidden_size: int = 768,
num_hidden_layers: int = 12,
num_attention_heads: int = 12,
intermediate_size: int = 3072,
hidden_act: str | Callable = "gelu",
hidden_dropout_prob: float = 0.1,
attention_probs_dropout_prob: float = 0.1,
max_position_embeddings: int = 512,
type_vocab_size: int = 2,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-12,
pad_token_id: int = 0,
gradient_checkpointing: bool = False,
position_embedding_type: str = "absolute",
classifier_dropout: float | None = None,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import BertConfig, BertModel
hf_config_params = dict(
vocab_size=vocab_size,
hidden_size=hidden_size,
num_hidden_layers=num_hidden_layers,
num_attention_heads=num_attention_heads,
intermediate_size=intermediate_size,
hidden_act=hidden_act,
hidden_dropout_prob=hidden_dropout_prob,
attention_probs_dropout_prob=attention_probs_dropout_prob,
max_position_embeddings=max_position_embeddings,
type_vocab_size=type_vocab_size,
initializer_range=initializer_range,
layer_norm_eps=layer_norm_eps,
pad_token_id=pad_token_id,
gradient_checkpointing=gradient_checkpointing,
position_embedding_type=position_embedding_type,
classifier_dropout=classifier_dropout,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
BertModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(BertModel, BertConfig, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs[1]
else:
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return BERTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
# TODO(shreya): Confirm that this is it
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 for the [CLS] and [SEP] positions that forward() slices off.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
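# Sketch, not part of Ludwig: each encoder above picks its weight source with the
# same two-flag test. The hypothetical function below isolates that decision.
# The reading that saved_weights_in_checkpoint=True means "build the architecture
# only, weights are restored later from a checkpoint" is an assumption based on
# the flag name, not something this file states.

```python
def weight_source(use_pretrained: bool, saved_weights_in_checkpoint: bool) -> str:
    """Mirror the `use_pretrained and not saved_weights_in_checkpoint` branch."""
    if use_pretrained and not saved_weights_in_checkpoint:
        # Corresponds to the load_pretrained_hf_model_with_hub_fallback(...) branch.
        return "pretrained"
    # Corresponds to the _init_transformer_from_scratch(...) branch.
    return "from_scratch"


for flags in [(True, False), (True, True), (False, False), (False, True)]:
    print(flags, "->", weight_source(*flags))
```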
@DeveloperAPI
@register_encoder("xlm", TEXT)
class XLMEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "xlm-mlm-en-2048"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
reduce_output: str = "sum",
vocab_size: int = 30145,
emb_dim: int = 2048,
n_layers: int = 12,
n_heads: int = 16,
dropout: float = 0.1,
attention_dropout: float = 0.1,
gelu_activation: bool = True,
sinusoidal_embeddings: bool = False,
causal: bool = False,
asm: bool = False,
n_langs: int = 1,
use_lang_emb: bool = True,
max_position_embeddings: int = 512,
embed_init_std: float = 2048**-0.5,
layer_norm_eps: float = 1e-12,
init_std: float = 0.02,
bos_index: int = 0,
eos_index: int = 1,
pad_index: int = 2,
unk_index: int = 3,
mask_index: int = 5,
is_encoder: bool = True,
start_n_top: int = 5,
end_n_top: int = 5,
mask_token_id: int = 0,
lang_id: int = 0,
pad_token_id: int = 2,
bos_token_id: int = 0,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import XLMConfig, XLMModel
hf_config_params = dict(
vocab_size=vocab_size,
emb_dim=emb_dim,
n_layers=n_layers,
n_heads=n_heads,
dropout=dropout,
attention_dropout=attention_dropout,
gelu_activation=gelu_activation,
sinusoidal_embeddings=sinusoidal_embeddings,
causal=causal,
asm=asm,
n_langs=n_langs,
use_lang_emb=use_lang_emb,
max_position_embeddings=max_position_embeddings,
embed_init_std=embed_init_std,
layer_norm_eps=layer_norm_eps,
init_std=init_std,
bos_index=bos_index,
eos_index=eos_index,
pad_index=pad_index,
unk_index=unk_index,
mask_index=mask_index,
is_encoder=is_encoder,
start_n_top=start_n_top,
end_n_top=end_n_top,
mask_token_id=mask_token_id,
lang_id=lang_id,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
XLMModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(XLMModel, XLMConfig, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return XLMConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
# TODO(shreya): Confirm that this is it
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# NOTE: forward() returns every position without slicing, so this -2 looks inconsistent; verify before relying on it.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("gpt", TEXT)
class GPTEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "openai-gpt"
def __init__(
self,
max_sequence_length: int,
reduce_output: str = "sum",
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 40478,
n_positions: int = 512,
n_ctx: int = 512,
n_embd: int = 768,
n_layer: int = 12,
n_head: int = 12,
afn: str = "gelu",
resid_pdrop: float = 0.1,
embd_pdrop: float = 0.1,
attn_pdrop: float = 0.1,
layer_norm_epsilon: float = 1e-5,
initializer_range: float = 0.02,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import OpenAIGPTConfig, OpenAIGPTModel
hf_config_params = dict(
vocab_size=vocab_size,
n_positions=n_positions,
n_ctx=n_ctx,
n_embd=n_embd,
n_layer=n_layer,
n_head=n_head,
afn=afn,
resid_pdrop=resid_pdrop,
embd_pdrop=embd_pdrop,
attn_pdrop=attn_pdrop,
layer_norm_epsilon=layer_norm_epsilon,
initializer_range=initializer_range,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
OpenAIGPTModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(
OpenAIGPTModel, OpenAIGPTConfig, hf_config_params, vocab_size
)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return GPTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length, self.transformer.module.config.hidden_size])
elif self.reduce_output == "concat":
return torch.Size([self.transformer.module.config.hidden_size * self.max_sequence_length])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
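# Sketch, not part of Ludwig: encoders such as GPTEncoder above call
# _cls_pooled_error_message when reduce_output is "cls_pooled", because their
# forward() only reads transformer_outputs[0]. That helper's body is not shown
# in this section, so the version below is an assumption about its intent:
# reject the option early instead of failing later inside forward().

```python
def check_reduce_output(encoder_name: str, reduce_output, has_pooler: bool):
    """Return reduce_output unchanged, or raise if cls_pooled is unsupported."""
    if reduce_output == "cls_pooled" and not has_pooler:
        raise ValueError(
            f"reduce_output='cls_pooled' is not supported by {encoder_name}; "
            "choose a sequence reduction such as 'sum' instead."
        )
    return reduce_output


print(check_reduce_output("BERTEncoder", "cls_pooled", has_pooler=True))
print(check_reduce_output("GPTEncoder", "sum", has_pooler=False))
```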
@DeveloperAPI
@register_encoder("gpt2", TEXT)
class GPT2Encoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "gpt2"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 50257,
n_positions: int = 1024,
n_ctx: int = 1024,
n_embd: int = 768,
n_layer: int = 12,
n_head: int = 12,
n_inner: int | None = None,
activation_function: str = "gelu",
resid_pdrop: float = 0.1,
embd_pdrop: float = 0.1,
attn_pdrop: float = 0.1,
layer_norm_epsilon: float = 1e-5,
initializer_range: float = 0.02,
scale_attn_weights: bool = True,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import GPT2Config, GPT2Model
hf_config_params = dict(
vocab_size=vocab_size,
n_positions=n_positions,
n_ctx=n_ctx,
n_embd=n_embd,
n_layer=n_layer,
n_head=n_head,
n_inner=n_inner,
activation_function=activation_function,
resid_pdrop=resid_pdrop,
embd_pdrop=embd_pdrop,
attn_pdrop=attn_pdrop,
layer_norm_epsilon=layer_norm_epsilon,
initializer_range=initializer_range,
scale_attn_weights=scale_attn_weights,
)
if use_pretrained:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
GPT2Model, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(GPT2Model, GPT2Config, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return GPT2Config
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length, self.transformer.module.config.hidden_size])
elif self.reduce_output == "concat":
return torch.Size([self.transformer.module.config.hidden_size * self.max_sequence_length])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("deberta", TEXT)
class DeBERTaEncoder(HFTextEncoderImpl):
def __init__(self, *args, **kwargs):
from transformers import DebertaV2Config as _DebertaV2Config
from transformers import DebertaV2Model
super().__init__(DebertaV2Model, _DebertaV2Config, DebertaV2Config, *args, **kwargs)
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return DebertaV2Config
@DeveloperAPI
@register_encoder("roberta", TEXT)
class RoBERTaEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "roberta-base"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "cls_pooled",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int | None = None,
pad_token_id: int = 1,
bos_token_id: int = 0,
eos_token_id: int = 2,
max_position_embeddings: int = 514,
type_vocab_size: int = 1,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import RobertaConfig, RobertaModel
hf_config_params = dict(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
max_position_embeddings=max_position_embeddings,
type_vocab_size=type_vocab_size,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
RobertaModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(RobertaModel, RobertaConfig, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs[1]
else:
hidden = transformer_outputs[0][:, 1:-1, :] # bos + [sent] + sep
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return RoBERTaConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length - 2, self.transformer.module.config.hidden_size])
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("transformer_xl", TEXT)
class TransformerXLEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "transfo-xl-wt103"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 267735,
cutoffs: list[int] = [20000, 40000, 200000],
d_model: int = 1024,
d_embed: int = 1024,
n_head: int = 16,
d_head: int = 64,
d_inner: int = 4096,
div_val: int = 4,
pre_lnorm: bool = False,
n_layer: int = 18,
mem_len: int = 1600,
clamp_len: int = 1000,
same_length: bool = True,
proj_share_all_but_first: bool = True,
attn_type: int = 0,
sample_softmax: int = -1,
adaptive: bool = True,
dropout: float = 0.1,
dropatt: float = 0.0,
untie_r: bool = True,
init: str = "normal",
init_range: float = 0.01,
proj_init_std: float = 0.01,
init_std: float = 0.02,
layer_norm_epsilon: float = 1e-5,
eos_token_id: int = 0,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import TransfoXLConfig, TransfoXLModel
hf_config_params = dict(
vocab_size=vocab_size,
cutoffs=cutoffs,
d_model=d_model,
d_embed=d_embed,
n_head=n_head,
d_head=d_head,
d_inner=d_inner,
div_val=div_val,
pre_lnorm=pre_lnorm,
n_layer=n_layer,
mem_len=mem_len,
clamp_len=clamp_len,
same_length=same_length,
proj_share_all_but_first=proj_share_all_but_first,
attn_type=attn_type,
sample_softmax=sample_softmax,
adaptive=adaptive,
dropout=dropout,
dropatt=dropatt,
untie_r=untie_r,
init=init,
init_range=init_range,
proj_init_std=proj_init_std,
init_std=init_std,
layer_norm_epsilon=layer_norm_epsilon,
eos_token_id=eos_token_id,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
TransfoXLModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
config = TransfoXLConfig(**hf_config_params)
transformer = TransfoXLModel(config)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
transformer_outputs = self.transformer.module(inputs)
hidden = transformer_outputs[0]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TransformerXLConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length, self.transformer.module.config.d_model])
elif self.reduce_output == "concat":
return torch.Size([self.transformer.module.config.d_model * self.max_sequence_length])
return torch.Size([self.transformer.module.config.d_model])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
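# Sketch, not part of Ludwig: TransformerXLEncoder.__init__ above declares
# `cutoffs` with a list literal as its default. Python evaluates such a default
# once, at function definition time, so every call shares the same list object.
# A common alternative is a None default plus a fresh copy per call. The names
# DEFAULT_CUTOFFS and build_config_params are hypothetical.

```python
DEFAULT_CUTOFFS = (20000, 40000, 200000)  # immutable, safe to share


def build_config_params(cutoffs=None):
    """Return a params dict whose `cutoffs` list is never shared between calls."""
    # Copy in both cases so callers cannot mutate the default or each other's list.
    cutoffs = list(DEFAULT_CUTOFFS) if cutoffs is None else list(cutoffs)
    return {"cutoffs": cutoffs}


first = build_config_params()
first["cutoffs"].append(999999)  # mutate the first result
second = build_config_params()   # the default is unaffected
print(second["cutoffs"])
```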
@DeveloperAPI
@register_encoder("xlnet", TEXT)
class XLNetEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "xlnet-base-cased"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 32000,
d_model: int = 1024,
n_layer: int = 24,
n_head: int = 16,
d_inner: int = 4096,
ff_activation: str = "gelu",
untie_r: bool = True,
attn_type: str = "bi",
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-12,
dropout: float = 0.1,
mem_len: int | None = 512,
reuse_len: int | None = None,
use_mems_eval: bool = True,
use_mems_train: bool = False,
bi_data: bool = False,
clamp_len: int = -1,
same_length: bool = False,
summary_type: str = "last",
summary_use_proj: bool = True,
summary_activation: str = "tanh",
summary_last_dropout: float = 0.1,
start_n_top: int = 5,
end_n_top: int = 5,
pad_token_id: int = 5,
bos_token_id: int = 1,
eos_token_id: int = 2,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import XLNetConfig, XLNetModel
hf_config_params = dict(
vocab_size=vocab_size,
d_model=d_model,
n_layer=n_layer,
n_head=n_head,
d_inner=d_inner,
ff_activation=ff_activation,
untie_r=untie_r,
attn_type=attn_type,
initializer_range=initializer_range,
layer_norm_eps=layer_norm_eps,
dropout=dropout,
mem_len=mem_len,
reuse_len=reuse_len,
use_mems_eval=use_mems_eval,
use_mems_train=use_mems_train,
bi_data=bi_data,
clamp_len=clamp_len,
same_length=same_length,
summary_type=summary_type,
summary_use_proj=summary_use_proj,
summary_activation=summary_activation,
summary_last_dropout=summary_last_dropout,
start_n_top=start_n_top,
end_n_top=end_n_top,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
XLNetModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(XLNetModel, XLNetConfig, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return XLNetConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length, self.transformer.module.config.d_model])
elif self.reduce_output == "concat":
return torch.Size([self.transformer.module.config.d_model * self.max_sequence_length])
return torch.Size([self.transformer.module.config.d_model])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("distilbert", TEXT)
class DistilBERTEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "distilbert-base-uncased"
def __init__(
self,
max_sequence_length: int,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
use_pretrained: bool = True,
vocab_size: int = 30522,
max_position_embeddings: int = 512,
sinusoidal_pos_embds: bool = False,
n_layers: int = 6,
n_heads: int = 12,
dim: int = 768,
hidden_dim: int = 3072,
dropout: float = 0.1,
attention_dropout: float = 0.1,
activation: str | Callable = "gelu",
initializer_range: float = 0.02,
qa_dropout: float = 0.1,
seq_classif_dropout: float = 0.2,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import DistilBertConfig, DistilBertModel
hf_config_params = dict(
vocab_size=vocab_size,
max_position_embeddings=max_position_embeddings,
sinusoidal_pos_embds=sinusoidal_pos_embds,
n_layers=n_layers,
n_heads=n_heads,
dim=dim,
hidden_dim=hidden_dim,
dropout=dropout,
attention_dropout=attention_dropout,
activation=activation,
initializer_range=initializer_range,
qa_dropout=qa_dropout,
seq_classif_dropout=seq_classif_dropout,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
DistilBertModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(
DistilBertModel, DistilBertConfig, hf_config_params, vocab_size
)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.max_sequence_length = max_sequence_length
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.last_inputs = None
self.last_hidden = None
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
)
hidden = transformer_outputs[0][:, 1:-1, :]
self.last_inputs = inputs
self.last_hidden = hidden
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return DistilBERTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 for the [CLS] and [SEP] positions that forward() slices off.
return torch.Size([self.max_sequence_length - 2, self.transformer.module.config.dim])
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.dim * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.dim])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
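# Sketch, not part of Ludwig: DistilBERTEncoder.forward above keeps
# transformer_outputs[0][:, 1:-1, :], which drops the first and last sequence
# positions of every example. The hypothetical helper below applies the same
# slice to nested Python lists so the effect is easy to inspect without torch.
# The token strings are illustrative only.

```python
def strip_edge_positions(batch):
    """Drop the first and last position of each sequence, like hidden[:, 1:-1, :]."""
    return [sequence[1:-1] for sequence in batch]


batch = [
    ["[CLS]", "hello", "world", "[SEP]"],
    ["[CLS]", "hi", "[SEP]", "[PAD]"],  # a shorter, right-padded example
]
print(strip_edge_positions(batch))
```

# In the second example the slice removes the trailing [PAD] rather than [SEP],
# assuming right padding: the slice is positional and does not look at token ids.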
@DeveloperAPI
@register_encoder("ctrl", TEXT)
class CTRLEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "ctrl"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 246534,
n_positions: int = 256,
n_ctx: int = 256,
n_embd: int = 1280,
dff: int = 8192,
n_layer: int = 48,
n_head: int = 16,
resid_pdrop: float = 0.1,
embd_pdrop: float = 0.1,
attn_pdrop: float = 0.1,
layer_norm_epsilon: float = 1e-6,
initializer_range: float = 0.02,
pretrained_kwargs: dict | None = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import CTRLConfig, CTRLModel
hf_config_params = dict(
vocab_size=vocab_size,
n_positions=n_positions,
n_ctx=n_ctx,
n_embd=n_embd,
dff=dff,
n_layer=n_layer,
n_head=n_head,
resid_pdrop=resid_pdrop,
embd_pdrop=embd_pdrop,
attn_pdrop=attn_pdrop,
layer_norm_epsilon=layer_norm_epsilon,
initializer_range=initializer_range,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
CTRLModel, pretrained_model_name_or_path, **pretrained_kwargs
)
self.vocab_size = transformer.config.vocab_size
else:
transformer = self._init_transformer_from_scratch(CTRLModel, CTRLConfig, hf_config_params, vocab_size)
self.vocab_size = vocab_size
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.max_sequence_length = max_sequence_length
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return CTRLConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
return torch.Size([self.max_sequence_length, self.transformer.module.config.n_embd])
elif self.reduce_output == "concat":
return torch.Size([self.transformer.module.config.n_embd * self.max_sequence_length])
return torch.Size([self.transformer.module.config.n_embd])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("camembert", TEXT)
class CamemBERTEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "camembert-base"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "cls_pooled",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 30522,
hidden_size: int = 768,
num_hidden_layers: int = 12,
num_attention_heads: int = 12,
intermediate_size: int = 3072,
hidden_act: str | Callable = "gelu",
hidden_dropout_prob: float = 0.1,
attention_probs_dropout_prob: float = 0.1,
max_position_embeddings: int = 512,
type_vocab_size: int = 2,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-12,
pad_token_id: int = 0,
gradient_checkpointing: bool = False,
position_embedding_type: str = "absolute",
classifier_dropout: float | None = None,
pretrained_kwargs: dict = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import CamembertConfig, CamembertModel
hf_config_params = dict(
vocab_size=vocab_size,
hidden_size=hidden_size,
num_hidden_layers=num_hidden_layers,
num_attention_heads=num_attention_heads,
intermediate_size=intermediate_size,
hidden_act=hidden_act,
hidden_dropout_prob=hidden_dropout_prob,
attention_probs_dropout_prob=attention_probs_dropout_prob,
max_position_embeddings=max_position_embeddings,
type_vocab_size=type_vocab_size,
initializer_range=initializer_range,
layer_norm_eps=layer_norm_eps,
pad_token_id=pad_token_id,
gradient_checkpointing=gradient_checkpointing,
position_embedding_type=position_embedding_type,
classifier_dropout=classifier_dropout,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
CamembertModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(
CamembertModel, CamembertConfig, hf_config_params, vocab_size
)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs[1]
else:
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return CamemBERTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 to remove CLS and PAD tokens added by BERT tokenizer.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("t5", TEXT)
class T5Encoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "t5-small"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 32128,
d_model: int = 512,
d_kv: int = 64,
d_ff: int = 2048,
num_layers: int = 6,
num_decoder_layers: int | None = None,
num_heads: int = 8,
relative_attention_num_buckets: int = 32,
dropout_rate: float = 0.1,
layer_norm_eps: float = 1e-6,
initializer_factor: float = 1,
feed_forward_proj: str = "relu",
pretrained_kwargs: dict = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import T5Config, T5Model
hf_config_params = dict(
vocab_size=vocab_size,
d_model=d_model,
d_kv=d_kv,
d_ff=d_ff,
num_layers=num_layers,
num_decoder_layers=num_decoder_layers,
num_heads=num_heads,
relative_attention_num_buckets=relative_attention_num_buckets,
dropout_rate=dropout_rate,
layer_norm_eps=layer_norm_eps,
initializer_factor=initializer_factor,
feed_forward_proj=feed_forward_proj,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
T5Model, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(T5Model, T5Config, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
inputs,
decoder_input_ids=inputs,
attention_mask=mask,
)
hidden = transformer_outputs[0][:, 0:-1, :] # [eos token]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return T5Config
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 1 to remove EOS token added by T5 tokenizer.
return torch.Size(
[
self.max_sequence_length - 1,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 1 to account for the EOS token.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 1)])
return torch.Size([self.transformer.module.config.d_model])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("flaubert", TEXT)
class FlauBERTEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "flaubert/flaubert_small_cased"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 30145,
pre_norm: bool = False,
layerdrop: float = 0.0,
emb_dim: int = 2048,
n_layers: int = 12,
n_heads: int = 16,
dropout: float = 0.1,
attention_dropout: float = 0.1,
gelu_activation: bool = True,
sinusoidal_embeddings: bool = False,
causal: bool = False,
asm: bool = False,
n_langs: int = 1,
use_lang_emb: bool = True,
max_position_embeddings: int = 512,
embed_init_std: float = 2048**-0.5,
init_std: float = 0.02,
layer_norm_eps: float = 1e-12,
bos_index: int = 0,
eos_index: int = 1,
pad_index: int = 2,
unk_index: int = 3,
mask_index: int = 5,
is_encoder: bool = True,
mask_token_id: int = 0,
lang_id: int = 1,
pretrained_kwargs: dict = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import FlaubertConfig, FlaubertModel
hf_config_params = dict(
vocab_size=vocab_size,
pre_norm=pre_norm,
layerdrop=layerdrop,
emb_dim=emb_dim,
n_layers=n_layers,
n_heads=n_heads,
dropout=dropout,
attention_dropout=attention_dropout,
gelu_activation=gelu_activation,
sinusoidal_embeddings=sinusoidal_embeddings,
causal=causal,
asm=asm,
n_langs=n_langs,
use_lang_emb=use_lang_emb,
max_position_embeddings=max_position_embeddings,
embed_init_std=embed_init_std,
init_std=init_std,
layer_norm_eps=layer_norm_eps,
bos_index=bos_index,
eos_index=eos_index,
pad_index=pad_index,
unk_index=unk_index,
mask_index=mask_index,
is_encoder=is_encoder,
mask_token_id=mask_token_id,
lang_id=lang_id,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
FlaubertModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(
FlaubertModel, FlaubertConfig, hf_config_params, vocab_size
)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return FlauBERTConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 to remove CLS and PAD tokens added by tokenizer.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.emb_dim])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("electra", TEXT)
class ELECTRAEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "google/electra-small-discriminator"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 30522,
embedding_size: int = 128,
hidden_size: int = 256,
num_hidden_layers: int = 12,
num_attention_heads: int = 4,
intermediate_size: int = 1024,
hidden_act: str | Callable = "gelu",
hidden_dropout_prob: float = 0.1,
attention_probs_dropout_prob: float = 0.1,
max_position_embeddings: int = 512,
type_vocab_size: int = 2,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-12,
position_embedding_type: str = "absolute",
classifier_dropout: float | None = None,
pretrained_kwargs: dict = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import ElectraConfig, ElectraModel
hf_config_params = dict(
vocab_size=vocab_size,
embedding_size=embedding_size,
hidden_size=hidden_size,
num_hidden_layers=num_hidden_layers,
num_attention_heads=num_attention_heads,
intermediate_size=intermediate_size,
hidden_act=hidden_act,
hidden_dropout_prob=hidden_dropout_prob,
attention_probs_dropout_prob=attention_probs_dropout_prob,
max_position_embeddings=max_position_embeddings,
type_vocab_size=type_vocab_size,
initializer_range=initializer_range,
layer_norm_eps=layer_norm_eps,
position_embedding_type=position_embedding_type,
classifier_dropout=classifier_dropout,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
ElectraModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(ElectraModel, ElectraConfig, hf_config_params, vocab_size)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.max_sequence_length = max_sequence_length
self.reduce_output = reduce_output
if self.reduce_output == "cls_pooled":
_cls_pooled_error_message(self.__class__.__name__)
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
hidden = transformer_outputs[0][:, 1:-1, :]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return ELECTRAConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 to remove CLS and PAD tokens added by tokenizer.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("longformer", TEXT)
class LongformerEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = "allenai/longformer-base-4096"
def __init__(
self,
max_sequence_length: int,
use_pretrained: bool = True,
attention_window: list[int] | int = 512,
sep_token_id: int = 2,
pretrained_model_name_or_path: str = DEFAULT_MODEL_NAME,
saved_weights_in_checkpoint: bool = False,
reduce_output: str | None = "cls_pooled",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int = 50265,
num_tokens: int | None = None,
pretrained_kwargs: dict = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import LongformerConfig, LongformerModel
hf_config_params = dict(
attention_window=attention_window,
sep_token_id=sep_token_id,
vocab_size=vocab_size,
**kwargs,
)
if use_pretrained and not saved_weights_in_checkpoint:
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
LongformerModel, pretrained_model_name_or_path, **pretrained_kwargs
)
else:
transformer = self._init_transformer_from_scratch(
LongformerModel, LongformerConfig, hf_config_params, vocab_size
)
if encoder_config is not None:
self.config = self._init_config(transformer, hf_config_params.keys(), encoder_config)
else:
self.config = None
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(reduce_mode=reduce_output)
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.max_sequence_length = max_sequence_length
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
transformer_outputs = self.transformer.module(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
if self.reduce_output == "cls_pooled":
hidden = transformer_outputs[1]
else:
hidden = transformer_outputs[0][:, 1:-1, :] # bos + [sent] + sep
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return LongformerConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# Subtract 2 to remove CLS and PAD tokens added by Longformer (== Roberta) tokenizer.
return torch.Size(
[
self.max_sequence_length - 2,
self.transformer.module.config.hidden_size,
]
)
elif self.reduce_output == "concat":
# Subtract 2 to account for the start and end tokens.
return torch.Size([self.transformer.module.config.hidden_size * (self.max_sequence_length - 2)])
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("auto_transformer", TEXT)
class AutoTransformerEncoder(HFTextEncoder):
DEFAULT_MODEL_NAME = None
def __init__(
self,
pretrained_model_name_or_path: str,
max_sequence_length: int,
reduce_output: str = "sum",
trainable: bool = False,
adapter: BaseAdapterConfig | None = None,
vocab_size: int | None = None,
pretrained_kwargs: dict = None,
encoder_config=None,
**kwargs,
):
super().__init__()
from transformers import AutoModel
pretrained_kwargs = pretrained_kwargs or {}
transformer, _ = load_pretrained_hf_model_with_hub_fallback(
AutoModel, pretrained_model_name_or_path, **pretrained_kwargs
)
self._maybe_resize_token_embeddings(transformer, vocab_size)
self.config = self._init_config(transformer, [], encoder_config)
# Precompute the set of params that are included in the forward signature of the AutoModel implementation so
# we can filter out unused params during the `forward` call.
self.forward_kwargs = set(inspect.signature(transformer.forward).parameters.keys())
self.transformer = self._wrap_transformer(transformer, adapter, trainable)
self.reduce_output = reduce_output
if self.reduce_output != "cls_pooled":
self.reduce_sequence = SequenceReducer(
reduce_mode=reduce_output, encoding_size=self.transformer.module.config.hidden_size
)
self.max_sequence_length = max_sequence_length
def _maybe_resize_token_embeddings(self, transformer, vocab_size: int | None = None):
"""Overridden because AutoModel should use its own vocab size unless vocab size is explicitly specified."""
if vocab_size is not None:
transformer.resize_token_embeddings(vocab_size)
self.vocab_size = vocab_size
else:
self.vocab_size = transformer.config.vocab_size
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
if mask is not None:
mask = mask.to(torch.int32)
# The forward signature of AutoModel is not consistent across implementations, so we need to make sure we're
# only passing in params included in the forward signature.
kwargs = dict(
input_ids=inputs,
attention_mask=mask,
token_type_ids=torch.zeros_like(inputs),
)
kwargs = {k: v for k, v in kwargs.items() if k in self.forward_kwargs}
transformer_outputs = self.transformer.module(**kwargs)
if self.reduce_output == "cls_pooled":
# This only works if the chosen model returns a `pooler_output`, as the
# BERT base model's forward() does.
hidden = transformer_outputs["pooler_output"]
else:
hidden = transformer_outputs["last_hidden_state"]
hidden = self.reduce_sequence(hidden, self.reduce_output)
return {ENCODER_OUTPUT: hidden}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return AutoTransformerConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
if self.reduce_output is None:
# TODO(justin): This may need to be conditioned on which AutoModel gets chosen.
return torch.Size([self.max_sequence_length, self.transformer.module.config.hidden_size])
if self.reduce_output == "concat":
return torch.Size(
[
self.max_sequence_length * self.transformer.module.config.hidden_size,
]
)
return torch.Size([self.transformer.module.config.hidden_size])
@property
def input_dtype(self) -> torch.dtype:
return torch.int32
@DeveloperAPI
@register_encoder("tf_idf", [TEXT])
class TfIdfEncoder(Encoder):
def __init__(
self,
max_sequence_length: int,
encoder_config=None,
str2idf=None,
vocab=None,
vocab_size: int = None,
**kwargs,
):
super().__init__()
self.config = encoder_config
self.max_sequence_length = max_sequence_length
self.vocab_size = vocab_size
logger.debug(f" {self.name}")
# Convert mapping of token -> frequency to a dense array
idf = np.zeros(vocab_size)
for i, s in enumerate(vocab):
idf[i] = str2idf[s]
self.register_buffer("idf", torch.from_numpy(idf).float().unsqueeze(0))
def forward(self, t: torch.Tensor, mask: torch.Tensor | None = None) -> EncoderOutputDict:
# Compute the term frequency within each row
tf = torch.stack([t_i.bincount(minlength=self.vocab_size) for t_i in torch.unbind(t.long())])
# Normalize the term frequency by the number of tokens in each row
tf = tf / tf.sum(dim=1).unsqueeze(-1)
# Multiply the term frequency by the inverse document frequency
tfidf = tf * self.idf
return {ENCODER_OUTPUT: tfidf}
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return TfIdfEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.vocab_size])
def get_embedding_layer(self) -> nn.Module:
return self
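# Illustrative sketch, not part of the Ludwig API: the arithmetic of `TfIdfEncoder.forward`
# written out in plain Python so it can be checked by hand. The helper name `tfidf_rows`,
# the toy token ids, and the idf values are assumptions made up for this example; the
# encoder itself performs the same steps on torch tensors with `bincount`.

```python
from collections import Counter


def tfidf_rows(token_rows, idf):
    """Return one TF-IDF vector per row of integer token ids.

    token_rows: list of lists of token ids.
    idf: list of inverse document frequencies, indexed by token id.
    """
    vocab_size = len(idf)
    out = []
    for row in token_rows:
        counts = Counter(row)
        total = len(row)
        # Term frequency: occurrences of each id divided by the row length.
        tf = [counts.get(i, 0) / total for i in range(vocab_size)]
        # Scale each term frequency by the matching inverse document frequency.
        out.append([tf_i * idf_i for tf_i, idf_i in zip(tf, idf)])
    return out


toy_rows = [[0, 1, 1, 2], [2, 2, 2, 2]]
toy_idf = [1.0, 2.0, 0.5]
toy_tfidf = tfidf_rows(toy_rows, toy_idf)
```

# Each output row has one entry per vocabulary id, which is why the encoder's
# `output_shape` is `[vocab_size]` rather than `[max_sequence_length]`.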
@DeveloperAPI
@register_encoder("llm", [TEXT])
class LLMEncoder(Encoder):
# Per-adapter type prefixes for parameter names in the state dict, taken from
# https://github.com/huggingface/peft/blob/0f1e9091cc975eb5458cc163bf1843a34fb42b76/src/peft/utils/save_and_load.py#L173C9-L180
ADAPTER_PARAM_NAME_PREFIX = {
"adalora": "lora_",
"ia3": "ia3_",
"lora": "lora_",
}
def __init__(self, encoder_config: LLMEncoderConfig = None, **kwargs):
super().__init__()
self.register_load_state_dict_post_hook(self.remove_missing_non_adapter_keys)
self.config = encoder_config
self.adapter_is_initialized = False
self.model_name = self.config.base_model
self.model_config = AutoConfig.from_pretrained(self.config.base_model)
self.model = load_pretrained_from_config(self.config, model_config=self.model_config)
self.curr_device = next(self.model.parameters()).device
logger.info("Done.")
self.context_len = get_context_len(self.model_config)
# TODO(Arnav): This needs to be more flexible to account for RoPE Scaling
# When merging input IDs and target IDs for LLM fine-tuning, we want to make sure that the merged tensor is
# not longer than the global maximum sequence length. This is provided in the preprocessing config. We never
# want to exceed the maximum possible context length so we also check for that.
if self.config.max_sequence_length:
max_sequence_length = self.config.max_sequence_length
self.max_sequence_length = (
max_sequence_length if max_sequence_length <= self.context_len else self.context_len
)
else:
self.max_sequence_length = self.context_len
# Initialize tokenizer
self.tokenizer = HFTokenizer(self.config.base_model).tokenizer
self.attention_masks = None
clear_data_cache()
# Because we use the last hidden state as encoder output rather than the logits, the final module of the model
# has input pass through but no gradient update in the backward pass. This can lead to a DDP error. Freezing
# the module prevents this from happening. This is done at initialization to prevent "unused parameters" errors
# from happening when the encoder is used before `prepare_for_training` is called, for example during batch
# size tuning.
out_module = list(self.model.modules())[-1]
out_module.requires_grad_(requires_grad=False)
@staticmethod
def get_schema_cls() -> type[BaseEncoderConfig]:
return LLMEncoderConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length, self.model_config.hidden_size])
def get_embedding_layer(self) -> nn.Module:
return self
def initialize_adapter(self):
"""If an adapter config is provided, we want to wrap the model with a PEFT model for fine-tuning."""
if self.config.adapter:
self.model = initialize_adapter(self.model, self.config)
logger.info("==================================================")
logger.info("Trainable Parameter Summary For LLM Encoder Fine-Tuning")
logger.info(f"Fine-tuning with adapter: {self.config.adapter.type}")
self.model.print_trainable_parameters()
logger.info("==================================================")
self.adapter_is_initialized = True
def prepare_for_training(self):
# TODO: this implementation will not work if resuming from a previous checkpoint. Need to fix this.
if self.config.quantization:
self.prepare_for_quantized_training()
self.initialize_adapter()
def prepare_for_quantized_training(self):
from peft import prepare_model_for_kbit_training
self.model = prepare_model_for_kbit_training(self.model, use_gradient_checkpointing=False)
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None):
# Get the hidden state of the last layer and return it as the text encoding
model_outputs = self.model(input_ids=inputs, output_hidden_states=True).hidden_states[-1]
return {ENCODER_OUTPUT: model_outputs.type(torch.float32)}
def _save_to_state_dict(self, destination: dict, prefix: str, keep_vars: bool):
# This is called by `torch.nn.Module.state_dict()` under the hood. `state_dict()` does additional work to
# prep the dictionary, get submodule state, and run hooks. Overriding this method only impacts the
# contents of the state_dict.
# The three args to this method are supplied by Module.state_dict
# https://github.com/pytorch/pytorch/blob/8739d1e3f9b08f4282fe79fc8dacd781d16913ff/torch/nn/modules/module.py#L1824
if self.config.adapter and self.adapter_is_initialized:
# get_peft_model_state_dict generates a state dict that only contains the adapter weights
from peft.utils.save_and_load import get_peft_model_state_dict
sd = get_peft_model_state_dict(self.model)
destination.update(sd)
else:
super()._save_to_state_dict(destination, prefix=prefix, keep_vars=keep_vars)
def state_dict(self, *args, destination=None, prefix="", keep_vars=False):
destination = super().state_dict(destination=destination, prefix=prefix, keep_vars=keep_vars)
if self.config.adapter and self.adapter_is_initialized:
adapter_type_prefix = self.ADAPTER_PARAM_NAME_PREFIX[self.config.adapter.type]
exclude_model_keys = [k for k in destination.keys() if adapter_type_prefix not in k]
for k in exclude_model_keys:
del destination[k]
return destination
def _load_from_state_dict(
self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs
):
# Call this first to make sure torch can do its usual load. In the adapter case, this should essentially be a
# no-op, but the adapter weights will be collected in `unexpected_keys` because PEFT changes the parameter
# names under the hood.
super()._load_from_state_dict(
state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs
)
if self.config.adapter and self.adapter_is_initialized:
# When using an adapter, only the adapter weights are saved, and so we only want to load those weights.
# Under the hood, PEFT alters the names of the parameters, which leads to an "unexpected keys" error when
# using strict mode. This block uses PEFT's version of `load_state_dict` to handle loading in weights.
from peft.utils.save_and_load import set_peft_model_state_dict
adapter_type_prefix = self.ADAPTER_PARAM_NAME_PREFIX[self.config.adapter.type]
peft_model_state_dict = {k: v for k, v in state_dict.items() if adapter_type_prefix in k}
set_peft_model_state_dict(self.model, peft_model_state_dict)
def remove_missing_non_adapter_keys(self, module, incompatible_keys):
"""Update the missing and unexpected keys lists to reflect custom adapter state load logic.
This method should never return anything unless the underlying torch hook logic is updated. Any changes to the
lists in `incompatible_keys` must be made in-place.
Args:
module: The torch module with newly loaded state
incompatible_keys: A tuple with the lists of missing and unexpected keys that were recorded while loading
"""
# If no adapter was used, `LLMEncoder.load_state_dict` should use the default `torch.nn.Module.load_state_dict`
# code path to load weights and no modification should be necessary.
if self.config.adapter and self.adapter_is_initialized:
adapter_type_prefix = self.ADAPTER_PARAM_NAME_PREFIX[self.config.adapter.type]
missing_keys, unexpected_keys = incompatible_keys
# The state dict uses fully qualified parameter names, but this function does not have access to the
# fully qualified names or a prefix to recreate them. Iterate over the missing keys and greedily select the
# first non-adapter key that shares a suffix with a model parameter name.
sample_missing_key = ""
sample_model_key = ""
for k in missing_keys:
# Exclude any adapter weight--those should not be missing. Let torch handle that downstream.
if adapter_type_prefix not in k:
sample_model_keys = [p for p, _ in self.named_parameters() if p in k]
if sample_model_keys:
sample_model_key = sample_model_keys[0]
sample_missing_key = k
break
sd_prefix = sample_missing_key.replace(sample_model_key, "")
# When loading the adapter weights in strict mode, torch will register the base model weights as missing
# from the state dict and raise an exception. The base model weights are intended to be excluded, so the
# missing_keys list is updated post-load to avoid the error.
for k, _ in self.named_parameters():
full_name = f"{sd_prefix}{k}"
if full_name in missing_keys and adapter_type_prefix not in full_name:
missing_keys.remove(full_name)
# peft changes the adapter parameter names under the hood to include the adapter name. When retrieving the
# adapter state dict, however, the name is not included. This causes the adapter weights to be recorded as
# unexpected parameters. `LLMEncoder._load_from_state_dict` loads the adapter parameters using a peft
# utility that accounts for the updated names, so here we remove any adapter parameters from the unexpected
# keys list to avoid errors.
from peft.utils.save_and_load import get_peft_model_state_dict
sd = get_peft_model_state_dict(self.model)
for k in sd.keys():
if k in unexpected_keys:
unexpected_keys.remove(k)
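# Illustrative sketch, not part of the Ludwig API: the key filtering that `LLMEncoder`
# applies to state dicts when an adapter is active, shown on plain dictionaries.
# The helper names and the parameter names in the toy data are assumptions made up for
# this example. `prune_missing_keys` is a simplification: the real hook also matches
# keys against `named_parameters()` before removing them.

```python
ADAPTER_PARAM_NAME_PREFIX = {"adalora": "lora_", "ia3": "ia3_", "lora": "lora_"}


def keep_adapter_keys(state_dict: dict, adapter_type: str) -> dict:
    """Return a new dict holding only the adapter parameters."""
    prefix = ADAPTER_PARAM_NAME_PREFIX[adapter_type]
    return {k: v for k, v in state_dict.items() if prefix in k}


def prune_missing_keys(missing_keys: list, adapter_type: str) -> None:
    """Drop base-model keys from missing_keys in place; adapter keys stay listed."""
    prefix = ADAPTER_PARAM_NAME_PREFIX[adapter_type]
    missing_keys[:] = [k for k in missing_keys if prefix in k]


# Hypothetical parameter names, loosely shaped like a PEFT-wrapped model.
full_state = {
    "model.layers.0.q_proj.weight": "base",
    "model.layers.0.q_proj.lora_A.weight": "adapter-A",
    "model.layers.0.q_proj.lora_B.weight": "adapter-B",
}
adapter_only = keep_adapter_keys(full_state, "lora")

missing = ["model.layers.0.q_proj.weight", "model.layers.0.q_proj.lora_A.weight"]
prune_missing_keys(missing, "lora")
```

# The in-place slice assignment matters: torch's post-load hook inspects the same list
# objects afterwards, so rebinding the name to a new list would have no effect.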
================================================
FILE: ludwig/encoders/types.py
================================================
from typing import TypedDict
import torch
class EncoderOutputDict(TypedDict, total=False):
encoder_output: torch.Tensor
encoder_output_state: torch.Tensor # only used by sequence and h3 encoders
attentions: torch.Tensor # only used by the vit legacy encoder
================================================
FILE: ludwig/error.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from ludwig.api_annotations import PublicAPI
@PublicAPI
class LudwigError(Exception):
"""Base class for all custom exceptions raised by the Ludwig framework."""
def __reduce__(self):
"""Docs: https://docs.python.org/3/library/pickle.html#object.__reduce__."""
raise NotImplementedError(
"Implement __reduce__ for all subclasses of LudwigError as it's necessary for "
"serialization by Ray. See https://github.com/ludwig-ai/ludwig/pull/2695."
)
@PublicAPI
class InputDataError(LudwigError, ValueError):
"""Exception raised for errors in the input data.
Appropriate for data which is not convertible to the input feature type, columns with all missing values,
categorical columns with only one category, etc...
Attributes:
column_name - The name of the input column which caused the error
feature_type - The Ludwig feature type which caused the error (number, binary, category...).
message - An error message describing the situation.
"""
def __init__(self, column_name: str, feature_type: str, message: str):
self.column_name = column_name
self.feature_type = feature_type
self.message = message
super().__init__(message)
def __str__(self):
return f'Column "{self.column_name}" as {self.feature_type} feature: {self.message}'
def __reduce__(self):
return type(self), (self.column_name, self.feature_type, self.message)
@PublicAPI
class ConfigValidationError(LudwigError, ValueError):
"""Exception raised for errors in the Ludwig configuration.
Appropriate for bad configuration values, missing required configuration values, etc...
Attributes:
message - An error message describing the situation.
"""
def __init__(self, message: str):
self.message = message
super().__init__(message)
def __reduce__(self):
return type(self), (self.message,)
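# A minimal sketch of why the `__reduce__` overrides above matter. The default `Exception.__reduce__`
# rebuilds the object from `self.args`, which only holds what was passed to `super().__init__`, so an
# exception whose constructor takes extra arguments cannot be unpickled without an override.
# `SketchInputDataError` and `NoReduceError` are hypothetical stand-ins, not Ludwig classes.

```python
import pickle


class SketchInputDataError(ValueError):
    """Mirrors the pattern above: __reduce__ returns the class and the full constructor arguments."""

    def __init__(self, column_name: str, feature_type: str, message: str):
        self.column_name = column_name
        self.feature_type = feature_type
        self.message = message
        super().__init__(message)

    def __str__(self):
        return f'Column "{self.column_name}" as {self.feature_type} feature: {self.message}'

    def __reduce__(self):
        return type(self), (self.column_name, self.feature_type, self.message)


class NoReduceError(ValueError):
    """Same constructor shape, but relies on the default __reduce__."""

    def __init__(self, column_name: str, message: str):
        self.column_name = column_name
        super().__init__(message)


# With the override, the exception survives a pickle round trip.
original = SketchInputDataError("age", "number", "all values are missing")
restored = pickle.loads(pickle.dumps(original))

# Without it, unpickling calls NoReduceError("bad value") and fails on the missing argument.
try:
    pickle.loads(pickle.dumps(NoReduceError("age", "bad value")))
    default_round_trip_ok = True
except TypeError:
    default_round_trip_ok = False
```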
================================================
FILE: ludwig/evaluate.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import sys
import pandas as pd
from ludwig.api import LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.constants import FULL, TEST, TRAINING, VALIDATION
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
logger = logging.getLogger(__name__)
def evaluate_cli(
model_path: str,
dataset: str | dict | pd.DataFrame = None,
data_format: str = None,
split: str = FULL,
batch_size: int = 128,
skip_save_unprocessed_output: bool = False,
skip_save_predictions: bool = False,
skip_save_eval_stats: bool = False,
skip_collect_predictions: bool = False,
skip_collect_overall_stats: bool = False,
output_directory: str = "results",
gpus: str | int | list[int] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
backend: Backend | str = None,
logging_level: int = logging.INFO,
**kwargs,
) -> None:
"""Loads pre-trained model and evaluates its performance by comparing the predictions against ground truth.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the evaluation.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param split: (str, default: `full`) split on which
to perform predictions. Valid values are `'training'`, `'validation'`,
`'test'` and `'full'`.
:param batch_size: (int, default `128`) size of batches for processing.
:param skip_save_unprocessed_output: (bool, default: `False`) by default
predictions and their probabilities are saved in both raw
unprocessed numpy files containing tensors and as postprocessed
CSV files (one for each output feature). If this parameter is True,
only the CSV ones are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files
:param skip_save_eval_stats: (bool, default: `False`) skips saving test
statistics JSON file
:param skip_collect_predictions: (bool, default: `False`) skips
collecting post-processed predictions during eval.
:param skip_collect_overall_stats: (bool, default: `False`) skips
collecting overall stats during eval.
:param output_directory: (str, default: `'results'`) the directory that
will contain the predictions and the evaluation statistics.
:param gpus: (list, default: `None`) list of GPUs that are available
for evaluation.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param logging_level: (int) Log level that will be sent to stderr.
# Returns
:return: (`None`)
"""
model = LudwigModel.load(
model_path,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
model.evaluate(
dataset=dataset,
data_format=data_format,
batch_size=batch_size,
split=split,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
collect_predictions=not skip_collect_predictions,
collect_overall_stats=not skip_collect_overall_stats,
output_directory=output_directory,
return_type="dict",
)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script loads a pretrained model "
"and evaluates its performance by comparing "
"its predictions with ground truth.",
prog="ludwig evaluate",
usage="%(prog)s [options]",
)
# ---------------
# Data parameters
# ---------------
parser.add_argument("--dataset", help="input data file path", required=True)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
parser.add_argument(
"-s", "--split", default=FULL, choices=[TRAINING, VALIDATION, TEST, FULL], help="the split to test the model on"
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
# -------------------------
# Output results parameters
# -------------------------
parser.add_argument(
"-od", "--output_directory", type=str, default="results", help="directory that contains the results"
)
parser.add_argument(
"-ssuo",
"--skip_save_unprocessed_output",
help="skips saving intermediate NPY output files",
action="store_true",
default=False,
)
parser.add_argument(
"-sses",
"--skip_save_eval_stats",
help="skips saving intermediate JSON eval statistics",
action="store_true",
default=False,
)
parser.add_argument(
"-scp", "--skip_collect_predictions", help="skips collecting predictions", action="store_true", default=False
)
parser.add_argument(
"-scos",
"--skip_collect_overall_stats",
help="skips collecting overall stats",
action="store_true",
default=False,
)
# ------------------
# Generic parameters
# ------------------
parser.add_argument("-bs", "--batch_size", type=int, default=128, help="size of batches")
# ------------------
# Runtime parameters
# ------------------
parser.add_argument("-g", "--gpus", nargs="+", type=int, default=None, help="list of GPUs to use")
parser.add_argument(
"-gml",
"--gpu_memory_limit",
type=float,
default=None,
help="maximum memory fraction [0, 1] allowed to allocate per GPU device",
)
parser.add_argument(
"-dpt",
"--disable_parallel_threads",
action="store_false",
dest="allow_parallel_threads",
help="disable PyTorch from using multithreading for reproducibility",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.evaluate_performance = True
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("evaluate", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.test_performance")
backend = initialize_backend(args.backend)
if backend.is_coordinator():
print_ludwig("Evaluate", LUDWIG_VERSION)
logger.info(f"Dataset path: {args.dataset}")
logger.info(f"Model path: {args.model_path}")
logger.info("")
evaluate_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
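# A minimal sketch of how the parsed arguments above reach `evaluate_cli`: `vars(args)` turns the argparse
# namespace into a dict, named parameters pick up their values, and `**kwargs` absorbs whatever is left.
# It also shows the `store_false` + `dest` pattern used by `--disable_parallel_threads`.
# `run_sketch` and `build_parser_sketch` are hypothetical names used only for illustration.

```python
import argparse


def run_sketch(model_path: str, batch_size: int = 128, allow_parallel_threads: bool = True, **kwargs) -> dict:
    """Stand-in for a *_cli function: named parameters are used, extras land in kwargs."""
    return {
        "model_path": model_path,
        "batch_size": batch_size,
        "allow_parallel_threads": allow_parallel_threads,
        "ignored": sorted(kwargs),
    }


def build_parser_sketch() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("-m", "--model_path", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("-bs", "--batch_size", type=int, default=128)
    # store_false defaults to True and flips the destination to False when the flag is present.
    parser.add_argument("-dpt", "--disable_parallel_threads", action="store_false", dest="allow_parallel_threads")
    return parser


args = build_parser_sketch().parse_args(["-m", "results/model", "--dataset", "data.csv", "-dpt"])
result = run_sketch(**vars(args))
```

# Here `dataset` has no matching named parameter in `run_sketch`, so it ends up in `kwargs` instead of raising.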
================================================
FILE: ludwig/experiment.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import os
import sys
import pandas as pd
from ludwig.api import kfold_cross_validate, LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.constants import CONTINUE_PROMPT, FULL, HYPEROPT, HYPEROPT_WARNING, TEST, TRAINING, VALIDATION
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.data_utils import load_config_from_str, load_yaml, save_json
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig, query_yes_no
logger = logging.getLogger(__name__)
def experiment_cli(
config: str | dict,
dataset: str | dict | pd.DataFrame = None,
training_set: str | dict | pd.DataFrame = None,
validation_set: str | dict | pd.DataFrame = None,
test_set: str | dict | pd.DataFrame = None,
training_set_metadata: str | dict = None,
data_format: str = None,
experiment_name: str = "experiment",
model_name: str = "run",
model_load_path: str = None,
model_resume_path: str = None,
eval_split: str = TEST,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = False,
skip_save_unprocessed_output: bool = False,
skip_save_predictions: bool = False,
skip_save_eval_stats: bool = False,
skip_collect_predictions: bool = False,
skip_collect_overall_stats: bool = False,
output_directory: str = "results",
gpus: str | int | list[int] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
backend: Backend | str = None,
random_seed: int = default_random_seed,
logging_level: int = logging.INFO,
**kwargs,
):
"""Trains a model on a dataset's training and validation splits and uses it to predict on the test split. It
saves the trained model and the statistics of training and testing.
# Inputs
:param config: (Union[str, dict]) in-memory representation of
config or string path to a YAML config file.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the experiment.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param experiment_name: (str, default: `'experiment'`) name for
the experiment.
:param model_name: (str, default: `'run'`) name of the model that is
being used.
:param model_load_path: (str, default: `None`) if this is specified the
loaded model will be used as initialization
(useful for transfer learning).
:param model_resume_path: (str, default: `None`) resumes training of
the model from the path specified. The config is restored.
In addition to config, training statistics and loss for
epoch and the state of the optimizer are restored such that
training can be effectively continued from a previously interrupted
training process.
:param eval_split: (str, default: `test`) split on which
to perform evaluation. Valid values are `training`, `validation`
and `test`.
:param skip_save_training_description: (bool, default: `False`) disables
saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`) disables
saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch to enable resuming of training, but if
the model is really big that can be time consuming and will use
twice as much space. Use this parameter to skip it, but training
cannot be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving
TensorBoard logs. By default Ludwig saves logs for the TensorBoard,
but if it is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `False`) if an input
dataset is provided it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param skip_save_unprocessed_output: (bool, default: `False`) by default
predictions and their probabilities are saved in both raw
unprocessed numpy files containing tensors and as postprocessed
CSV files (one for each output feature). If this parameter is True,
only the CSV ones are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files
:param skip_save_eval_stats: (bool, default: `False`) skips saving test
statistics JSON file
:param skip_collect_predictions: (bool, default: `False`) skips
collecting post-processed predictions during eval.
:param skip_collect_overall_stats: (bool, default: `False`) skips
collecting overall stats during eval.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param random_seed: (int, default: `42`) random seed used for weights
initialization, splits and any other random function.
:param logging_level: (int) Log level that will be sent to stderr.
# Return
:return: (Tuple[LudwigModel, dict, dict, tuple, str]):
`(model, evaluation_statistics, training_statistics, preprocessed_data, output_directory)`
`model` LudwigModel instance
`evaluation_statistics` dictionary with evaluation performance
statistics on the test_set,
`training_statistics` is a nested dictionary of dataset -> feature_name -> metric_name -> List of metrics.
Each metric corresponds to each training checkpoint.
`preprocessed_data` tuple containing preprocessed
`(training_set, validation_set, test_set)`, `output_directory`
filepath string to where results are stored.
"""
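# Load the config first so the hyperopt check inspects its keys rather than the characters of a file path.
if isinstance(config, str):
config = load_yaml(config)
if HYPEROPT in config:
if not query_yes_no(HYPEROPT_WARNING + CONTINUE_PROMPT):
sys.exit(1)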
backend = initialize_backend(backend or config.get("backend"))
if model_load_path:
model = LudwigModel.load(
model_load_path,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
else:
model = LudwigModel(
config=config,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
eval_stats, train_stats, preprocessed_data, output_directory = model.experiment(
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
model_resume_path=model_resume_path,
eval_split=eval_split,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
skip_collect_predictions=skip_collect_predictions,
skip_collect_overall_stats=skip_collect_overall_stats,
output_directory=output_directory,
random_seed=random_seed,
)
return model, eval_stats, train_stats, preprocessed_data, output_directory
def kfold_cross_validate_cli(
k_fold,
config=None,
dataset=None,
data_format=None,
output_directory="results",
random_seed=default_random_seed,
skip_save_k_fold_split_indices=False,
**kwargs,
):
"""Wrapper function to perform k-fold cross validation.
# Inputs
:param k_fold: (int) number of folds to create for the cross-validation.
:param config: (Union[str, dict], default: None) a dictionary or file path containing model configuration. Refer to
the [User Guide](http://ludwig.ai/user_guide/#model-config) for details.
:param dataset: (string, default: None) source containing the dataset to cross-validate on.
:param data_format: (string, default: None) format to interpret the data source.
:param output_directory: (string, default: 'results') directory where the k-fold statistics and split indices
JSON files are saved.
:param random_seed: (int) Random seed used for k-fold splits.
:param skip_save_k_fold_split_indices: (boolean, default: False) Disables saving k-fold split indices.
:return: None
"""
kfold_cv_stats, kfold_split_indices = kfold_cross_validate(
k_fold,
config=config,
dataset=dataset,
data_format=data_format,
output_directory=output_directory,
random_seed=random_seed,
)
# save k-fold cv statistics
save_json(os.path.join(output_directory, "kfold_training_statistics.json"), kfold_cv_stats)
# save k-fold split indices
if not skip_save_k_fold_split_indices:
save_json(os.path.join(output_directory, "kfold_split_indices.json"), kfold_split_indices)
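# A minimal sketch of what seeded k-fold split indices can look like before they are written to JSON.
# This is not Ludwig's implementation: `kfold_indices_sketch`, the `fold_<i>` keys and the
# `training_indices` / `test_indices` field names are assumptions chosen for illustration.

```python
import random


def kfold_indices_sketch(num_rows: int, k_fold: int, random_seed: int = 42) -> dict:
    """Shuffle row indices with a fixed seed and deal them into k held-out folds."""
    indices = list(range(num_rows))
    random.Random(random_seed).shuffle(indices)
    # Every k-th index goes into the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k_fold] for i in range(k_fold)]
    splits = {}
    for i, held_out in enumerate(folds, start=1):
        training = [idx for j, fold in enumerate(folds, start=1) if j != i for idx in fold]
        splits[f"fold_{i}"] = {
            "training_indices": sorted(training),
            "test_indices": sorted(held_out),
        }
    return splits


splits = kfold_indices_sketch(num_rows=10, k_fold=5)
```

# Because the shuffle is seeded, calling the function again with the same arguments yields the same folds,
# which is what makes saved split indices useful for reproducing a run.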
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script trains and evaluates a model", prog="ludwig experiment", usage="%(prog)s [options]"
)
# ----------------------------
# Experiment naming parameters
# ----------------------------
parser.add_argument("--output_directory", type=str, default="results", help="directory that contains the results")
parser.add_argument("--experiment_name", type=str, default="experiment", help="experiment name")
parser.add_argument("--model_name", type=str, default="run", help="name for the model")
# ---------------
# Data parameters
# ---------------
parser.add_argument(
"--dataset",
help="input data file path. "
"If it has a split column, it will be used for splitting "
"(0: train, 1: validation, 2: test), "
"otherwise the dataset will be randomly split",
)
parser.add_argument("--training_set", help="input train data file path")
parser.add_argument("--validation_set", help="input validation data file path")
parser.add_argument("--test_set", help="input test data file path")
parser.add_argument(
"--training_set_metadata",
help="input metadata JSON file path. An intermediate preprocessed file "
"containing the mappings of the input file created "
"the first time a file is used, in the same directory "
"with the same name and a .meta.json extension",
)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
parser.add_argument(
"-es",
"--eval_split",
default=TEST,
choices=[TRAINING, VALIDATION, TEST, FULL],
help="the split to evaluate the model on",
)
parser.add_argument(
"-sspi",
"--skip_save_processed_input",
help="skips saving intermediate HDF5 and JSON files",
action="store_true",
default=False,
)
parser.add_argument(
"-ssuo",
"--skip_save_unprocessed_output",
help="skips saving intermediate NPY output files",
action="store_true",
default=False,
)
# -----------------
# K-fold parameters
# -----------------
parser.add_argument(
"-kf", "--k_fold", type=int, default=None, help="number of folds for a k-fold cross validation run"
)
parser.add_argument(
"-skfsi",
"--skip_save_k_fold_split_indices",
action="store_true",
default=False,
help="disables saving the indices generated to split the training data set "
"for the k-fold cross validation run. If the indices are not needed, "
"skipping this step can slightly increase the overall speed",
)
# ----------------
# Model parameters
# ----------------
config = parser.add_mutually_exclusive_group(required=True)
config.add_argument(
"-c",
"--config",
type=load_yaml,
help="Path to the YAML file containing the model configuration",
)
config.add_argument(
"-cs",
"--config_str",
dest="config",
type=load_config_from_str,
help="JSON or YAML serialized string of the model configuration",
)
parser.add_argument("-mlp", "--model_load_path", help="path of a pretrained model to load as initialization")
parser.add_argument("-mrp", "--model_resume_path", help="path of the model directory to resume training of")
parser.add_argument(
"-sstd",
"--skip_save_training_description",
action="store_true",
default=False,
help="disables saving the description JSON file",
)
parser.add_argument(
"-ssts",
"--skip_save_training_statistics",
action="store_true",
default=False,
help="disables saving training statistics JSON file",
)
parser.add_argument(
"-sstp",
"--skip_save_predictions",
help="skips saving test predictions CSV files",
action="store_true",
default=False,
)
parser.add_argument(
"-sstes",
"--skip_save_eval_stats",
help="skips saving eval statistics JSON file",
action="store_true",
default=False,
)
parser.add_argument(
"-ssm",
"--skip_save_model",
action="store_true",
default=False,
help="disables saving model weights and hyperparameters each time "
"the model improves. "
"By default Ludwig saves model weights after each epoch "
"the validation metric improves, but if the model is really big "
"that can be time consuming. If you do not want to keep "
"the weights and just find out what performance a model can get "
"with a set of hyperparameters, use this parameter to skip it, "
"but the model will not be loadable later on",
)
parser.add_argument(
"-ssp",
"--skip_save_progress",
action="store_true",
default=False,
help="disables saving progress each epoch. By default Ludwig saves "
"weights and stats after each epoch to enable resuming "
"of training, but if the model is really big that can be "
"time consuming and will use twice as much space. Use "
"this parameter to skip it, but training cannot be resumed "
"later on",
)
parser.add_argument(
"-ssl",
"--skip_save_log",
action="store_true",
default=False,
help="disables saving TensorBoard logs. By default Ludwig saves "
"logs for the TensorBoard, but if it is not needed turning it off "
"can slightly increase the overall speed",
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-rs",
"--random_seed",
type=int,
default=42,
help="a random seed that is going to be used anywhere there is a call "
"to a random number generator: data splitting, parameter "
"initialization and training set shuffling",
)
parser.add_argument("-g", "--gpus", nargs="+", type=int, default=None, help="list of GPUs to use")
parser.add_argument(
"-gml",
"--gpu_memory_limit",
type=float,
default=None,
help="maximum memory fraction [0, 1] allowed to allocate per GPU device",
)
parser.add_argument(
"-dpt",
"--disable_parallel_threads",
action="store_false",
dest="allow_parallel_threads",
help="disable PyTorch from using multithreading for reproducibility",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("experiment", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.experiment")
args.backend = initialize_backend(args.backend or args.config.get("backend"))
if args.backend.is_coordinator():
print_ludwig("Experiment", LUDWIG_VERSION)
if args.k_fold is None:
experiment_cli(**vars(args))
else:
kfold_cross_validate_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
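# A minimal sketch of the `--config` / `--config_str` pattern above: two mutually exclusive options share
# one `dest`, and each uses a `type=` callable to turn its raw string into a dict at parse time.
# `load_from_path_sketch` is a hypothetical stand-in for `load_yaml` that does not touch the file system,
# and `json.loads` stands in for `load_config_from_str`.

```python
import argparse
import json


def load_from_path_sketch(path: str) -> dict:
    # Hypothetical loader: a real one would read and parse the file at `path`.
    return {"loaded_from": path}


parser = argparse.ArgumentParser(prog="sketch")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("-c", "--config", type=load_from_path_sketch)
# dest="config" makes both options populate the same attribute on the namespace.
group.add_argument("-cs", "--config_str", dest="config", type=json.loads)

from_str = parser.parse_args(["-cs", '{"backend": "local"}'])
from_path = parser.parse_args(["-c", "config.yaml"])
```

# Either way, downstream code reads a dict from `args.config`. Supplying both options, or neither, makes
# argparse exit with a usage error.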
================================================
FILE: ludwig/explain/__init__.py
================================================
================================================
FILE: ludwig/explain/captum.py
================================================
import copy
import gc
import logging
from collections import defaultdict
from dataclasses import dataclass
import numpy as np
import numpy.typing as npt
import pandas as pd
import torch
from captum.attr import LayerIntegratedGradients, TokenReferenceBase
from captum.attr._utils.input_layer_wrapper import InputIdentity
from torch.autograd import Variable
from tqdm import tqdm
from ludwig.api import LudwigModel
from ludwig.api_annotations import PublicAPI
from ludwig.constants import (
BINARY,
CATEGORY,
DATE,
IMAGE,
INPUT_FEATURES,
MINIMUM_BATCH_SIZE,
NAME,
NUMBER,
PREPROCESSING,
SEQUENCE,
SET,
TEXT,
UNKNOWN_SYMBOL,
)
from ludwig.data.preprocessing import preprocess_for_prediction
from ludwig.explain.explainer import Explainer
from ludwig.explain.explanation import ExplanationsResult
from ludwig.explain.util import get_pred_col, replace_layer_with_copy
from ludwig.features.feature_utils import LudwigFeatureDict
from ludwig.models.ecd import ECD
from ludwig.utils.torch_utils import DEVICE
logger = logging.getLogger(__name__)
# These types are provided as integer values and passed through an embedding layer that breaks integrated gradients.
# As such, we need to take care to encode them before handing them to the explainer.
EMBEDDED_TYPES = {SEQUENCE, TEXT, CATEGORY, SET, DATE}
@dataclass
class ExplanationRunConfig:
"""Mutable state containing runtime configuration for explanation process.
This is useful for updating the batch size used during explanation so it can be propagated across calls to
`get_total_attribution`.
"""
batch_size: int
def retry_with_halved_batch_size(run_config: ExplanationRunConfig):
"""Function wrapper that retries a function with a halved batch size.
We want to maintain as large a batch size as possible to maximize throughput. However, calculating explanations
requires significantly more memory, and the original batch size used during training may be too large and cause a
CUDA OOM error, for example, if using GPUs.
Will raise an error if a non-OOM error is raised, or if the batch size is reduced below `MINIMUM_BATCH_SIZE` and
the function still fails.
"""
def retry_with_halved_batch_size_fn(fn):
def retry_with_halved_batch_size_wrapper(*args, **kwargs):
latest_error = None
while run_config.batch_size >= MINIMUM_BATCH_SIZE:
try:
return fn(*args, **kwargs)
except RuntimeError as e:
latest_error = e
# PyTorch reports CUDA OOM as a RuntimeError (torch.cuda.OutOfMemoryError is a subclass of it).
gc.collect()
if "CUDA out of memory" in str(e) or isinstance(e, torch.cuda.OutOfMemoryError):
logger.exception(f"OOM at batch_size={run_config.batch_size}, halving and trying again")
run_config.batch_size //= 2
else:
# Not a CUDA error
raise
raise RuntimeError(
f"Ran into latest error {latest_error} during explanation. "
"If a CUDA out of memory error, then the batch size could not be reduced any further."
)
return retry_with_halved_batch_size_wrapper
return retry_with_halved_batch_size_fn
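# A minimal, torch-free sketch of the retry pattern above. The decorator closes over a mutable config
# object, so the halved batch size is visible both to the retried function and to later calls.
# `MemoryError` stands in for a CUDA OOM, and `RunConfigSketch`, `MINIMUM_BATCH_SIZE_SKETCH` (assumed
# to be 1 here) and `explain_batch_sketch` are hypothetical names.

```python
from dataclasses import dataclass

MINIMUM_BATCH_SIZE_SKETCH = 1  # assumed lower bound, for illustration only


@dataclass
class RunConfigSketch:
    batch_size: int


def retry_with_halved_batch_size_sketch(run_config: RunConfigSketch):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            latest_error = None
            while run_config.batch_size >= MINIMUM_BATCH_SIZE_SKETCH:
                try:
                    return fn(*args, **kwargs)
                except MemoryError as e:
                    # Simulated OOM: remember it, shrink the shared batch size, and try again.
                    latest_error = e
                    run_config.batch_size //= 2
            raise RuntimeError(f"Could not recover from {latest_error!r}")

        return wrapper

    return decorator


config = RunConfigSketch(batch_size=64)
attempts = []


@retry_with_halved_batch_size_sketch(config)
def explain_batch_sketch() -> str:
    attempts.append(config.batch_size)
    if config.batch_size > 16:
        raise MemoryError("simulated OOM")
    return f"ok at batch_size={config.batch_size}"


result = explain_batch_sketch()
```

# After the call, `config.batch_size` stays at the reduced value, so subsequent decorated calls start from
# the size that last succeeded instead of repeating the failed attempts.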
class WrapperModule(torch.nn.Module):
"""Model used by the explainer to generate predictions.
Unlike Ludwig's ECD class, this wrapper takes individual args as inputs to the forward function. We derive the order
of these args from the order of the input_feature keys in ECD, which is guaranteed to be consistent (Python
dictionaries are ordered consistently), so we can map back to the input feature dictionary as a second step within
this wrapper.
"""
def __init__(self, model: ECD, target: str):
super().__init__()
self.model = model
self.target = target
self.input_maps = LudwigFeatureDict()
self.input_maps.update(
{
arg_name: InputIdentity(arg_name)
for arg_name in self.model.input_features.keys()
if self.model.input_features.get(arg_name).type() not in EMBEDDED_TYPES
}
)
def forward(self, *args):
# Add back the dictionary structure so it conforms to ECD format.
input_features: LudwigFeatureDict = self.model.input_features
inputs = {
# Send the input through the identity layer so that we can use the output of the layer for attribution.
# The exception is embedded feature types (e.g. text/category), where the embedding layer is used for attribution.
feat_name: (
feat_input
if input_features.get(feat_name).type() in EMBEDDED_TYPES
else self.input_maps.get(feat_name)(feat_input)
)
for feat_name, feat_input in zip(input_features.keys(), args)
}
outputs = self.model(inputs)
# At this point we only have the raw logits, but to make explainability work we need the probabilities
# and predictions as well, so derive them.
predictions = {}
for of_name in self.model.output_features:
predictions[of_name] = self.model.output_features.get(of_name).predictions(outputs, of_name)
pred_t = get_pred_col(predictions, self.target)
# If the target feature is a non-scalar type (vector, set, etc.), sum it to get a scalar value.
# https://github.com/pytorch/captum/issues/377
if len(pred_t.shape) > 1 and self.model.output_features.get(self.target).type() not in {
CATEGORY,
NUMBER,
BINARY,
}:
pred_t = torch.sum(pred_t.reshape(pred_t.shape[0], -1), dim=1)
return pred_t
@PublicAPI(stability="experimental")
class IntegratedGradientsExplainer(Explainer):
def explain(self) -> ExplanationsResult:
"""Explain the model's predictions using Integrated Gradients.
# Return
:return: ExplanationsResult containing the explanations.
`global_explanation`: (Explanation) Aggregate explanation for the entire input data.
`row_explanations`: (List[Explanation]) A list of explanations, one for each row in the input data. Each
explanation contains the integrated gradients for each label in the target feature's vocab with respect to
each input feature.
`expected_values`: (List[float]) of length [output feature cardinality] Expected value for each label in the
target feature's vocab. Currently a placeholder of 0.0 per label.
"""
# TODO(travis): add back skip encoders at the end in finally. Shouldn't be an issue in most cases as we
# typically perform explanations on a loaded model and don't use it to predict afterwards.
self.model.model.unskip()
self.model.model.to(DEVICE)
input_features: LudwigFeatureDict = self.model.model.input_features
run_config = ExplanationRunConfig(batch_size=self.model.config_obj.trainer.batch_size)
get_input_tensors_with_retry = retry_with_halved_batch_size(run_config)(get_input_tensors)
get_total_attribution_with_retry = retry_with_halved_batch_size(run_config)(get_total_attribution)
# Convert input data into embedding tensors from the output of the model encoders.
inputs_encoded = get_input_tensors_with_retry(self.model, self.inputs_df, run_config)
sample_encoded = get_input_tensors_with_retry(self.model, self.sample_df, run_config)
baseline = get_baseline(self.model, sample_encoded)
# Compute attribution for each possible output feature label separately.
expected_values = []
for target_idx in tqdm(range(self.vocab_size), desc="Explain"):
total_attribution, feat_to_token_attributions, total_attribution_global = get_total_attribution_with_retry(
self.model,
self.target_feature_name,
target_idx if self.is_category_target else None,
inputs_encoded,
baseline,
len(self.inputs_df),
run_config,
)
# Aggregate token attributions
feat_to_token_attributions_global = {}
for feat_name, token_attributions in feat_to_token_attributions.items():
token_attributions_global = defaultdict(float)
# sum attributions for each token
for token, token_attribution in (ta for tas in token_attributions for ta in tas):
token_attributions_global[token] += abs(token_attribution)
# divide by number of samples to get average attribution per token
token_attributions_global = {
token: token_attribution / max(1, len(token_attributions))
for token, token_attribution in token_attributions_global.items()
}
# convert to list of tuples and sort by attribution
token_attributions_global = sorted(token_attributions_global.items(), key=lambda x: x[1], reverse=True)
# keep only top 100 tokens
token_attributions_global = token_attributions_global[:100]
feat_to_token_attributions_global[feat_name] = token_attributions_global
self.global_explanation.add(
input_features.keys(), total_attribution_global, feat_to_token_attributions_global
)
for i, (feature_attributions, explanation) in enumerate(zip(total_attribution, self.row_explanations)):
# Add the feature attributions to the explanation object for this row.
explanation.add(
input_features.keys(),
feature_attributions,
{k: v[i] for k, v in feat_to_token_attributions.items()},
)
# TODO(travis): for force plots, need something similar to SHAP E[X]
expected_values.append(0.0)
if self.is_binary_target:
# For binary targets, we only need to compute attribution for the positive class (see below).
break
# For binary targets, add an extra attribution for the negative class (false).
if self.is_binary_target:
le_true = self.global_explanation.label_explanations[0]
negated_attributions = le_true.to_array() * -1
negated_token_attributions = {
fa.feature_name: [(t, -a) for t, a in fa.token_attributions]
for fa in le_true.feature_attributions
if fa.token_attributions is not None
}
# Prepend the negative class to the list of label explanations.
self.global_explanation.add(
input_features.keys(), negated_attributions, negated_token_attributions, prepend=True
)
for explanation in self.row_explanations:
le_true = explanation.label_explanations[0]
negated_attributions = le_true.to_array() * -1
negated_token_attributions = {
fa.feature_name: [(t, -a) for t, a in fa.token_attributions]
for fa in le_true.feature_attributions
if fa.token_attributions is not None
}
# Prepend the negative class to the list of label explanations.
explanation.add(input_features.keys(), negated_attributions, negated_token_attributions, prepend=True)
# TODO(travis): for force plots, need something similar to SHAP E[X]
expected_values.append(0.0)
return ExplanationsResult(self.global_explanation, self.row_explanations, expected_values)
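# Illustrative sketch (not part of Ludwig): the global token aggregation performed inside `explain`, written as a
# standalone function over plain Python lists. `aggregate_token_attributions` and `top_k` are hypothetical names, and
# the `max(1, ...)` guard for empty input is an assumption of this sketch.

```python
from collections import defaultdict


def aggregate_token_attributions(rows, top_k=100):
    """rows: one list of (token, attribution) pairs per input row."""
    totals = defaultdict(float)
    for row in rows:
        for token, attribution in row:
            # Sum magnitudes so positive and negative contributions do not cancel.
            totals[token] += abs(attribution)
    num_rows = max(1, len(rows))  # guard against an empty input
    averaged = {token: total / num_rows for token, total in totals.items()}
    # Highest average attribution first, truncated to the top_k tokens.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:top_k]


rows = [
    [("good", 0.5), ("movie", -0.25)],
    [("good", 0.25), ("plot", 0.25)],
]
ranked = aggregate_token_attributions(rows)
```

# Dividing by the number of rows rather than by token frequency means a token that appears in few rows gets a small
# global score even if its per-row attribution is large.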
def get_input_tensors(
model: LudwigModel, input_set: pd.DataFrame, run_config: ExplanationRunConfig
) -> list[torch.Tensor]:
"""Convert the input data into a list of variables, one for each input feature.
# Inputs
:param model: The LudwigModel to use for encoding.
:param input_set: The input data to encode of shape [batch size, num input features].
:param run_config: The run configuration; its `batch_size` controls how the tensors are split.
# Return
:return: A list of variables, one for each input feature. Shape of each variable is [batch size, embedding size].
"""
# Ignore sample_ratio and sample_size from the model config, since we want to explain all the data.
sample_ratio_bak = model.config_obj.preprocessing.sample_ratio
sample_size_bak = model.config_obj.preprocessing.sample_size
model.config_obj.preprocessing.sample_ratio = 1.0
model.config_obj.preprocessing.sample_size = None
config = model.config_obj.to_dict()
training_set_metadata = copy.deepcopy(model.training_set_metadata)
for feature in config[INPUT_FEATURES]:
preprocessing = training_set_metadata[feature[NAME]][PREPROCESSING]
if preprocessing.get("cache_encoder_embeddings"):
preprocessing["cache_encoder_embeddings"] = False
# Convert raw input data into preprocessed tensor data
dataset, _ = preprocess_for_prediction(
config,
dataset=input_set,
training_set_metadata=training_set_metadata,
data_format="auto",
split="full",
include_outputs=False,
backend=model.backend,
callbacks=model.callbacks,
)
# Restore sample_ratio and sample_size
model.config_obj.preprocessing.sample_ratio = sample_ratio_bak
model.config_obj.preprocessing.sample_size = sample_size_bak
# Make sure the number of rows in the preprocessed dataset matches the number of rows in the input data
assert (
dataset.to_df().shape[0] == input_set.shape[0]
), f"Expected {input_set.shape[0]} rows in preprocessed dataset, but got {dataset.to_df().shape[0]}"
# Convert dataset into a dict of tensors, and split each tensor into batches to control GPU memory usage
inputs = {
name: torch.from_numpy(dataset.dataset[feature.proc_column]).split(run_config.batch_size)
for name, feature in model.model.input_features.items()
}
# Dict of lists to list of dicts
input_batches = [dict(zip(inputs, t)) for t in zip(*inputs.values())]
# List of dicts to dict of lists
preproc_inputs = {k: torch.cat([d[k] for d in input_batches]) for k in input_batches[0]}
data_to_predict = [v for _, v in preproc_inputs.items()]
tensors = []
for t in data_to_predict:
# TODO(travis): Consider changing to `if not torch.is_floating_point(t)` to simplify, then handle bool
# case in this block.
if t.dtype in (torch.int8, torch.int16, torch.int32, torch.int64):
# Don't wrap input into a variable if it's an integer type, since it will be used as an index into the
# embedding table. We explain the output of the embedding table, not the input to the embedding table using
# LayerIntegratedGradients.
tensors.append(t)
else:
# Wrap input into a variable so torch will track the gradient and LayerIntegratedGradients can explain it.
if t.dtype == torch.bool:
t = t.to(torch.float32)
tensors.append(Variable(t, requires_grad=True))
return tensors
def get_baseline(model: LudwigModel, sample_encoded: list[Variable]) -> list[torch.Tensor]:
# TODO(travis): pre-compute this during training from the full training dataset.
input_features: LudwigFeatureDict = model.model.input_features
baselines = []
for sample_input, (name, feature) in zip(sample_encoded, input_features.items()):
metadata = model.training_set_metadata[name]
if feature.type() == TEXT:
PAD_IND = metadata.get("pad_idx", metadata.get("word_pad_idx"))
token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
baseline = token_reference.generate_reference(sequence_length=sample_input.shape[1], device=DEVICE)
elif feature.type() == CATEGORY:
most_popular_token = max(metadata["str2freq"], key=metadata["str2freq"].get)
most_popular_tok_idx = metadata["str2idx"].get(most_popular_token)
# If an unknown is defined, use that as the baseline index, else use the most popular token
baseline_tok_idx = metadata["str2idx"].get(UNKNOWN_SYMBOL, most_popular_tok_idx)
baseline = torch.tensor(baseline_tok_idx, device=DEVICE)
elif feature.type() == IMAGE:
baseline = torch.zeros_like(sample_input[0], device=DEVICE)
else:
# For a robust baseline, we take the mean of all samples from the training data.
baseline = torch.mean(sample_input.float(), dim=0)
baselines.append(baseline.unsqueeze(0))
return baselines
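# Illustrative sketch (not part of Ludwig): how the category baseline index above is chosen, using plain dicts in
# place of the training set metadata. `category_baseline_index` is a hypothetical helper, and "<UNK>" is an assumed
# value for UNKNOWN_SYMBOL.

```python
UNKNOWN = "<UNK>"  # assumed value of UNKNOWN_SYMBOL


def category_baseline_index(str2freq, str2idx, unknown_symbol=UNKNOWN):
    # Fall back to the most frequent category when no unknown symbol is in the vocab.
    most_popular = max(str2freq, key=str2freq.get)
    return str2idx.get(unknown_symbol, str2idx[most_popular])


with_unk = category_baseline_index(
    {"<UNK>": 0, "cat": 7, "dog": 3},
    {"<UNK>": 0, "cat": 1, "dog": 2},
)
without_unk = category_baseline_index(
    {"cat": 2, "dog": 9},
    {"cat": 0, "dog": 1},
)
```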
def get_total_attribution(
model: LudwigModel,
target_feature_name: str,
target_idx: int | None,
feature_inputs: list[Variable],
baseline: list[torch.Tensor],
nsamples: int,
run_config: ExplanationRunConfig,
) -> tuple[npt.NDArray[np.float64], dict[str, list[list[tuple[str, float]]]], npt.NDArray[np.float64]]:
"""Compute the total attribution for each input feature for each row in the input data.
Args:
model: The Ludwig model to explain.
target_feature_name: The name of the target feature to explain.
target_idx: The index of the target feature label to explain if the target feature is a category.
feature_inputs: The preprocessed input data as a list of tensors of length [num_features].
baseline: The baseline input data as a list of tensors of length [num_features].
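nsamples: The total number of samples in the input data.
run_config: The run configuration whose `batch_size` is used to split the inputs and as Captum's internal batch size.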
Returns:
The token-attribution pair for each token in the input feature for each row in the input data. The members of
the output tuple are structured as follows:
`total_attribution_rows`: (npt.NDArray[np.float64]) of shape [num_rows, num_features]
The total attribution for each input feature for each row in the input data.
`feat_to_token_attributions`: (Dict[str, List[List[Tuple[str, float]]]]) with values of shape
[num_rows, seq_len, 2]
`total_attribution_global`: (npt.NDArray[np.float64]) of shape [num_features]
The attribution for each input feature aggregated across all input data.
"""
input_features: LudwigFeatureDict = model.model.input_features
# Configure the explainer, which includes wrapping the model so its interface conforms to
# the format expected by Captum.
model.model.zero_grad()
explanation_model = WrapperModule(model.model, target_feature_name)
layers = []
for feat_name, feat in input_features.items():
if feat.type() in EMBEDDED_TYPES:
# Get embedding layer from encoder, which is the first child of the encoder.
target_layer = feat.encoder_obj.get_embedding_layer()
# If the current layer matches any layer in the list, make a deep copy of the layer.
if len(layers) > 0 and any(target_layer == layer for layer in layers):
# Replace the layer with a deep copy of the layer to ensure that the attributions are unique for each input
# feature that uses a shared layer.
# Recommended here: https://github.com/pytorch/captum/issues/794#issuecomment-1093021638
replace_layer_with_copy(feat, target_layer)
target_layer = feat.encoder_obj.get_embedding_layer() # get the new copy
else:
# Get the wrapped input layer.
target_layer = explanation_model.input_maps.get(feat_name)
layers.append(target_layer)
explainer = LayerIntegratedGradients(explanation_model, layers)
feature_inputs_splits = [ipt.split(run_config.batch_size) for ipt in feature_inputs]
baseline = [t.to(DEVICE) for t in baseline]
total_attribution_rows = None
total_attribution_global = None
feat_to_token_attributions = defaultdict(list)
for input_batch in zip(*feature_inputs_splits):
input_batch = [ipt.to(DEVICE) for ipt in input_batch]
attribution = explainer.attribute(
tuple(input_batch),
baselines=tuple(baseline),
target=target_idx,
# https://captum.ai/docs/faq#i-am-facing-out-of-memory-oom-errors-when-using-captum-how-do-i-resolve-this
internal_batch_size=run_config.batch_size,
)
attributions_reduced = []
for a in attribution:
a_reduced = a.detach().cpu()
if a_reduced.ndim == 2 or a_reduced.ndim == 3:
# Reduces category-level attributions of shape [batch_size, embedding_dim] by summing over the
# embedding dimension to get attributions of shape [batch_size].
# Reduces token-level attributions of shape [batch_size, sequence_length, embedding_dim] by summing
# over the embedding dimension to get attributions of shape [batch_size, sequence_length]. We keep
# the sequence dimension so we can map the attributions to the tokens.
a_reduced = a_reduced.sum(dim=-1)
elif a_reduced.ndim == 4:
# Reduce pixel-level attributions of shape [batch_size, num_channels, height, width] by summing
# over the channel and spatial dimensions to get attributions of shape [batch_size].
a_reduced = a_reduced.sum(dim=(1, 2, 3))
attributions_reduced.append(a_reduced)
for inputs, attrs, (name, feat) in zip(input_batch, attributions_reduced, input_features.items()):
if feat.type() == TEXT:
tok_attrs = get_token_attributions(model, name, inputs.detach().cpu(), attrs)
feat_to_token_attributions[name].append(tok_attrs)
# Reduce attribution to [num_input_features, batch_size] by summing over the sequence dimension (if present).
attribution = [a.sum(dim=-1) if a.ndim == 2 else a for a in attributions_reduced]
attribution = np.stack(attribution)
# Transpose to [batch_size, num_input_features]
attribution = attribution.T
if total_attribution_rows is not None:
total_attribution_rows = np.concatenate([total_attribution_rows, attribution], axis=0)
else:
total_attribution_rows = attribution
if total_attribution_global is not None:
total_attribution_global += attribution.sum(axis=0)
else:
total_attribution_global = attribution.sum(axis=0)
total_attribution_global /= nsamples
feat_to_token_attributions = {k: [e for lst in v for e in lst] for k, v in feat_to_token_attributions.items()}
return total_attribution_rows, feat_to_token_attributions, total_attribution_global
def get_token_attributions(
model: LudwigModel,
feature_name: str,
input_ids: torch.Tensor,
token_attributions: torch.Tensor,
) -> list[list[tuple[str, float]]]:
"""Convert token-level attributions to token-attribution pairs of shape [batch_size, sequence_length, 2].
Args:
model: The LudwigModel used to generate the attributions.
feature_name: The name of the feature for which the attributions were generated.
input_ids: The input ids of shape [batch_size, sequence_length].
token_attributions: The token-level attributions of shape [batch_size, sequence_length].
Returns:
An array of token-attribution pairs of shape [batch_size, sequence_length, 2].
"""
assert (
input_ids.dtype == torch.int8
or input_ids.dtype == torch.int16
or input_ids.dtype == torch.int32
or input_ids.dtype == torch.int64
)
# Normalize token-level attributions to visualize the relative importance of each token.
norm = torch.linalg.norm(token_attributions, dim=1)
# Avoid division by zero by setting the norm to 1 wherever it is 0.
norm = torch.where(norm == 0, torch.ones_like(norm), norm)
token_attributions = token_attributions / norm.unsqueeze(-1)
# map input ids to input tokens via the vocabulary
feature = model.training_set_metadata[feature_name]
vocab = feature.get("idx2str", feature.get("word_idx2str"))
idx2str = np.vectorize(lambda idx: vocab[idx])
input_tokens = idx2str(input_ids)
# add attribution to the input tokens
tok_attrs = [
list(zip(t, a)) for t, a in zip(input_tokens, token_attributions.tolist())
] # [batch_size, sequence_length, 2]
return tok_attrs
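# Illustrative sketch (not part of Ludwig): the per-row L2 normalization and token pairing above, using a pure-Python
# stand-in for the torch and numpy calls. `normalize_and_pair` is a hypothetical name and the tokens are made up.

```python
import math


def normalize_and_pair(tokens, attributions):
    """tokens and attributions are parallel [batch_size][sequence_length] lists."""
    paired = []
    for row_tokens, row_attrs in zip(tokens, attributions):
        norm = math.sqrt(sum(a * a for a in row_attrs))
        if norm == 0:
            norm = 1.0  # all-zero row: leave the attributions at zero
        paired.append([(t, a / norm) for t, a in zip(row_tokens, row_attrs)])
    return paired


pairs = normalize_and_pair(
    [["great", "film"], ["<PAD>", "<PAD>"]],
    [[3.0, 4.0], [0.0, 0.0]],
)
```

# Normalizing each row independently makes token scores comparable within a row, but not across rows.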
================================================
FILE: ludwig/explain/captum_ray.py
================================================
from collections import defaultdict
from typing import Any
import numpy as np
import pandas as pd
import ray
from torch.autograd import Variable
from tqdm import tqdm
from ludwig.api import LudwigModel
from ludwig.api_annotations import PublicAPI
from ludwig.explain.captum import (
ExplanationRunConfig,
get_baseline,
get_input_tensors,
get_total_attribution,
IntegratedGradientsExplainer,
retry_with_halved_batch_size,
)
from ludwig.explain.explanation import ExplanationsResult
from ludwig.features.feature_utils import LudwigFeatureDict
from ludwig.utils.torch_utils import get_torch_device
@PublicAPI(stability="experimental")
class RayIntegratedGradientsExplainer(IntegratedGradientsExplainer):
def __init__(self, *args, resources_per_task: dict[str, Any] | None = None, num_workers: int = 1, **kwargs):
super().__init__(*args, **kwargs)
self.resources_per_task = resources_per_task or {}
self.num_workers = num_workers
def explain(self) -> ExplanationsResult:
"""Explain the model's predictions using Integrated Gradients.
# Return
:return: ExplanationsResult containing the explanations.
`global_explanation`: (Explanation) Aggregate explanation for the entire input data.
`row_explanations`: (List[Explanation]) A list of explanations, one for each row in the input data. Each
explanation contains the integrated gradients for each label in the target feature's vocab with respect to
each input feature.
`expected_values`: (List[float]) of length [output feature cardinality] Expected value for each label in the
target feature's vocab. Currently a placeholder of 0.0 per label.
"""
self.model.model.cpu()
input_features: LudwigFeatureDict = self.model.model.input_features
model_ref = ray.put(self.model)
run_config = ExplanationRunConfig(batch_size=self.model.config_obj.trainer.batch_size)
# Convert input data into embedding tensors from the output of the model encoders.
inputs_encoded_ref = get_input_tensors_task.options(**self.resources_per_task).remote(
model_ref, ray.put(self.inputs_df), run_config
)
sample_encoded_ref = get_input_tensors_task.options(**self.resources_per_task).remote(
model_ref, ray.put(self.sample_df), run_config
)
inputs_encoded, run_config = ray.get(inputs_encoded_ref)
sample_encoded, run_config = ray.get(sample_encoded_ref)
baseline = get_baseline(self.model, sample_encoded)
inputs_encoded_ref = ray.put(inputs_encoded)
baseline_ref = ray.put(baseline)
if self.is_category_target:
# Evenly divide the list of labels among the desired number of workers (Ray tasks).
# For example, 4 GPUs -> 4 workers. We do this instead of creating nlabels tasks because
# there is significant overhead to spawning a Ray task.
target_splits = split_list(list(range(self.vocab_size)), self.num_workers)
else:
# No target index to compare against exists for number features.
# For binary targets, we only need to compute attribution for the positive class (see below).
# May need to revisit in the future for additional feature types.
target_splits = [[None]]
# Compute attribution for each possible output feature label separately.
attrs_refs = []
for target_indices in target_splits:
attrs_ref = get_total_attribution_task.options(**self.resources_per_task).remote(
model_ref,
self.target_feature_name,
target_indices,
inputs_encoded_ref,
baseline_ref,
len(self.inputs_df),
run_config,
)
attrs_refs.append(attrs_ref)
# Await the completion of our Ray tasks, then merge the results.
expected_values = []
for attrs_ref in tqdm(attrs_refs, desc="Explain"):
attrs = ray.get(attrs_ref)
for total_attribution, feat_to_token_attributions, total_attribution_global in attrs:
# Aggregate token attributions
feat_to_token_attributions_global = {}
for feat_name, token_attributions in feat_to_token_attributions.items():
token_attributions_global = defaultdict(float)
# sum attributions for each token
for token, token_attribution in (ta for tas in token_attributions for ta in tas):
token_attributions_global[token] += token_attribution
# divide by number of samples to get average attribution per token
token_attributions_global = {
token: token_attribution / max(1, len(token_attributions))
for token, token_attribution in token_attributions_global.items()
}
# convert to list of tuples and sort by attribution
token_attributions_global = sorted(
token_attributions_global.items(), key=lambda x: x[1], reverse=True
)
# keep only top 100 tokens
token_attributions_global = token_attributions_global[:100]
feat_to_token_attributions_global[feat_name] = token_attributions_global
self.global_explanation.add(
input_features.keys(), total_attribution_global, feat_to_token_attributions_global
)
for i, (feature_attributions, explanation) in enumerate(zip(total_attribution, self.row_explanations)):
# Add the feature attributions to the explanation object for this row.
explanation.add(
input_features.keys(),
feature_attributions,
{k: v[i] for k, v in feat_to_token_attributions.items()},
)
# TODO(travis): for force plots, need something similar to SHAP E[X]
expected_values.append(0.0)
# For binary targets, add an extra attribution for the negative class (false).
if self.is_binary_target:
le_true = self.global_explanation.label_explanations[0]
negated_attributions = le_true.to_array() * -1
negated_token_attributions = {
fa.feature_name: [(t, -a) for t, a in fa.token_attributions]
for fa in le_true.feature_attributions
if fa.token_attributions is not None
}
# Prepend the negative class to the list of label explanations.
self.global_explanation.add(
input_features.keys(), negated_attributions, negated_token_attributions, prepend=True
)
for explanation in self.row_explanations:
le_true = explanation.label_explanations[0]
negated_attributions = le_true.to_array() * -1
negated_token_attributions = {
fa.feature_name: [(t, -a) for t, a in fa.token_attributions]
for fa in le_true.feature_attributions
if fa.token_attributions is not None
}
# Prepend the negative class to the list of label explanations.
explanation.add(input_features.keys(), negated_attributions, negated_token_attributions, prepend=True)
# TODO(travis): for force plots, need something similar to SHAP E[X]
expected_values.append(0.0)
return ExplanationsResult(self.global_explanation, self.row_explanations, expected_values)
@ray.remote(max_calls=1)
def get_input_tensors_task(
model: LudwigModel, df: pd.DataFrame, run_config: ExplanationRunConfig
) -> tuple[list[Variable], ExplanationRunConfig]:
model.model.unskip()
model.model.to(get_torch_device())
try:
get_input_tensors_with_retry = retry_with_halved_batch_size(run_config)(get_input_tensors)
return get_input_tensors_with_retry(model, df, run_config), run_config
finally:
model.model.cpu()
@ray.remote(max_calls=1)
def get_total_attribution_task(
model: LudwigModel,
target_feature_name: str,
target_indices: list[int | None],
inputs_encoded: list[Variable],
baseline: list[Variable],
nsamples: int,
run_config: ExplanationRunConfig,
) -> list[tuple[np.ndarray, dict[str, list[list[tuple[str, float]]]], np.ndarray]]:
model.model.unskip()
model.model.to(get_torch_device())
try:
get_total_attribution_with_retry = retry_with_halved_batch_size(run_config)(get_total_attribution)
return [
get_total_attribution_with_retry(
model=model,
target_feature_name=target_feature_name,
target_idx=target_idx,
feature_inputs=inputs_encoded,
baseline=baseline,
nsamples=nsamples,
run_config=run_config,
)
for target_idx in tqdm(target_indices, desc="Explain")
]
finally:
model.model.cpu()
def split_list(v, n):
"""Splits a list into n roughly equal sub-lists, returned lazily as a generator.
Source: https://stackoverflow.com/a/2135920
"""
k, m = divmod(len(v), n)
return (v[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(n))
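# Illustrative sketch: `split_list` copied verbatim into a standalone snippet to show how labels are distributed
# across workers. The example inputs are made up.

```python
def split_list(v, n):
    k, m = divmod(len(v), n)
    return (v[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(n))


# 10 labels across 3 workers: the first `m` chunks receive one extra element.
even_split = list(split_list(list(range(10)), 3))

# More workers than labels: the trailing chunks are empty.
sparse_split = list(split_list([0, 1], 4))
```

# Since the result is a generator, it can be iterated only once; wrap it in `list(...)` if it must be reused.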
================================================
FILE: ludwig/explain/explainer.py
================================================
from abc import ABCMeta, abstractmethod
import pandas as pd
from ludwig.api import LudwigModel
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, CATEGORY, TYPE
from ludwig.explain.explanation import Explanation, ExplanationsResult
from ludwig.explain.util import prepare_data
@DeveloperAPI
class Explainer(metaclass=ABCMeta):
def __init__(
self,
model: LudwigModel,
inputs_df: pd.DataFrame,
sample_df: pd.DataFrame,
target: str,
):
"""Constructor for the explainer.
# Inputs
:param model: (LudwigModel) The LudwigModel to explain.
:param inputs_df: (pd.DataFrame) The input data to explain.
:param sample_df: (pd.DataFrame) A sample of the ground truth data.
:param target: (str) The name of the target to explain.
"""
self.model = model
self.inputs_df = inputs_df
self.sample_df = sample_df
self.target = target
self.inputs_df, self.sample_df, self.feature_cols, self.target_feature_name = prepare_data(
model, inputs_df, sample_df, target
)
self.global_explanation = Explanation(self.target_feature_name)
self.row_explanations = [Explanation(self.target_feature_name) for _ in self.inputs_df.index]
# Lookup from column name to output feature
config = self.model.config
self.output_feature_map = {feature["column"]: feature for feature in config["output_features"]}
@property
def is_binary_target(self) -> bool:
"""Whether the target is binary."""
return self.output_feature_map[self.target_feature_name][TYPE] == BINARY
@property
def is_category_target(self) -> bool:
"""Whether the target is categorical."""
return self.output_feature_map[self.target_feature_name][TYPE] == CATEGORY
@property
def vocab_size(self) -> int:
"""The vocab size of the target feature.
For regression (number) this is 1, for binary it is 2, and for category it is the vocab size.
"""
if self.is_category_target:
return self.model.training_set_metadata[self.target_feature_name]["vocab_size"]
elif self.is_binary_target:
return 2
return 1
@abstractmethod
def explain(self) -> ExplanationsResult:
"""Explain the model's predictions.
# Return
:return: ExplanationsResult containing the explanations.
"""
================================================
FILE: ludwig/explain/explanation.py
================================================
from dataclasses import dataclass, field
import numpy as np
import numpy.typing as npt
from ludwig.api_annotations import DeveloperAPI, PublicAPI
@DeveloperAPI
@dataclass
class FeatureAttribution:
"""Stores the attribution for a single input feature."""
# The name of the input feature.
feature_name: str
# The scalar attribution for the input feature.
attribution: float
# (Optional) The attribution for each token in the input feature as an array of shape (seq_len, 2).
token_attributions: list[tuple[str, float]] | None = None
@DeveloperAPI
@dataclass
class LabelExplanation:
"""Stores the feature attributions for a single label in the target feature's vocab."""
# The attribution for each input feature.
feature_attributions: list[FeatureAttribution] = field(default_factory=list)
def add(self, feature_name: str, attribution: float, token_attributions: list[tuple[str, float]] | None = None):
"""Add the attribution for a single input feature."""
self.feature_attributions.append(FeatureAttribution(feature_name, attribution, token_attributions))
def to_array(self) -> npt.NDArray[np.float64]:
"""Convert the explanation to a 1D array of shape (num_features,)."""
return np.array([fa.attribution for fa in self.feature_attributions])
@DeveloperAPI
@dataclass
class Explanation:
"""Stores the explanations for a single row of input data.
Contains the feature attributions for each label in the target feature's vocab.
"""
target: str
# The explanations for each label in the vocab of the target feature.
label_explanations: list[LabelExplanation] = field(default_factory=list)
def add(
self,
feat_names: list[str],
feat_attributions: npt.NDArray[np.float64],
feat_to_token_attributions: dict[str, list[tuple[str, float]]] | None = None,
prepend: bool = False,
):
"""Add the feature attributions for a single label."""
assert len(feat_names) == len(
feat_attributions
), f"Expected {len(feat_names)} feature attributions, got {len(feat_attributions)}"
if len(self.label_explanations) > 0:
# Check that the feature attributions are the same shape as existing explanations.
assert self.label_explanations[0].to_array().shape == feat_attributions.shape, (
f"Expected feature attributions of shape {self.label_explanations[0].to_array().shape}, "
f"got {feat_attributions.shape}"
)
le = LabelExplanation()
for i, feat_name in enumerate(feat_names):
le.add(
feat_name,
feat_attributions[i],
feat_to_token_attributions.get(feat_name) if feat_to_token_attributions else None,
)
self.label_explanations.insert(0, le) if prepend else self.label_explanations.append(le)
def to_array(self) -> npt.NDArray[np.float64]:
"""Convert the explanation to a 2D array of shape (num_labels, num_features)."""
return np.array([le.to_array() for le in self.label_explanations])
@PublicAPI(stability="experimental")
@dataclass
class ExplanationsResult:
# Aggregate explanation for the entire input data.
global_explanation: Explanation # GlobalExplanation
# A list of explanations, one for each row in the input data.
# Each explanation contains the feature attributions for each label in the target feature's vocab.
row_explanations: list[Explanation]
# Expected value for each label in the target feature's vocab.
expected_values: list[float]
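# Illustrative sketch (not part of Ludwig): how the explainers use `Explanation.add(..., prepend=True)` for binary
# targets. Plain nested lists stand in for the `label_explanations` list, and `add_negative_class` is a hypothetical
# helper.

```python
def add_negative_class(label_explanations):
    """label_explanations: list holding one per-feature attribution list for the positive class."""
    positive = label_explanations[0]
    # The negative class is the mirror image of the positive class.
    negative = [-a for a in positive]
    # Prepend so that index 0 is the negative (False) label and index 1 the positive (True) label.
    return [negative] + label_explanations


labels = add_negative_class([[0.5, -0.25]])
```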
================================================
FILE: ludwig/explain/util.py
================================================
from copy import deepcopy
import pandas as pd
import torch
from ludwig.api import LudwigModel
from ludwig.constants import COLUMN, INPUT_FEATURES, PREPROCESSING, SPLIT
from ludwig.data.split import get_splitter
from ludwig.features.base_feature import BaseFeature
def filter_cols(df, cols):
cols = {c.lower() for c in cols}
retain_cols = [c for c in df.columns if c.lower() in cols]
return df[retain_cols]
def prepare_data(model: LudwigModel, inputs_df: pd.DataFrame, sample_df: pd.DataFrame, target: str):
config = model.config
feature_cols = [feature[COLUMN] for feature in config[INPUT_FEATURES]]
if SPLIT in config.get(PREPROCESSING, {}):
# Keep columns required for Ludwig preprocessing
splitter = get_splitter(**config[PREPROCESSING][SPLIT])
feature_cols += splitter.required_columns
target_feature_name = get_feature_name(model, target)
inputs_df = filter_cols(inputs_df, feature_cols)
if sample_df is not None:
sample_df = filter_cols(sample_df, feature_cols)
return inputs_df, sample_df, feature_cols, target_feature_name
def get_pred_col(preds, target):
t = target.lower()
for c in preds.keys():
if c.lower() == t:
if "probabilities" in preds[c]:
return preds[c]["probabilities"]
else:
return preds[c]["predictions"]
raise ValueError(f"Unable to find target column {t} in {preds.keys()}")
def get_feature_name(model: LudwigModel, target: str) -> str:
t = target.lower()
for c in model.training_set_metadata.keys():
if c.lower() == t:
return c
raise ValueError(f"Unable to find target column {t} in {model.training_set_metadata.keys()}")
def get_absolute_module_key_from_submodule(module: torch.nn.Module, submodule: torch.nn.Module):
"""Get the absolute module key for each param in the target layer.
Assumes that the keys in the submodule are relative to the module.
We find the params from the submodule in the module by comparing the data
pointers, since the data returned by named_parameters is by reference.
More information on checking if tensors point to the same place in storage can be found here:
https://discuss.pytorch.org/t/any-way-to-check-if-two-tensors-have-the-same-base/44310/2
"""
absolute_keys = []
for module_key, module_param in module.named_parameters():
for _, submodule_param in submodule.named_parameters():
if submodule_param.data_ptr() == module_param.data_ptr():
absolute_keys.append(module_key)
break
return absolute_keys
def replace_layer_with_copy(feat: BaseFeature, target_layer: torch.nn.Module):
"""Replaces a layer in a feature with a copy of the layer in-place.
This is useful in a tied weights scenario, where a single encoder may be used by multiple features. If we leave
as-is, Captum complains about the resulting computation graph. The solution is to create an identical
(deep) copy of the layer fed into Captum: https://github.com/pytorch/captum/issues/794#issuecomment-1093021638
This is safe to do during the explain step because we are essentially running inference, and no model artifacts are
being saved during the explain step.
TODO(geoffrey): if a user ever wants to train immediately after explain (i.e. w/o loading weights from the disk),
we might want to implement this as a context so that we can restore the original encoder object at the end.
Will defer this implementation for now because that scenario seems unlikely.
At a high-level the approach is the following:
1. Create a deep-copy of the entire encoder object and set it as the feature's encoder object
2. Replace the tensors in the copied encoder object with the tensors from the original encoder object, except for
the tensors in the target layer. We want to explain these tensors, so we want to keep them as deep copies.
This approach ensures that at most 2 copies of the encoder object are in memory at any given time.
"""
with torch.no_grad():
# Get the original encoder object and a mapping from param names to the params themselves.
orig_encoder_obj = feat.encoder_obj
orig_encoder_obj_state_dict = orig_encoder_obj.state_dict()
# Deep copy the original encoder object and set the copy as this feature's encoder object.
copy_encoder_obj = deepcopy(orig_encoder_obj)
feat.encoder_obj = copy_encoder_obj
# We have to get the absolute module key in order to do string matching because the target_layer keys are
# relative to itself. If we were to leave it as-is and attempt to suffix match, we may get duplicates for
# common layers i.e. "LayerNorm.weight" and "LayerNorm.bias". Getting the absolute module key ensures we
# use values like "transformer.module.embedding.LayerNorm.weight" instead.
keys_to_keep_copy = get_absolute_module_key_from_submodule(orig_encoder_obj, target_layer)
# Get the tensors to keep from the copied encoder object. These are the tensors in the target layer.
for key, param in copy_encoder_obj.named_parameters():
if key not in keys_to_keep_copy:
param.data = orig_encoder_obj_state_dict[key].data
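# Illustrative sketch, not part of Ludwig: a torch-free analogue of the two-step approach
# described in the replace_layer_with_copy docstring above (deep-copy everything, then
# re-point every non-target entry back at the original object). The dict of lists is a
# hypothetical stand-in for an encoder's named parameters; the names are made up.

```python
from copy import deepcopy

# Hypothetical stand-in for an encoder: parameter name -> mutable "tensor" (a list).
original = {
    "embedding.weight": [0.1, 0.2],
    "layer_norm.weight": [1.0, 1.0],
    "layer_norm.bias": [0.0, 0.0],
}
# Absolute keys of the target layer, i.e. the parameters that must stay as copies.
keys_to_keep_copy = {"embedding.weight"}

# Step 1: deep-copy the whole container.
copied = deepcopy(original)

# Step 2: re-point every non-target entry at the original object, so only the
# target layer's parameters remain independent copies.
for key in copied:
    if key not in keys_to_keep_copy:
        copied[key] = original[key]

shared = sorted(k for k in copied if copied[k] is original[k])
independent = sorted(k for k in copied if copied[k] is not original[k])
print("shared with original:", shared)
print("independent copies:", independent)
```

# After the loop only the target layer's entry is a separate object, which mirrors the
# docstring's point that the duplicated state is limited to the layer being explained.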
================================================
FILE: ludwig/export.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import os
import sys
from ludwig.api import LudwigModel
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
from ludwig.utils.triton_utils import export_triton as utils_export_triton
logger = logging.getLogger(__name__)
def export_torchscript(
model_path: str, model_only: bool = False, output_path: str | None = None, device: str | None = None, **kwargs
) -> None:
"""Exports a model to torchscript.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param model_only: (bool, default: `False`) If true, scripts and exports the model only.
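:param output_path: (str, default: `None`) directory to store torchscript. If `None`, defaults to model_path.
:param device: (str, default: `None`) device to use for torchscript tracing (e.g. "cuda" or "cpu").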
# Return
:returns: (`None`)
"""
logger.info(f"Model path: {model_path}")
logger.info(f"Saving model only: {model_only}")
if output_path is None:
logger.info("output_path is None, defaulting to model_path")
output_path = model_path
logger.info(f"Output path: {output_path}")
logger.info("\n")
model = LudwigModel.load(model_path)
os.makedirs(output_path, exist_ok=True)
model.save_torchscript(output_path, model_only=model_only, device=device)
logger.info(f"Saved to: {output_path}")
def export_triton(model_path, output_path="model_repository", model_name="ludwig_model", model_version=1, **kwargs):
"""Exports a model in torchscript format with config for Triton serving.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param output_path: (str, default: `'model_repository'`) directory to store the
triton models.
:param model_name: (str, default: `'ludwig_model'`) save triton under this name.
:param model_version: (int, default: `1`) save model under this version.
# Return
:returns: (`None`)
"""
logger.info(f"Model path: {model_path}")
logger.info(f"Output path: {output_path}")
logger.info(f"Model name: {model_name}")
logger.info(f"Model version: {model_version}")
logger.info("\n")
model = LudwigModel.load(model_path)
os.makedirs(output_path, exist_ok=True)
utils_export_triton(model=model, output_path=output_path, model_name=model_name, model_version=model_version)
logger.info(f"Saved to: {output_path}")
def export_mlflow(model_path, output_path="mlflow", registered_model_name=None, **kwargs):
"""Exports a model to MLflow.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param output_path: (str, default: `'mlflow'`) directory to store the
mlflow model.
:param registered_model_name: (str, default: `None`) save mlflow under this
name in the model registry. Saved locally if `None`.
# Return
:returns: (`None`)
"""
logger.info(f"Model path: {model_path}")
logger.info(f"Output path: {output_path}")
logger.info("\n")
from ludwig.contribs.mlflow.model import export_model
export_model(model_path, output_path, registered_model_name)
logger.info(f"Saved to: {output_path}")
def cli_export_torchscript(sys_argv):
parser = argparse.ArgumentParser(
description="This script loads a pretrained model " "and saves it as torchscript.",
prog="ludwig export_torchscript",
usage="%(prog)s [options]",
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument(
"-mo",
"--model_only",
help="Script and export the model only.",
action="store_true",
)
parser.add_argument(
"-d",
"--device",
type=str,
help=(
'Device to use for torchscript tracing (e.g. "cuda" or "cpu"). Ideally, this is the same as the device '
"used when the model is loaded."
),
default=None,
)
# -----------------
# Output parameters
# -----------------
parser.add_argument(
"-op",
"--output_path",
type=str,
help="path where to save the exported model. If not specified, defaults to model_path.",
default=None,
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("export_torchscript", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.export")
print_ludwig("Export Torchscript", LUDWIG_VERSION)
export_torchscript(**vars(args))
def cli_export_triton(sys_argv):
parser = argparse.ArgumentParser(
description="This script loads a pretrained model " "and saves it as torchscript for Triton.",
prog="ludwig export_triton",
usage="%(prog)s [options]",
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument("-mn", "--model_name", help="model name", default="ludwig_model")
parser.add_argument("-mv", "--model_version", type=int, help="model version", default=1)
# -----------------
# Output parameters
# -----------------
parser.add_argument("-op", "--output_path", type=str, help="path where to save the exported model", required=True)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("export_triton", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.export")
print_ludwig("Export Triton", LUDWIG_VERSION)
export_triton(**vars(args))
def cli_export_mlflow(sys_argv):
parser = argparse.ArgumentParser(
description="This script loads a pretrained model " "and saves it as an MLflow model.",
prog="ludwig export_mlflow",
usage="%(prog)s [options]",
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument(
"-mn",
"--registered_model_name",
help="model name to upload to in MLflow model registry",
)
# -----------------
# Output parameters
# -----------------
parser.add_argument(
"-op", "--output_path", type=str, help="path where to save the exported model", default="mlflow"
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("export_mlflow", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.export")
print_ludwig("Export MLflow", LUDWIG_VERSION)
export_mlflow(**vars(args))
if __name__ == "__main__":
if len(sys.argv) > 1:
if sys.argv[1] == "savedmodel":
cli_export_torchscript(sys.argv[2:])
elif sys.argv[1] == "mlflow":
cli_export_mlflow(sys.argv[2:])
elif sys.argv[1] == "triton":
cli_export_triton(sys.argv[2:])
else:
print("Unrecognized command")
else:
print("Unrecognized command")
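# Illustrative sketch, not part of Ludwig: why the export functions above accept **kwargs.
# Each CLI wrapper calls its export function with **vars(args), so every parsed option
# (including ones the function does not name, such as logging_level) is passed as a keyword
# argument. export_stub and the argument values below are hypothetical.

```python
import argparse


def export_stub(model_path, output_path="model_repository", **kwargs):
    # Hypothetical stand-in for an export function: report what was received.
    return model_path, output_path, sorted(kwargs)


parser = argparse.ArgumentParser(prog="export_sketch")
parser.add_argument("-m", "--model_path", required=True)
parser.add_argument("-op", "--output_path", default="model_repository")
parser.add_argument("-l", "--logging_level", default="info")

# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = parser.parse_args(["-m", "results/model", "-l", "debug"])

# vars(args) turns the Namespace into a dict; options without a matching
# named parameter are absorbed by **kwargs instead of raising a TypeError.
result = export_stub(**vars(args))
print(result)
```

# Without **kwargs in the signature, the same call would fail with an unexpected keyword
# argument error for logging_level.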
================================================
FILE: ludwig/features/__init__.py
================================================
================================================
FILE: ludwig/features/audio_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import os
import numpy as np
import torch
import torchaudio
from packaging import version
from ludwig.constants import AUDIO, AUDIO_FEATURE_KEYS, COLUMN, NAME, PREPROCESSING, PROC_COLUMN, SRC, TYPE
from ludwig.features.base_feature import BaseFeatureMixin
from ludwig.features.sequence_feature import SequenceInputFeature
from ludwig.schema.features.audio_feature import AudioInputFeatureConfig
from ludwig.types import FeatureMetadataDict, ModelConfigDict, PreprocessingConfigDict, TrainingSetMetadataDict
from ludwig.utils.audio_utils import (
calculate_mean,
calculate_var,
get_default_audio,
get_fbank,
get_group_delay,
get_length_in_samp,
get_max_length_stft_based,
get_non_symmetric_length,
get_phase_stft_magnitude,
get_stft_magnitude,
is_torch_audio_tuple,
read_audio_from_bytes_obj,
read_audio_from_path,
)
from ludwig.utils.data_utils import get_abs_path
from ludwig.utils.fs_utils import has_remote_protocol
from ludwig.utils.misc_utils import set_default_value
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
_TORCH_200 = version.parse(torch.__version__) >= version.parse("2.0.0")
class _AudioPreprocessing(torch.nn.Module):
audio_feature_dict: dict[str, float | int | str]
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.audio_feature_dict = {
key: value
for key, value in metadata["preprocessing"].items()
if key in AUDIO_FEATURE_KEYS and value is not None
}
self.feature_dim = metadata["feature_dim"]
self.max_length = metadata["max_length"]
self.padding_value = metadata["preprocessing"]["padding_value"]
self.normalization_type = metadata["preprocessing"]["norm"]
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if not torch.jit.isinstance(v, list[tuple[torch.Tensor, int]]):
raise ValueError(f"Unsupported input: {v}")
processed_audio_matrix = []
for audio, sampling_rate_in_hz in v:
processed_audio = AudioFeatureMixin._transform_to_feature(
audio,
sampling_rate_in_hz,
self.audio_feature_dict,
self.feature_dim,
self.max_length,
self.padding_value,
self.normalization_type,
)
processed_audio_matrix.append(processed_audio)
return torch.stack(processed_audio_matrix)
class AudioFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return AUDIO
@staticmethod
def cast_column(column, backend):
return column
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
first_audio_file_path = column.head(1).iloc[0]
_, sampling_rate_in_hz = torchaudio.load(first_audio_file_path)
feature_dim = AudioFeatureMixin._get_feature_dim(preprocessing_parameters, sampling_rate_in_hz)
audio_file_length_limit_in_s = preprocessing_parameters["audio_file_length_limit_in_s"]
max_length = AudioFeatureMixin._get_max_length_feature(
preprocessing_parameters, sampling_rate_in_hz, audio_file_length_limit_in_s
)
return {
"feature_dim": feature_dim,
"sampling_rate_in_hz": sampling_rate_in_hz,
"max_length": max_length,
"reshape": (max_length, feature_dim),
}
@staticmethod
def _get_feature_dim(preprocessing_parameters: PreprocessingConfigDict, sampling_rate_in_hz):
feature_type = preprocessing_parameters[TYPE]
if feature_type == "raw":
feature_dim = 1
elif feature_type == "stft_phase":
feature_dim_symmetric = get_length_in_samp(
preprocessing_parameters["window_length_in_s"], sampling_rate_in_hz
)
feature_dim = 2 * get_non_symmetric_length(feature_dim_symmetric)
elif feature_type in ["stft", "group_delay"]:
feature_dim_symmetric = get_length_in_samp(
preprocessing_parameters["window_length_in_s"], sampling_rate_in_hz
)
feature_dim = get_non_symmetric_length(feature_dim_symmetric)
elif feature_type == "fbank":
feature_dim = preprocessing_parameters["num_filter_bands"]
else:
raise ValueError(f"{feature_type} is not recognized.")
return feature_dim
@staticmethod
def _process_in_memory(
column,
audio_feature_dict,
feature_dim,
max_length,
padding_value,
normalization_type,
audio_file_length_limit_in_s,
backend,
):
df_engine = backend.df_engine
if _TORCH_200:
# Read audio from path if the version of torch is >= 2.0.0.
raw_audio = backend.read_binary_files(column, map_fn=read_audio_from_path)
else:
raw_audio = backend.read_binary_files(column, map_fn=read_audio_from_bytes_obj)
try:
default_audio = get_default_audio([audio for audio in raw_audio if is_torch_audio_tuple(audio)])
except RuntimeError as e:
raise RuntimeError(f"Unable to process audio files provided: {e}") from e
raw_audio = df_engine.map_objects(raw_audio, lambda row: row if is_torch_audio_tuple(row) else default_audio)
processed_audio = df_engine.map_objects(
raw_audio,
lambda row: AudioFeatureMixin._transform_to_feature(
audio=row[0],
sampling_rate_in_hz=row[1],
audio_feature_dict=audio_feature_dict,
feature_dim=feature_dim,
max_length=max_length,
padding_value=padding_value,
normalization_type=normalization_type,
).numpy(), # non-torchscript preprocessing requires np.ndarray
)
audio_stats = df_engine.map_objects(
raw_audio,
lambda row: AudioFeatureMixin._get_stats(
audio=row[0],
sampling_rate_in_hz=row[1],
max_length_in_s=audio_file_length_limit_in_s,
),
)
def reduce(series):
merged_stats = None
for audio_stats in series:
if merged_stats is None:
merged_stats = audio_stats.copy()
else:
AudioFeatureMixin._merge_stats(merged_stats, audio_stats)
return merged_stats
merged_stats = df_engine.reduce_objects(audio_stats, reduce)
merged_stats["mean"] = calculate_mean(merged_stats["sum"], merged_stats["count"])
merged_stats["var"] = calculate_var(merged_stats["sum"], merged_stats["sum2"], merged_stats["count"])
merged_stats["std"] = np.sqrt(merged_stats["var"] / float(merged_stats["count"]))
print_statistics = (
"{} audio files loaded.\n"
"Statistics of audio file lengths:\n"
"- mean: {:.4f}\n"
"- std: {:.4f}\n"
"- max: {:.4f}\n"
"- min: {:.4f}\n"
"- cropped audio_files: {}\n"
"Max length was given as {}s"
).format(
merged_stats["count"],
merged_stats["mean"],
merged_stats["std"],
merged_stats["max"],
merged_stats["min"],
merged_stats["cropped"],
audio_file_length_limit_in_s,
)
logger.debug(print_statistics)
return processed_audio
@staticmethod
def _transform_to_feature(
audio: torch.Tensor,
sampling_rate_in_hz: int,
audio_feature_dict: dict[str, float | int | str],
feature_dim: int,
max_length: int,
padding_value: float,
normalization_type: str | None = None,
type_key: str = TYPE,
):
feature_type: str = str(audio_feature_dict[type_key])
if feature_type == "raw":
audio_feature = torch.unsqueeze(audio[0], dim=-1)
elif feature_type in ["stft", "stft_phase", "group_delay", "fbank"]:
audio_feature = AudioFeatureMixin._get_2D_feature(
audio, feature_type, audio_feature_dict, sampling_rate_in_hz
)
audio_feature = torch.transpose(audio_feature, 0, 1)
else:
raise ValueError(f"{feature_type} is not recognized.")
# Outer conditional is type refinement from Union[str, None] to str
if normalization_type is not None:
if normalization_type == "per_file":
mean = torch.mean(audio_feature, dim=0)
std = torch.std(audio_feature, dim=0)
audio_feature = torch.divide((audio_feature - mean), std + 1.0e-10)
elif normalization_type == "global":
raise ValueError("not implemented yet")
feature_length = audio_feature.shape[0]
broadcast_feature_length = min(feature_length, max_length)
audio_feature_padded = torch.full(
(max_length, feature_dim), padding_value, dtype=torch.float32, device=audio_feature.device
)
audio_feature_padded[:broadcast_feature_length, :] = audio_feature[:max_length, :]
return audio_feature_padded
@staticmethod
def _get_stats(audio, sampling_rate_in_hz, max_length_in_s):
audio_length_in_s = audio.shape[-1] / float(sampling_rate_in_hz)
return {
"count": 1,
"sum": audio_length_in_s,
"sum2": audio_length_in_s * audio_length_in_s,
"min": audio_length_in_s,
"max": audio_length_in_s,
"cropped": 1 if audio_length_in_s > max_length_in_s else 0,
}
@staticmethod
def _merge_stats(merged_stats, audio_stats):
merged_stats["count"] += audio_stats["count"]
merged_stats["sum"] += audio_stats["sum"]
merged_stats["sum2"] += audio_stats["sum2"]
merged_stats["min"] = min(merged_stats["min"], audio_stats["min"])
merged_stats["max"] = max(merged_stats["max"], audio_stats["max"])
merged_stats["cropped"] += audio_stats["cropped"]
@staticmethod
def _get_2D_feature(
audio: torch.Tensor,
feature_type: str,
audio_feature_dict: dict[str, float | int | str],
sampling_rate_in_hz: int,
) -> torch.Tensor:
window_length_in_s = audio_feature_dict["window_length_in_s"]
window_shift_in_s = audio_feature_dict["window_shift_in_s"]
assert torch.jit.isinstance(window_length_in_s, float)
assert torch.jit.isinstance(window_shift_in_s, float)
window_length_in_samp = get_length_in_samp(window_length_in_s, sampling_rate_in_hz)
if "num_fft_points" in audio_feature_dict:
num_fft_points = audio_feature_dict["num_fft_points"]
assert torch.jit.isinstance(num_fft_points, int)
if num_fft_points < window_length_in_samp:
raise ValueError(
"num_fft_points: {} < window length in "
"samples: {} (corresponds to window length"
" in s: {})".format(num_fft_points, window_length_in_samp, window_length_in_s)
)
else:
num_fft_points = window_length_in_samp
if "window_type" in audio_feature_dict:
window_type = audio_feature_dict["window_type"]
assert torch.jit.isinstance(window_type, str)
else:
window_type = "hamming"
if feature_type == "stft_phase":
return get_phase_stft_magnitude(
audio, sampling_rate_in_hz, window_length_in_s, window_shift_in_s, num_fft_points, window_type
)
elif feature_type == "stft":
return get_stft_magnitude(
audio, sampling_rate_in_hz, window_length_in_s, window_shift_in_s, num_fft_points, window_type
)
elif feature_type == "group_delay":
return get_group_delay(
audio, sampling_rate_in_hz, window_length_in_s, window_shift_in_s, num_fft_points, window_type
)
elif feature_type == "fbank":
num_filter_bands = audio_feature_dict["num_filter_bands"]
assert torch.jit.isinstance(num_filter_bands, int)
return get_fbank(
audio,
sampling_rate_in_hz,
window_length_in_s,
window_shift_in_s,
num_fft_points,
window_type,
num_filter_bands,
)
else:
raise ValueError(f'feature_type "{feature_type}" is not recognized.')
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
set_default_value(feature_config["preprocessing"], "in_memory", preprocessing_parameters["in_memory"])
name = feature_config[NAME]
column = input_df[feature_config[COLUMN]]
num_audio_files = len(column)
if num_audio_files == 0:
raise ValueError("There are no audio files in the dataset provided.")
first_audio_entry = next(iter(column))
logger.debug(f"Detected audio feature type is {type(first_audio_entry)}")
if not isinstance(first_audio_entry, str) and not isinstance(first_audio_entry, torch.Tensor):
raise ValueError(
"Invalid audio feature data type. Detected type is {}, "
"expected either string for local/remote file path or Torch Tensor.".format(type(first_audio_entry))
)
src_path = None
if SRC in metadata:
if isinstance(first_audio_entry, str) and not has_remote_protocol(first_audio_entry):
# Resolve relative paths against the directory of the source dataset (e.g. CSV) file.
src_path = os.path.dirname(os.path.abspath(metadata.get(SRC)))
abs_path_column = backend.df_engine.map_objects(
column, lambda row: get_abs_path(src_path, row) if isinstance(row, str) else row
)
num_audio_utterances = len(input_df[feature_config[COLUMN]])
padding_value = preprocessing_parameters["padding_value"]
normalization_type = preprocessing_parameters["norm"]
feature_dim = metadata[name]["feature_dim"]
max_length = metadata[name]["max_length"]
audio_feature_dict = {
key: value
for key, value in preprocessing_parameters.items()
if key in AUDIO_FEATURE_KEYS and value is not None
}
audio_file_length_limit_in_s = preprocessing_parameters["audio_file_length_limit_in_s"]
if num_audio_utterances == 0:
raise ValueError("There are no audio files in the dataset provided.")
if feature_config[PREPROCESSING]["in_memory"]:
audio_features = AudioFeatureMixin._process_in_memory(
abs_path_column,
audio_feature_dict,
feature_dim,
max_length,
padding_value,
normalization_type,
audio_file_length_limit_in_s,
backend,
)
proc_df[feature_config[PROC_COLUMN]] = audio_features
return proc_df
@staticmethod
def _get_max_length_feature(
preprocessing_parameters: PreprocessingConfigDict, sampling_rate_in_hz, audio_length_limit_in_s
):
feature_type = preprocessing_parameters[TYPE]
audio_length_limit_in_samp = audio_length_limit_in_s * sampling_rate_in_hz
if not audio_length_limit_in_samp.is_integer():
raise ValueError(
"Audio_file_length_limit has to be chosen "
"so that {} (in s) * {} (sampling rate in Hz) "
"is an integer.".format(audio_length_limit_in_s, sampling_rate_in_hz)
)
audio_length_limit_in_samp = int(audio_length_limit_in_samp)
if feature_type == "raw":
return audio_length_limit_in_samp
elif feature_type in ["stft", "stft_phase", "group_delay", "fbank"]:
window_length_in_s = preprocessing_parameters["window_length_in_s"]
window_shift_in_s = preprocessing_parameters["window_shift_in_s"]
return get_max_length_stft_based(
audio_length_limit_in_samp, window_length_in_s, window_shift_in_s, sampling_rate_in_hz
)
else:
raise ValueError(f"{feature_type} is not recognized.")
class AudioInputFeature(AudioFeatureMixin, SequenceInputFeature):
def __init__(self, input_feature_config: AudioInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, encoder_obj=encoder_obj, **kwargs)
if not getattr(self.encoder_obj.config, "embedding_size", None):
raise ValueError("embedding_size has to be defined - " 'check "update_config_with_metadata()"')
if not getattr(self.encoder_obj.config, "max_sequence_length", None):
raise ValueError("max_sequence_length has to be defined - " 'check "update_config_with_metadata()"')
def forward(self, inputs, mask=None):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype == torch.float32
assert len(inputs.shape) == 3, f"expected 3D shape, found: {inputs.shape}"
encoder_output = self.encoder_obj(inputs, mask=mask)
return encoder_output
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.encoder_obj.config.max_sequence_length, self.encoder_obj.config.embedding_size])
@property
def input_dtype(self):
return torch.float32
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.max_sequence_length = feature_metadata["max_length"]
feature_config.encoder.embedding_size = feature_metadata["feature_dim"]
feature_config.encoder.should_embed = False
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _AudioPreprocessing(metadata)
@staticmethod
def get_schema_cls():
return AudioInputFeatureConfig
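# Illustrative sketch, not part of Ludwig: the per-file statistics pattern used by
# _get_stats, _merge_stats and the reduce step in _process_in_memory above. Each file
# contributes count/sum/sum2/min/max/cropped, and the merged totals are enough to recover
# the mean and variance of the audio lengths in a single pass. The variance formula below
# is an assumption (unbiased sample variance); ludwig.utils.audio_utils.calculate_var may
# use a different convention. The lengths and limit are made-up values.

```python
import math
import statistics


def length_stats(length_in_s, limit_in_s):
    # One record per audio file, shaped like the dict returned by _get_stats.
    return {
        "count": 1,
        "sum": length_in_s,
        "sum2": length_in_s * length_in_s,
        "min": length_in_s,
        "max": length_in_s,
        "cropped": 1 if length_in_s > limit_in_s else 0,
    }


def merge_stats(merged, other):
    # In-place merge: totals add up, extremes use min/max.
    for key in ("count", "sum", "sum2", "cropped"):
        merged[key] += other[key]
    merged["min"] = min(merged["min"], other["min"])
    merged["max"] = max(merged["max"], other["max"])


lengths = [1.0, 2.5, 4.0, 7.5]  # hypothetical audio lengths in seconds
limit = 5.0  # hypothetical audio_file_length_limit_in_s

merged = None
for length in lengths:
    stats = length_stats(length, limit)
    if merged is None:
        merged = stats.copy()
    else:
        merge_stats(merged, stats)

mean = merged["sum"] / merged["count"]
# Assumed convention: unbiased sample variance from the running sums.
var = (merged["sum2"] - merged["sum"] ** 2 / merged["count"]) / (merged["count"] - 1)
print(merged["count"], mean, var, merged["min"], merged["max"], merged["cropped"])
```

# Because the merge only adds totals and compares extremes, it can be applied in any order,
# which is what makes it usable as a reduce function over a distributed column.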
================================================
FILE: ludwig/features/bag_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from collections import Counter
import numpy as np
import torch
from ludwig.constants import BAG, COLUMN, NAME, PROC_COLUMN
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature
from ludwig.features.feature_utils import set_str_to_idx
from ludwig.features.set_feature import _SetPreprocessing
from ludwig.schema.features.bag_feature import BagInputFeatureConfig
from ludwig.types import FeatureMetadataDict, ModelConfigDict, PreprocessingConfigDict, TrainingSetMetadataDict
from ludwig.utils.strings_utils import create_vocabulary
logger = logging.getLogger(__name__)
class BagFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return BAG
@staticmethod
def cast_column(column, backend):
return column.astype(str)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
vocabulary = create_vocabulary(
column,
preprocessing_parameters["tokenizer"],
num_most_frequent=preprocessing_parameters["most_common"],
lowercase=preprocessing_parameters["lowercase"],
processor=backend.df_engine,
)
return {
"idx2str": vocabulary.vocab,
"str2idx": vocabulary.str2idx,
"str2freq": vocabulary.str2freq,
"vocab_size": len(vocabulary.str2idx),
"max_set_size": vocabulary.max_sequence_length,
}
@staticmethod
def feature_data(column, metadata, preprocessing_parameters: PreprocessingConfigDict, backend):
def to_vector(set_str):
bag_vector = np.zeros((len(metadata["str2idx"]),), dtype=np.float32)
col_counter = Counter(set_str_to_idx(set_str, metadata["str2idx"], preprocessing_parameters["tokenizer"]))
bag_vector[list(col_counter.keys())] = list(col_counter.values())
return bag_vector
return backend.df_engine.map_objects(column, to_vector)
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
proc_df[feature_config[PROC_COLUMN]] = BagFeatureMixin.feature_data(
input_df[feature_config[COLUMN]],
metadata[feature_config[NAME]],
preprocessing_parameters,
backend,
)
return proc_df
class BagInputFeature(BagFeatureMixin, InputFeature):
def __init__(self, input_feature_config: BagInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor)
# Inputs are float32 count vectors (see BagFeatureMixin.feature_data), so a boolean dtype assertion would fail here.
encoder_output = self.encoder_obj(inputs)
return encoder_output
@property
def input_shape(self) -> torch.Size:
return torch.Size([len(self.encoder_obj.config.vocab)])
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.vocab = feature_metadata["idx2str"]
@staticmethod
def get_schema_cls():
return BagInputFeatureConfig
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SetPreprocessing(metadata, is_bag=True)
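# Illustrative sketch, not part of Ludwig: the bag-of-words encoding built by
# BagFeatureMixin.feature_data above, written with plain lists instead of NumPy. Each
# position of the vector holds how many times the corresponding vocabulary entry occurs.
# The vocabulary, the whitespace tokenizer and the "<UNK>" fallback are assumptions made
# for this example; the real set_str_to_idx helper may behave differently.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to vector positions.
str2idx = {"<UNK>": 0, "apple": 1, "banana": 2, "cherry": 3}


def to_bag_vector(text, str2idx):
    # Assumption: split on whitespace and map unknown tokens to "<UNK>".
    indices = (str2idx.get(token, str2idx["<UNK>"]) for token in text.split())
    counts = Counter(indices)
    vector = [0.0] * len(str2idx)
    for idx, n in counts.items():
        vector[idx] = float(n)
    return vector


vector = to_bag_vector("apple banana apple durian", str2idx)
print(vector)
```

# Unlike a set encoding, repeated tokens raise the count at their position, so "apple"
# appearing twice is stored as 2.0 rather than 1.0.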
================================================
FILE: ludwig/features/base_feature.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from abc import ABC, abstractmethod, abstractstaticmethod
from dataclasses import dataclass
from typing import Any
import torch
from torch import Tensor
from ludwig.constants import (
ENCODER_OUTPUT,
ENCODER_OUTPUT_STATE,
HIDDEN,
LENGTHS,
LOGITS,
LOSS,
PREDICTIONS,
PROBABILITIES,
)
from ludwig.decoders.registry import get_decoder_cls
from ludwig.encoders.registry import get_encoder_cls
from ludwig.features.feature_utils import get_input_size_with_dependencies
from ludwig.modules.fully_connected_modules import FCStack
from ludwig.modules.loss_modules import create_loss
from ludwig.modules.metric_modules import LossMetric, LudwigMetric, MeanMetric
from ludwig.modules.metric_registry import get_metric_classes, get_metric_cls, get_metric_tensor_input
from ludwig.modules.reduction_modules import SequenceReducer
from ludwig.schema.features.base import BaseFeatureConfig, BaseOutputFeatureConfig
from ludwig.types import (
FeatureConfigDict,
FeatureMetadataDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import output_feature_utils
from ludwig.utils.calibration import CalibrationModule
from ludwig.utils.torch_utils import LudwigModule
from ludwig.utils.types import DataFrame, TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class BaseFeatureMixin(ABC):
"""Parent class for feature mixins.
Feature mixins support preprocessing functionality shared across input and output features.
"""
@abstractstaticmethod
def type() -> str:
"""Returns the type of feature this mixin supports."""
raise NotImplementedError
@abstractstaticmethod
def cast_column(column: DataFrame, backend) -> DataFrame:
"""Returns a copy of the dataset column for the given feature, potentially after a type cast.
Args:
column: Pandas column of values.
backend: (Union[Backend, str]) Backend to use for feature data processing.
"""
raise NotImplementedError
@abstractstaticmethod
def get_feature_meta(
config: ModelConfigDict,
column: DataFrame,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
"""Returns a dictionary of feature metadata.
Args:
config: Ludwig model config dict.
column: Pandas column of values.
preprocessing_parameters: Preprocessing configuration for this feature.
backend: (Union[Backend, str]) Backend to use for feature data processing.
is_input_feature: Whether the feature is used as an input feature (rather than an output feature).
"""
raise NotImplementedError
@abstractstaticmethod
def add_feature_data(
feature_config: FeatureConfigDict,
input_df: DataFrame,
proc_df: dict[str, DataFrame],
metadata: TrainingSetMetadataDict,
preprocessing_parameters: PreprocessingConfigDict,
backend, # Union[Backend, str]
skip_save_processed_input: bool,
) -> None:
"""Runs preprocessing on the input_df and stores results in the proc_df and metadata dictionaries.
Args:
feature_config: Feature configuration.
input_df: Input DataFrame containing the raw column for this feature.
proc_df: Dict of processed columns of data. Feature data is added to this.
metadata: Metadata returned by get_feature_meta(). Additional information may be added to this.
preprocessing_parameters: Preprocessing configuration for this feature.
backend: (Union[Backend, str]) Backend to use for feature data processing.
skip_save_processed_input: Whether to skip saving the processed input.
"""
raise NotImplementedError
@dataclass
class ModuleWrapper:
"""Used to prevent the PredictModule from showing up as an attribute on the feature module.
This is necessary to avoid inflight errors from DeepSpeed. These errors occur when DeepSpeed believes that a param
is still being processed asynchronously (allgathered, etc.).
"""
module: torch.nn.Module
class PredictModule(torch.nn.Module):
"""Base class for all modules that convert model outputs to predictions.
Explicit member variables needed here for scripting, as Torchscript will not be able to recognize global variables
during scripting.
"""
def __init__(self):
super().__init__()
self.predictions_key = PREDICTIONS
self.probabilities_key = PROBABILITIES
self.logits_key = LOGITS
class BaseFeature:
"""Base class for all features.
Note that this class is non-cooperative (does not forward kwargs), so when constructing feature class hierarchies,
there should be only one parent class that derives from base feature. Other functionality should be put into mixin
classes to avoid the diamond pattern.
"""
def __init__(self, feature: BaseFeatureConfig):
super().__init__()
if not feature.name:
raise ValueError("Missing feature name")
self.feature_name = feature.name
if not feature.column:
feature.column = self.feature_name
self.column = feature.column
self.proc_column = feature.proc_column
class InputFeature(BaseFeature, LudwigModule, ABC):
"""Parent class for all input features."""
def create_sample_input(self, batch_size: int = 2):
# Used by get_model_inputs(), which is used for tracing-based torchscript generation.
return torch.rand([batch_size, *self.input_shape]).to(self.input_dtype)
def unskip(self) -> "InputFeature":
"""Convert feature using passthrough wrapper back to full encoder."""
return self
@staticmethod
@abstractmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
def update_config_after_module_init(self, feature_config):
"""Updates the config after the torch.nn.Module objects have been initialized."""
def initialize_encoder(self, encoder_config):
encoder_cls = get_encoder_cls(self.type(), encoder_config.type)
encoder_schema = encoder_cls.get_schema_cls().Schema()
encoder_params_dict = encoder_schema.dump(encoder_config)
return encoder_cls(encoder_config=encoder_config, **encoder_params_dict)
@classmethod
def get_preproc_input_dtype(cls, metadata: TrainingSetMetadataDict) -> str:
return "string"
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
raise NotImplementedError("Torchscript tracing not supported for feature")
class OutputFeature(BaseFeature, LudwigModule, ABC):
"""Parent class for all output features."""
def __init__(
self,
feature: BaseOutputFeatureConfig,
other_output_features: dict[str, "OutputFeature"],
*args,
**kwargs,
):
"""Defines defaults, overwrites them based on the feature dictionary, and sets up dependencies.
Any output feature can depend on one or more other output features. The `other_output_features` input dictionary
should contain entries for any dependent output features, which is accomplished by constructing output features
in topologically sorted order. Attributes of any dependent output features are used to properly initialize
this feature's sizes.
"""
super().__init__(feature)
# List of names of metrics that this OutputFeature computes.
self.metric_names = []
self.loss = feature.loss
self.reduce_input = feature.reduce_input
self.reduce_dependencies = feature.reduce_dependencies
# List of feature names that this output feature is dependent on.
self.dependencies = feature.dependencies
logger.debug(" output feature fully connected layers")
logger.debug(" FCStack")
self.input_size = get_input_size_with_dependencies(feature.input_size, self.dependencies, other_output_features)
feature.input_size = self.input_size
self.fc_stack = FCStack(
first_layer_input_size=self.input_size,
layers=feature.decoder.fc_layers,
num_layers=feature.decoder.num_fc_layers,
default_output_size=feature.decoder.fc_output_size,
default_use_bias=feature.decoder.fc_use_bias,
default_weights_initializer=feature.decoder.fc_weights_initializer,
default_bias_initializer=feature.decoder.fc_bias_initializer,
default_norm=feature.decoder.fc_norm,
default_norm_params=feature.decoder.fc_norm_params,
default_activation=feature.decoder.fc_activation,
default_dropout=feature.decoder.fc_dropout,
)
self._calibration_module = self.create_calibration_module(feature)
self._prediction_module = ModuleWrapper(self.create_predict_module())
# set up two sequence reducers, one for inputs and other for dependencies
self.reduce_sequence_input = SequenceReducer(reduce_mode=self.reduce_input)
if self.dependencies:
self.dependency_reducers = torch.nn.ModuleDict()
# todo: re-evaluate need for separate handling of `attention` reducer
# currently this code does not support `attention`
for dependency in self.dependencies:
self.dependency_reducers[dependency] = SequenceReducer(reduce_mode=self.reduce_dependencies)
def create_sample_output(self, batch_size: int = 2):
output_shape = self.output_shape
shape = [batch_size, *self.output_shape] if output_shape != torch.Size([1]) else [batch_size]
return torch.rand(shape).to(self.get_output_dtype())
@abstractmethod
def get_prediction_set(self):
"""Returns the set of tensor keys returned by this feature's PredictModule.
TODO(Justin): Move this to the PredictModule.
"""
raise NotImplementedError("OutputFeature is missing implementation for get_prediction_set.")
@classmethod
@abstractmethod
def get_output_dtype(cls):
"""Returns the tensor data type of this feature's outputs."""
def initialize_decoder(self, decoder_config):
# Input to the decoder is the output feature's FC hidden layer.
decoder_config.input_size = self.fc_stack.output_shape[-1]
decoder_cls = get_decoder_cls(self.type(), decoder_config.type)
decoder_schema = decoder_cls.get_schema_cls().Schema()
decoder_params_dict = decoder_schema.dump(decoder_config)
return decoder_cls(decoder_config=decoder_config, **decoder_params_dict)
def train_loss(self, targets: Tensor, predictions: dict[str, Tensor], feature_name):
loss_class = type(self.train_loss_function)
prediction_key = output_feature_utils.get_feature_concat_name(feature_name, loss_class.get_loss_inputs())
return self.train_loss_function(predictions[prediction_key], targets)
def eval_loss(self, targets: Tensor, predictions: dict[str, Tensor]):
loss_class = type(self.train_loss_function)
prediction_key = loss_class.get_loss_inputs()
if isinstance(self.eval_loss_metric, MeanMetric):
# MeanMetric's forward() implicitly updates the running average.
# For MeanMetrics, we use get_current_value() to compute the loss without changing the state. All metrics
# are updated at the BaseModel level as part of update_metrics().
return self.eval_loss_metric.get_current_value(predictions[prediction_key].detach(), targets)
return self.eval_loss_metric(predictions[prediction_key].detach(), targets)
def _setup_loss(self):
self.train_loss_function = create_loss(self.loss)
self._eval_loss_metric = ModuleWrapper(get_metric_cls(self.type(), self.loss.type)(config=self.loss))
def _setup_metrics(self):
kwargs = {}
for name, cls in get_metric_classes(self.type()).items():
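# `cls` is a class object, so issubclass() (not isinstance()) is the check that can match LossMetric types.
if cls.can_report(self) and issubclass(cls, LossMetric):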
kwargs[name] = cls(config=self.loss, **self.metric_kwargs())
elif cls.can_report(self):
kwargs[name] = cls(**self.metric_kwargs())
self._metric_functions = {
LOSS: self.eval_loss_metric,
**kwargs,
}
self.metric_names = sorted(list(self._metric_functions.keys()))
def create_calibration_module(self, feature: BaseOutputFeatureConfig) -> CalibrationModule:
"""Creates and returns a CalibrationModule that converts logits to a probability distribution."""
return None
@property
def eval_loss_metric(self) -> LudwigMetric:
return self._eval_loss_metric.module
@property
def calibration_module(self) -> torch.nn.Module:
"""Returns the CalibrationModule used to convert logits to a probability distribution."""
return self._calibration_module
@abstractmethod
def create_predict_module(self) -> PredictModule:
"""Creates and returns a `nn.Module` that converts raw model outputs (logits) to predictions.
This module is needed when generating the Torchscript model using scripting.
"""
raise NotImplementedError()
@property
def prediction_module(self) -> PredictModule:
"""Returns the PredictModule used to convert model outputs to predictions."""
return self._prediction_module.module
def predictions(self, all_decoder_outputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
"""Computes actual predictions from the outputs of feature decoders.
TODO(Justin): Consider refactoring this to accept feature-specific decoder outputs.
Args:
all_decoder_outputs: A dictionary of {feature name}::{tensor_name} -> output tensor.
Returns:
Dictionary of tensors with predictions as well as any additional tensors that may be
necessary for computing evaluation metrics.
"""
return self.prediction_module(all_decoder_outputs, feature_name)
@abstractmethod
def logits(self, combiner_outputs: dict[str, torch.Tensor], target=None, **kwargs) -> dict[str, torch.Tensor]:
"""Unpacks and feeds combiner_outputs to the decoder. Invoked as part of the output feature's forward pass.
If target is not None, then we are in training.
Args:
combiner_outputs: Dictionary of tensors from the combiner's forward pass.
Returns:
Dictionary of decoder's output tensors (non-normalized), as well as any additional
tensors that may be necessary for computing predictions or evaluation metrics.
"""
raise NotImplementedError("OutputFeature is missing logits() implementation.")
def metric_kwargs(self) -> dict[str, Any]:
"""Returns arguments that are used to instantiate an instance of each metric class."""
return {}
def update_metrics(self, targets: Tensor, predictions: dict[str, Tensor]) -> None:
"""Updates metrics with the given targets and predictions.
Args:
targets: Tensor with target values for this output feature.
predictions: Dict of tensors returned by predictions().
"""
for metric_name, metric_fn in self._metric_functions.items():
prediction_key = get_metric_tensor_input(metric_name)
metric_fn = metric_fn.to(predictions[prediction_key].device)
metric_fn.update(predictions[prediction_key].detach(), targets)
def get_metrics(self):
metric_vals = {}
for metric_name, metric_fn in self._metric_functions.items():
try:
computed_metric = metric_fn.compute()
except Exception as e:
logger.exception(f"Caught exception computing metric: {metric_name} with error: {e}.")
continue
# Metrics from torchmetrics can be a straightforward tensor.
if isinstance(computed_metric, Tensor):
metric_vals[metric_name] = computed_metric.detach().cpu().numpy().item()
else:
# Metrics from torchmetrics can be a dict of tensors.
# For example, ROUGE is returned as a dictionary of tensors.
# Unpack.
for sub_metric_name, metric in computed_metric.items():
metric_vals[sub_metric_name] = metric.detach().cpu().numpy().item()
return metric_vals
def reset_metrics(self):
for _, metric_fn in self._metric_functions.items():
if metric_fn is not None:
metric_fn.reset()
def forward(
self,
combiner_outputs: dict[str, torch.Tensor],
other_output_feature_outputs: dict[str, torch.Tensor],
mask: torch.Tensor | None = None,
target: torch.Tensor | None = None,
) -> dict[str, torch.Tensor]:
"""Forward pass that takes in output from the combiner, and passes it through to the decoder.
Args:
combiner_outputs: Dict of outputs from the combiner.
other_output_feature_outputs: Dict of tensors from other output features. Used for resolving dependencies.
mask: (Unused). Tensor for masking.
target: Tensor with targets. During training, targets != None. During prediction, targets = None.
Returns:
Dict of output tensors, with at least 'last_hidden' and 'logits' as keys, as well as any additional tensor
results from the decoder.
"""
# extract the combined hidden layer
combiner_hidden = combiner_outputs["combiner_output"]
hidden = self.prepare_decoder_inputs(combiner_hidden, other_output_feature_outputs, mask=mask)
# ================ Predictions ================
logits_input = {HIDDEN: hidden}
# pass supplemental data from encoders to decoder
if ENCODER_OUTPUT_STATE in combiner_outputs:
logits_input[ENCODER_OUTPUT_STATE] = combiner_outputs[ENCODER_OUTPUT_STATE]
if LENGTHS in combiner_outputs:
logits_input[LENGTHS] = combiner_outputs[LENGTHS]
logits = self.logits(logits_input, target=target)
# For binary and number features, self.logits() is a tensor.
# There are two special cases where self.logits() is a dict:
# categorical
# keys: logits, projection_input
# sequence
# keys: logits
# TODO(Justin): Clean this up.
if isinstance(logits, Tensor):
logits = {"logits": logits}
# For multi-class features, we must choose a consistent tuple subset.
return {
# last_hidden used for dependencies processing
"last_hidden": hidden,
**logits,
}
@abstractmethod
def postprocess_predictions(
self,
result: dict[str, Tensor],
metadata: TrainingSetMetadataDict,
):
raise NotImplementedError
@classmethod
def get_postproc_output_dtype(cls, metadata: TrainingSetMetadataDict) -> str:
return "string"
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
raise NotImplementedError("Torchscript tracing not supported for feature")
@staticmethod
@abstractmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
@abstractmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
pass
def output_specific_fully_connected(self, inputs, mask=None):
feature_hidden = inputs
original_feature_hidden = inputs
# flatten inputs
if len(original_feature_hidden.shape) > 2:
feature_hidden = torch.reshape(feature_hidden, (-1, list(feature_hidden.shape)[-1]))
# pass it through fc_stack
feature_hidden = self.fc_stack(feature_hidden, mask=mask)
feature_hidden_size = feature_hidden.shape[-1]
# reshape back to original first and second dimension
if len(original_feature_hidden.shape) > 2:
sequence_length = original_feature_hidden.shape[1]
feature_hidden = torch.reshape(feature_hidden, (-1, sequence_length, feature_hidden_size))
return feature_hidden
def prepare_decoder_inputs(
self, combiner_hidden: Tensor, other_output_features: dict[str, Tensor], mask=None
) -> Tensor:
"""Takes the combiner output and the outputs of other output features computed so far and performs:
- reduction of combiner outputs (if needed)
- concatenating the outputs of dependent features (if needed)
- output_specific fully connected layers (if needed)
Args:
combiner_hidden: hidden state of the combiner
other_output_features: output tensors from other output features
"""
# ================ Reduce Inputs ================
feature_hidden = combiner_hidden
if self.reduce_input is not None and len(combiner_hidden.shape) > 2:
feature_hidden = self.reduce_sequence_input(combiner_hidden)
# ================ Concat Dependencies ================
if self.dependencies:
feature_hidden = output_feature_utils.concat_dependencies(
self.column, self.dependencies, self.dependency_reducers, feature_hidden, other_output_features
)
# ================ Output-wise Fully Connected ================
feature_hidden = self.output_specific_fully_connected(feature_hidden, mask=mask)
return feature_hidden
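# The OutputFeature docstring above requires that output features be constructed in
# topologically sorted order, so that every dependency exists before the feature that
# consumes it. The following is a minimal, standalone sketch of that ordering using the
# standard library. The function name `build_order` and the feature names are hypothetical
# and only illustrate the idea; this is not Ludwig's own ordering code.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Return feature names ordered so that each feature follows its dependencies.

    `dependencies` maps a feature name to the names of the features it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        return list(sorter.static_order())
    except CycleError as e:
        # A cycle means no valid construction order exists.
        raise ValueError(f"Circular output feature dependencies: {e.args[1]}") from e


# Hypothetical example: "intent" depends on "topic"; "reply" depends on both.
order = build_order({"topic": [], "intent": ["topic"], "reply": ["topic", "intent"]})
print(order)
```

# Building features in this order means the `other_output_features` dict already holds
# every dependency by the time a dependent feature computes its input size.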
class PassthroughPreprocModule(torch.nn.Module):
"""Combines preprocessing and encoding into a single module for TorchScript inference.
For encoder outputs that were cached during preprocessing, the encoder is simply the identity function in the ECD
module. As such, we need this module to apply the encoding that would normally be done during preprocessing for
realtime inference.
"""
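def __init__(self, preproc: torch.nn.Module, encoder: torch.nn.Module):
# torch.nn.Module must be initialized before submodules can be assigned as attributes.
super().__init__()
self.preproc = preproc
self.encoder = encoder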
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
preproc_v = self.preproc(v)
return self.encoder(preproc_v)
def create_passthrough_input_feature(feature: InputFeature, config: BaseFeatureConfig) -> InputFeature:
"""Creates a shim input feature that acts as a transparent identity function on the input data.
Used when the feature's encoder embeddings were cached in preprocessing. This way, we don't need to make any changes
to the underlying interface in such cases other than to swap the feature that would normally do the encoding with
this one.
"""
class _InputPassthroughFeature(InputFeature):
def __init__(self, config: BaseFeatureConfig):
super().__init__(config)
def forward(self, inputs, mask=None):
assert isinstance(inputs, torch.Tensor)
return {ENCODER_OUTPUT: inputs}
@property
def input_dtype(self):
# Doesn't matter as combiner will need to cast them to float32 anyway
return torch.float32
@property
def input_shape(self):
return feature.encoder_obj.output_shape
@property
def output_shape(self) -> torch.Size:
return feature.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
return feature.update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs)
@staticmethod
def get_schema_cls():
return feature.get_schema_cls()
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return PassthroughPreprocModule(feature.create_preproc_module(metadata), feature)
@staticmethod
def type():
return feature.type()
def unskip(self) -> InputFeature:
return feature
@property
def encoder_obj(self) -> torch.nn.Module:
return feature.encoder_obj
return _InputPassthroughFeature(config)
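# The passthrough shim above relies on a class defined inside a function that closes over
# the original feature: its forward step is the identity, while `unskip()` hands back the
# wrapped feature. Below is a minimal sketch of the same pattern without torch. All names
# (`Feature`, `UppercaseFeature`, `create_passthrough`) are hypothetical stand-ins and are
# not part of Ludwig.

```python
class Feature:
    """Hypothetical stand-in for an input feature with an encoder."""

    def encode(self, value: str) -> str:
        raise NotImplementedError

    def unskip(self) -> "Feature":
        # A regular feature is already the full encoder.
        return self


class UppercaseFeature(Feature):
    def encode(self, value: str) -> str:
        return value.upper()


def create_passthrough(feature: Feature) -> Feature:
    """Return a shim that skips encoding but remembers the original feature."""

    class _Passthrough(Feature):
        def encode(self, value: str) -> str:
            # Identity: the value is assumed to have been encoded (cached) earlier.
            return value

        def unskip(self) -> Feature:
            # Recover the real feature via the enclosing function's closure.
            return feature

    return _Passthrough()


original = UppercaseFeature()
shim = create_passthrough(original)
```

# Callers can treat `shim` like any other feature, and call `unskip()` when the full
# encoder is needed again (for example, for inference on raw inputs).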
================================================
FILE: ludwig/features/binary_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import numpy as np
import torch
from ludwig.constants import BINARY, COLUMN, HIDDEN, LOGITS, NAME, PREDICTIONS, PROBABILITIES, PROBABILITY, PROC_COLUMN
from ludwig.error import InputDataError
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature, OutputFeature, PredictModule
from ludwig.schema.features.binary_feature import BinaryInputFeatureConfig, BinaryOutputFeatureConfig
from ludwig.types import (
FeatureConfigDict,
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import calibration, output_feature_utils, strings_utils
from ludwig.utils.eval_utils import (
average_precision_score,
ConfusionMatrix,
precision_recall_curve,
roc_auc_score,
roc_curve,
)
from ludwig.utils.types import DataFrame, TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class _BinaryPreprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
str2bool = metadata.get("str2bool")
self.str2bool = str2bool or {v: True for v in strings_utils.BOOL_TRUE_STRS}
self.should_lower = str2bool is None
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if torch.jit.isinstance(v, list[tuple[torch.Tensor, int]]):
raise ValueError(f"Unsupported input: {v}")
if torch.jit.isinstance(v, list[torch.Tensor]):
v = torch.stack(v)
if torch.jit.isinstance(v, torch.Tensor):
return v.to(dtype=torch.float32)
v = [s.strip() for s in v]
if self.should_lower:
v = [s.lower() for s in v]
indices = [self.str2bool.get(s, False) for s in v]
return torch.tensor(indices, dtype=torch.float32)
class _BinaryPostprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
bool2str = metadata.get("bool2str")
self.bool2str = {i: v for i, v in enumerate(bool2str)} if bool2str is not None else None
self.predictions_key = PREDICTIONS
self.probabilities_key = PROBABILITIES
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
probabilities = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.probabilities_key)
if self.bool2str is not None:
predictions = predictions.to(dtype=torch.int32)
predictions = [self.bool2str.get(pred, self.bool2str[0]) for pred in predictions]
probabilities = torch.stack([1 - probabilities, probabilities], dim=-1)
return {
self.predictions_key: predictions,
self.probabilities_key: probabilities,
}
class _BinaryPredict(PredictModule):
def __init__(self, threshold, calibration_module=None):
super().__init__()
self.threshold = threshold
self.calibration_module = calibration_module
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
if self.calibration_module is not None:
probabilities = self.calibration_module(logits)
else:
probabilities = torch.sigmoid(logits)
predictions = probabilities >= self.threshold
return {
self.probabilities_key: probabilities,
self.predictions_key: predictions,
self.logits_key: logits,
}
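# `_BinaryPredict` above maps logits to probabilities with a sigmoid (when no calibration
# module is set) and then compares them against the threshold. A minimal pure-Python
# sketch of that step, using lists in place of tensors; the function names and the
# plain-string dict keys are assumptions made for illustration only.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_predict(logits: list[float], threshold: float = 0.5) -> dict[str, list]:
    """Convert raw logits to probabilities and thresholded boolean predictions."""
    probabilities = [sigmoid(z) for z in logits]
    # A prediction is True when the probability reaches the threshold (>=).
    predictions = [p >= threshold for p in probabilities]
    return {"probabilities": probabilities, "predictions": predictions, "logits": logits}


default_out = binary_predict([0.0, 2.0, -2.0])
strict_out = binary_predict([0.0, 2.0, -2.0], threshold=0.9)
```

# Raising the threshold trades recall for precision: a logit of 2.0 gives a probability
# of roughly 0.88, which counts as True at the default threshold but not at 0.9.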
class BinaryFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return BINARY
@staticmethod
def cast_column(column, backend):
"""Cast column of dtype object to bool.
Unchecked casting to boolean when given a column of dtype object converts all non-empty cells to True. We check
the values of the column directly and manually determine the best dtype to use.
"""
values = backend.df_engine.compute(column.drop_duplicates())
if strings_utils.values_are_pandas_numbers(values):
# If numbers, convert to float so it can be converted to bool
column = column.astype(float).astype(bool)
elif strings_utils.values_are_pandas_bools(values):
# If booleans, manually assign boolean values
column = backend.df_engine.map_objects(
column, lambda x: x.lower() in strings_utils.PANDAS_TRUE_STRS
).astype(bool)
else:
# If neither numbers nor booleans, they are strings (objects)
column = column.astype(object)
return column
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column: DataFrame,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
if column.dtype != object:
return {}
distinct_values = backend.df_engine.compute(column.drop_duplicates())
if len(distinct_values) > 2:
raise InputDataError(
column.name, BINARY, f"expects 2 distinct values, found {distinct_values.values.tolist()}"
)
if preprocessing_parameters["fallback_true_label"]:
fallback_true_label = preprocessing_parameters["fallback_true_label"]
else:
fallback_true_label = sorted(distinct_values)[0]
preprocessing_parameters["fallback_true_label"] = fallback_true_label
try:
str2bool = {v: strings_utils.str2bool(v) for v in distinct_values}
except Exception as e:
logger.warning(
f"Binary feature {column.name} has at least 1 unconventional boolean value: {e}. "
f"We will now interpret {fallback_true_label} as 1 and the other values as 0. "
f"If this is incorrect, please use the category feature type or "
f"manually specify the true value with `preprocessing.fallback_true_label`."
)
str2bool = {v: strings_utils.str2bool(v, fallback_true_label) for v in distinct_values}
bool2str = [k for k, v in sorted(str2bool.items(), key=lambda item: item[1])]
return {"str2bool": str2bool, "bool2str": bool2str, "fallback_true_label": fallback_true_label}
@staticmethod
def add_feature_data(
feature_config: FeatureConfigDict,
input_df: DataFrame,
proc_df: dict[str, DataFrame],
metadata: TrainingSetMetadataDict,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input: bool,
) -> None:
column = input_df[feature_config[COLUMN]]
if column.dtype == object:
metadata = metadata[feature_config[NAME]]
if "str2bool" in metadata:
column = backend.df_engine.map_objects(column, lambda x: metadata["str2bool"][str(x)])
else:
# No predefined mapping from string to bool, so compute it directly
column = backend.df_engine.map_objects(column, strings_utils.str2bool)
proc_df[feature_config[PROC_COLUMN]] = column.astype(np.bool_)
return proc_df
class BinaryInputFeature(BinaryFeatureMixin, InputFeature):
def __init__(self, input_feature_config: BinaryInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
input_feature_config.encoder.input_size = self.input_shape[-1]
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.bool, torch.int64, torch.float32]
assert len(inputs.shape) == 1 or (len(inputs.shape) == 2 and inputs.shape[1] == 1)
if len(inputs.shape) == 1:
inputs = inputs[:, None]
# Inputs to the binary encoder could be of dtype torch.bool. Linear layer
# weights are of dtype torch.float32. The inputs and the weights need to
# be of the same dtype.
if inputs.dtype == torch.bool:
inputs = inputs.type(torch.float32)
encoder_outputs = self.encoder_obj(inputs)
return encoder_outputs
@property
def input_dtype(self):
return torch.bool
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
def get_schema_cls():
return BinaryInputFeatureConfig
def create_sample_input(self, batch_size: int = 2):
return torch.rand([batch_size]) > 0.5
@classmethod
def get_preproc_input_dtype(cls, metadata: TrainingSetMetadataDict) -> str:
return "string" if metadata.get("str2bool") else "int32"
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _BinaryPreprocessing(metadata)
class BinaryOutputFeature(BinaryFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: BinaryOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
self.threshold = output_feature_config.threshold
super().__init__(output_feature_config, output_features, **kwargs)
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, **kwargs):
hidden = inputs[HIDDEN]
return self.decoder_obj(hidden)
def create_calibration_module(self, feature: BinaryOutputFeatureConfig) -> torch.nn.Module:
"""Creates the appropriate calibration module based on the feature config.
Today, only one type of calibration ("temperature_scaling") is available, but more options may be supported in
the future.
"""
if feature.calibration:
calibration_cls = calibration.get_calibration_cls(BINARY, "temperature_scaling")
return calibration_cls(binary=True)
return None
def create_predict_module(self) -> PredictModule:
# A lot of code assumes output features have a prediction module, but if we are using a passthrough
# decoder then there is no threshold.
threshold = getattr(self, "threshold", 0.5)
return _BinaryPredict(threshold, calibration_module=self.calibration_module)
def get_prediction_set(self):
return {PREDICTIONS, PROBABILITIES, LOGITS}
@classmethod
def get_output_dtype(cls):
return torch.bool
@property
def output_shape(self) -> torch.Size:
return torch.Size([1])
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
overall_stats = {}
confusion_matrix = ConfusionMatrix(targets, predictions[PREDICTIONS], labels=["False", "True"])
overall_stats["confusion_matrix"] = confusion_matrix.cm.tolist()
overall_stats["overall_stats"] = confusion_matrix.stats()
overall_stats["per_class_stats"] = confusion_matrix.per_class_stats()
fpr, tpr, thresholds = roc_curve(targets, predictions[PROBABILITIES])
overall_stats["roc_curve"] = {
"false_positive_rate": fpr.tolist(),
"true_positive_rate": tpr.tolist(),
}
overall_stats["roc_auc_macro"] = roc_auc_score(targets, predictions[PROBABILITIES], average="macro")
overall_stats["roc_auc_micro"] = roc_auc_score(targets, predictions[PROBABILITIES], average="micro")
ps, rs, thresholds = precision_recall_curve(targets, predictions[PROBABILITIES])
overall_stats["precision_recall_curve"] = {
"precisions": ps.tolist(),
"recalls": rs.tolist(),
}
overall_stats["average_precision_macro"] = average_precision_score(
targets, predictions[PROBABILITIES], average="macro"
)
overall_stats["average_precision_micro"] = average_precision_score(
targets, predictions[PROBABILITIES], average="micro"
)
overall_stats["average_precision_samples"] = average_precision_score(
targets, predictions[PROBABILITIES], average="samples"
)
return overall_stats
def postprocess_predictions(
self,
result,
metadata,
):
class_names = ["False", "True"]
if "bool2str" in metadata:
class_names = metadata["bool2str"]
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in result:
if "bool2str" in metadata:
result[predictions_col] = result[predictions_col].map(
lambda pred: metadata["bool2str"][pred],
)
probabilities_col = f"{self.feature_name}_{PROBABILITIES}"
if probabilities_col in result:
false_col = f"{probabilities_col}_{class_names[0]}"
true_col = f"{probabilities_col}_{class_names[1]}"
prob_col = f"{self.feature_name}_{PROBABILITY}"
result = result.assign(
**{
false_col: lambda x: 1 - x[probabilities_col],
true_col: lambda x: x[probabilities_col],
prob_col: np.where(
result[probabilities_col] > 0.5, result[probabilities_col], 1 - result[probabilities_col]
),
probabilities_col: result[probabilities_col].map(lambda x: [1 - x, x]),
},
)
return result
@staticmethod
def get_schema_cls():
return BinaryOutputFeatureConfig
@classmethod
def get_postproc_output_dtype(cls, metadata: TrainingSetMetadataDict) -> str:
return "string" if metadata.get("bool2str") else "int32"
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _BinaryPostprocessing(metadata)
def metric_kwargs(self) -> dict:
"""Returns arguments that are used to instantiate an instance of each metric class."""
return {"task": "binary"}
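# Illustrative sketch (not part of the Ludwig API): how postprocess_predictions above expands a
# single P(True) value into the per-class probability list and the confidence column. The function
# name expand_binary_probability and the returned keys are hypothetical and chosen for illustration.

```python
def expand_binary_probability(p_true: float) -> dict:
    """Expand P(True) into [P(False), P(True)] plus the confidence of the likelier class."""
    return {
        # Mirrors `probabilities_col: ... .map(lambda x: [1 - x, x])`.
        "probabilities": [1 - p_true, p_true],
        # Mirrors the np.where(...) branch: keep p if it exceeds 0.5, otherwise use 1 - p.
        "probability": p_true if p_true > 0.5 else 1 - p_true,
    }


if __name__ == "__main__":
    print(expand_binary_probability(0.75))
    print(expand_binary_probability(0.25))
```

# The confidence value is never below 0.5, since it always describes the more likely of the two classes.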
================================================
FILE: ludwig/features/category_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from typing import Any
import numpy as np
import torch
from ludwig.constants import (
CATEGORY,
CATEGORY_DISTRIBUTION,
COLUMN,
HIDDEN,
LOGITS,
NAME,
PREDICTIONS,
PREPROCESSING,
PROBABILITIES,
PROBABILITY,
PROC_COLUMN,
PROJECTION_INPUT,
)
from ludwig.error import InputDataError
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature, OutputFeature, PredictModule
from ludwig.features.vector_feature import VectorFeatureMixin
from ludwig.schema.features.category_feature import (
CategoryDistributionOutputFeatureConfig,
CategoryInputFeatureConfig,
CategoryOutputFeatureConfig,
)
from ludwig.schema.features.loss.loss import CORNLossConfig
from ludwig.types import (
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import calibration, output_feature_utils
from ludwig.utils.eval_utils import ConfusionMatrix
from ludwig.utils.math_utils import int_type, softmax
from ludwig.utils.strings_utils import create_vocabulary_single_token, UNKNOWN_SYMBOL
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class _CategoryPreprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.str2idx = metadata["str2idx"]
if UNKNOWN_SYMBOL in self.str2idx:
self.unk = self.str2idx[UNKNOWN_SYMBOL]
else:
# self.unk is set to 0 to comply with Torchscript type tracing and will
# likely not be used during training, but potentially during inference
self.unk = 0
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if not torch.jit.isinstance(v, list[str]):
raise ValueError(f"Unsupported input: {v}")
indices = [self.str2idx.get(s.strip(), self.unk) for s in v]
return torch.tensor(indices, dtype=torch.int32)
class _CategoryPostprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.idx2str = {i: v for i, v in enumerate(metadata["idx2str"])}
self.unk = UNKNOWN_SYMBOL
self.predictions_key = PREDICTIONS
self.probabilities_key = PROBABILITIES
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
probabilities = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.probabilities_key)
inv_preds = [self.idx2str.get(pred, self.unk) for pred in predictions]
return {
self.predictions_key: inv_preds,
self.probabilities_key: probabilities,
}
class _CategoryPredict(PredictModule):
def __init__(self, calibration_module=None, use_cumulative_probs=False):
super().__init__()
self.calibration_module = calibration_module
# Derive the label from the cumulative probability distribution of the ordered category logits.
# Taken from CORN loss implementation:
# https://github.com/Raschka-research-group/coral-pytorch/blob/main/coral_pytorch/dataset.py#L123
self.use_cumulative_probs = use_cumulative_probs
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
if self.use_cumulative_probs:
if self.calibration_module is not None:
probabilities = self.calibration_module(logits)
else:
probabilities = torch.sigmoid(logits)
probabilities = torch.cumprod(probabilities, dim=1)
predict_levels = probabilities > 0.5
predictions = torch.sum(predict_levels, dim=1)
else:
if self.calibration_module is not None:
probabilities = self.calibration_module(logits)
else:
probabilities = torch.softmax(logits, -1)
predictions = torch.argmax(probabilities, -1)
predictions = predictions.long()
# EXPECTED SHAPE OF RETURNED TENSORS
# predictions: [batch_size]
# probabilities: [batch_size, num_classes]
# logits: [batch_size, num_classes]
return {self.predictions_key: predictions, self.probabilities_key: probabilities, self.logits_key: logits}
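# Illustrative sketch of the cumulative-probability branch of _CategoryPredict.forward, written in
# plain Python for a single row of logits and without a calibration module. The helper name
# corn_decode is hypothetical; the real module operates on batched torch tensors.

```python
import math


def corn_decode(logits: list[float]) -> tuple[int, list[float]]:
    """Return (predicted_level, cumulative_probabilities) for one row of ordinal logits."""
    cumulative = []
    running = 1.0
    for logit in logits:
        # Sigmoid of each logit, followed by a running product (torch.cumprod in the module).
        running *= 1.0 / (1.0 + math.exp(-logit))
        cumulative.append(running)
    # The predicted label counts how many cumulative probabilities exceed 0.5.
    level = sum(p > 0.5 for p in cumulative)
    return level, cumulative


if __name__ == "__main__":
    level, probs = corn_decode([10.0, 10.0, -10.0])
    print(level, probs)
```

# Because the cumulative product can only shrink, the values above 0.5 always form a prefix, so the
# count is a valid ordinal level between 0 and len(logits).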
class CategoryFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return CATEGORY
@staticmethod
def cast_column(column, backend):
return column.astype(str)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
idx2str, str2idx, str2freq = create_vocabulary_single_token(
column,
num_most_frequent=preprocessing_parameters["most_common"],
processor=backend.df_engine,
)
if "vocab" in preprocessing_parameters and preprocessing_parameters["vocab"]: # Check that vocab is non-empty
# If vocab was explicitly provided, override the inferred vocab
idx2str = preprocessing_parameters["vocab"]
str2idx = {s: i for i, s in enumerate(idx2str)}
str2freq = {k: str2freq.get(k, 0) for k in idx2str}
if "fallback_label" in preprocessing_parameters:
# This is a category output feature for LLMs
# Check if the fallback label is in the vocab, if not add it.
if preprocessing_parameters["fallback_label"] not in str2idx:
str2idx[preprocessing_parameters["fallback_label"]] = len(str2idx)
idx2str.append(preprocessing_parameters["fallback_label"])
str2freq[preprocessing_parameters["fallback_label"]] = 0
vocab_size = len(str2idx)
if not is_input_feature and vocab_size <= 1:
# Category output feature with vocab size 1
raise InputDataError(
column.name,
CATEGORY,
f"""
At least 2 distinct values are required for category output features, but column
only contains {str(idx2str)}.
""",
)
if vocab_size <= 1:
# Category input feature with vocab size 1
logger.info(
f"Input feature '{column.name}' contains only 1 distinct value {str(idx2str)}. This is not useful"
" for machine learning models because this feature has zero variance. Consider removing this feature"
" from your input features."
)
return {"idx2str": idx2str, "str2idx": str2idx, "str2freq": str2freq, "vocab_size": vocab_size}
@staticmethod
def feature_data(backend, column, metadata):
def __replace_token_with_idx(value: Any, metadata: TrainingSetMetadataDict, fallback_symbol_idx: int) -> int:
stripped_value = value.strip()
if stripped_value in metadata["str2idx"]:
return metadata["str2idx"][stripped_value]
logger.warning(f"""
Encountered unknown symbol '{stripped_value}' for '{column.name}' during category
feature preprocessing. This should never happen during training. If this happens during
inference, this may be an indication that not all possible symbols were present in your
training set. Consider re-splitting your data to ensure full representation, or setting
preprocessing.most_common parameter to be smaller than this feature's total vocabulary
size, {len(metadata["str2idx"])}, which will ensure that the model is architected and
trained with an UNKNOWN symbol. Returning the index for the most frequent symbol,
{metadata["idx2str"][fallback_symbol_idx]}, instead.
""")
return fallback_symbol_idx
# No unknown symbol in Metadata from preprocessing means that all values
# should be mappable to vocabulary
if UNKNOWN_SYMBOL not in metadata["str2idx"]:
# If no unknown is defined, just use the most popular token's index as the fallback index
most_popular_token = max(metadata["str2freq"], key=metadata["str2freq"].get)
most_popular_token_idx = metadata["str2idx"].get(most_popular_token)
return backend.df_engine.map_objects(
column,
lambda x: __replace_token_with_idx(x, metadata, most_popular_token_idx),
meta=(column.name, int),
).astype(int_type(metadata["vocab_size"]))
else:
return backend.df_engine.map_objects(
column,
lambda x: (
metadata["str2idx"][x.strip()]
if x.strip() in metadata["str2idx"]
else metadata["str2idx"][UNKNOWN_SYMBOL]
),
meta=(column.name, int),
).astype(int_type(metadata["vocab_size"]))
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
proc_df[feature_config[PROC_COLUMN]] = CategoryFeatureMixin.feature_data(
backend,
input_df[feature_config[COLUMN]],
metadata[feature_config[NAME]],
)
return proc_df
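# Illustrative sketch of the fallback logic in CategoryFeatureMixin.feature_data, using plain lists
# instead of a dataframe engine. The function name encode_categories is hypothetical, and the
# default unknown_symbol value "<UNK>" is an assumption about what UNKNOWN_SYMBOL holds.

```python
def encode_categories(values, str2idx, str2freq, unknown_symbol="<UNK>"):
    """Map raw category strings to indices, falling back for unseen values."""
    if unknown_symbol in str2idx:
        # An unknown symbol exists in the vocabulary: unseen values map to it.
        fallback_idx = str2idx[unknown_symbol]
    else:
        # No unknown symbol: fall back to the index of the most frequent token.
        most_frequent = max(str2freq, key=str2freq.get)
        fallback_idx = str2idx[most_frequent]
    return [str2idx.get(value.strip(), fallback_idx) for value in values]


if __name__ == "__main__":
    vocab = {"cat": 0, "dog": 1}
    freq = {"cat": 5, "dog": 9}
    print(encode_categories([" cat", "bird"], vocab, freq))
```

# Unlike the original, this sketch does not log a warning when the most-frequent fallback is used.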
class CategoryDistributionFeatureMixin(VectorFeatureMixin):
@staticmethod
def type():
return CATEGORY_DISTRIBUTION
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
idx2str = preprocessing_parameters["vocab"]
str2idx = {s: i for i, s in enumerate(idx2str)}
return {
"preprocessing": preprocessing_parameters,
"idx2str": idx2str,
"str2idx": str2idx,
"vocab_size": len(idx2str),
}
class CategoryInputFeature(CategoryFeatureMixin, InputFeature):
def __init__(self, input_feature_config: CategoryInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in (torch.int8, torch.int16, torch.int32, torch.int64)
assert len(inputs.shape) == 1 or (len(inputs.shape) == 2 and inputs.shape[1] == 1)
inputs = inputs.reshape(-1, 1)
if inputs.dtype == torch.int8 or inputs.dtype == torch.int16:
inputs = inputs.type(torch.int)
encoder_output = self.encoder_obj(inputs)
return encoder_output
@property
def input_dtype(self):
return torch.int32
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@property
def output_shape(self) -> torch.Size:
return torch.Size(self.encoder_obj.output_shape)
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.vocab = feature_metadata["idx2str"]
feature_config.encoder.skip = feature_metadata[PREPROCESSING].get("cache_encoder_embeddings", False)
@staticmethod
def get_schema_cls():
return CategoryInputFeatureConfig
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _CategoryPreprocessing(metadata)
class CategoryOutputFeature(CategoryFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: CategoryOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
self.num_classes = output_feature_config.num_classes
self.top_k = output_feature_config.top_k
# TODO(travis): make this more general to other cumulative loss functions
self.use_cumulative_probs = isinstance(output_feature_config.loss, CORNLossConfig)
super().__init__(output_feature_config, output_features, **kwargs)
if hasattr(output_feature_config.decoder, "num_classes"):
output_feature_config.decoder.num_classes = output_feature_config.num_classes
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, **kwargs): # hidden
hidden = inputs[HIDDEN]
# EXPECTED SHAPES FOR RETURNED TENSORS
# logits: shape [batch_size, num_classes]
# hidden: shape [batch_size, size of final fully connected layer]
return {LOGITS: self.decoder_obj(hidden), PROJECTION_INPUT: hidden}
def create_calibration_module(self, feature: CategoryOutputFeatureConfig) -> torch.nn.Module:
"""Creates the appropriate calibration module based on the feature config.
Today, only one type of calibration ("temperature_scaling") is available, but more options may be supported in
the future.
"""
if feature.calibration:
calibration_cls = calibration.get_calibration_cls(CATEGORY, "temperature_scaling")
return calibration_cls(num_classes=self.num_classes)
return None
def create_predict_module(self) -> PredictModule:
return _CategoryPredict(
calibration_module=self.calibration_module, use_cumulative_probs=self.use_cumulative_probs
)
def get_prediction_set(self):
return {PREDICTIONS, PROBABILITIES, LOGITS}
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@classmethod
def get_output_dtype(cls):
return torch.int64
@property
def output_shape(self) -> torch.Size:
return torch.Size([1])
def metric_kwargs(self):
return {"top_k": self.top_k, "num_classes": self.num_classes}
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.num_classes = feature_metadata["vocab_size"]
feature_config.top_k = min(feature_config.num_classes, feature_config.top_k)
# If labels are provided, then this is a classification task for LLMs
if hasattr(feature_config.preprocessing, "vocab"):
# Enrich the feature config's decoder with str2idx
feature_config.decoder.str2idx = feature_metadata["str2idx"]
if isinstance(feature_config.loss.class_weights, (list, tuple)):
if len(feature_config.loss.class_weights) != feature_config.num_classes:
raise ValueError(
f"The length of class_weights ({len(feature_config.loss.class_weights)}) is not compatible with "
f"the number of classes ({feature_config.num_classes}) for feature {feature_config.column}. "
"Check the metadata JSON file to see the classes "
"and their order, and consider that there needs to be a weight "
"for the <UNK> class too."
)
if isinstance(feature_config.loss.class_weights, dict):
if feature_metadata["str2idx"].keys() != feature_config.loss.class_weights.keys():
raise ValueError(
f"The class_weights keys ({feature_config.loss.class_weights.keys()}) are not compatible with "
f'the classes ({feature_metadata["str2idx"].keys()}) of feature {feature_config.column}. '
"Check the metadata JSON file to see the classes, "
"and consider that there needs to be a weight "
"for the <UNK> class too."
)
else:
class_weights = feature_config.loss.class_weights
idx2str = feature_metadata["idx2str"]
class_weights_list = [class_weights[s] for s in idx2str]
feature_config.loss.class_weights = class_weights_list
if feature_config.loss.class_similarities_temperature > 0:
if feature_config.loss.class_similarities is not None:
similarities = feature_config.loss.class_similarities
temperature = feature_config.loss.class_similarities_temperature
curr_row = 0
first_row_length = 0
is_first_row = True
for row in similarities:
if is_first_row:
first_row_length = len(row)
is_first_row = False
curr_row += 1
else:
curr_row_length = len(row)
if curr_row_length != first_row_length:
raise ValueError(
"The length of row {} of the class_similarities "
"of {} is {}, different from the length of "
"the first row {}. All rows must have "
"the same length.".format(
curr_row, feature_config.column, curr_row_length, first_row_length
)
)
else:
curr_row += 1
all_rows_length = first_row_length
if all_rows_length != len(similarities):
raise ValueError(
"The class_similarities matrix of {} has "
"{} rows and {} columns, "
"their number must be identical.".format(
feature_config.column, len(similarities), all_rows_length
)
)
if all_rows_length != feature_config.num_classes:
raise ValueError(
f"The size of the class_similarities matrix of {feature_config.column} is "
f"{all_rows_length}, different from the number of classes ({feature_config.num_classes}). "
"Check the metadata JSON file to see the classes "
"and their order, and "
"consider the <UNK> class too."
)
similarities = np.array(similarities, dtype=np.float32)
for i in range(len(similarities)):
similarities[i, :] = softmax(similarities[i, :], temperature=temperature)
feature_config.loss.class_similarities = similarities
else:
raise ValueError(
"class_similarities_temperature > 0, "
"but no class_similarities are provided "
"for feature {}".format(feature_config.column)
)
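# Illustrative sketch of the class_similarities handling above: check that the matrix is square,
# then normalize each row with a temperature-scaled softmax. Both function names are hypothetical,
# and dividing the shifted scores by the temperature is an assumption about how
# ludwig.utils.math_utils.softmax applies its temperature argument.

```python
import math


def softmax_with_temperature(row, temperature=1.0):
    """Softmax over one row; a larger temperature gives a flatter distribution."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp((x - peak) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_similarities(matrix, temperature):
    """Validate that the matrix is square, then softmax-normalize every row."""
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("class_similarities must be a square matrix")
    return [softmax_with_temperature(row, temperature) for row in matrix]


if __name__ == "__main__":
    print(normalize_similarities([[1.0, 0.0], [0.0, 1.0]], temperature=1.0))
```

# The real method additionally checks the matrix size against num_classes and stores the result as
# a float32 NumPy array.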
@staticmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
overall_stats = {}
confusion_matrix = ConfusionMatrix(targets, predictions[PREDICTIONS], labels=train_set_metadata["idx2str"])
overall_stats["confusion_matrix"] = confusion_matrix.cm.tolist()
overall_stats["overall_stats"] = confusion_matrix.stats()
overall_stats["per_class_stats"] = confusion_matrix.per_class_stats()
return overall_stats
def postprocess_predictions(
self,
predictions,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in predictions:
if "idx2str" in metadata:
predictions[predictions_col] = predictions[predictions_col].map(lambda pred: metadata["idx2str"][pred])
probabilities_col = f"{self.feature_name}_{PROBABILITIES}"
if probabilities_col in predictions:
prob_col = f"{self.feature_name}_{PROBABILITY}"
predictions[prob_col] = predictions[probabilities_col].map(max)
predictions[probabilities_col] = predictions[probabilities_col].map(lambda pred: pred.tolist())
if "idx2str" in metadata:
for i, label in enumerate(metadata["idx2str"]):
key = f"{probabilities_col}_{label}"
# Use default param to force a capture before the loop completes, see:
# https://stackoverflow.com/questions/2295290/what-do-lambda-function-closures-capture
predictions[key] = predictions[probabilities_col].map(
lambda prob, i=i: prob[i],
)
top_k_col = f"{self.feature_name}_predictions_top_k"
if top_k_col in predictions:
if "idx2str" in metadata:
predictions[top_k_col] = predictions[top_k_col].map(
lambda pred_top_k: [metadata["idx2str"][pred] for pred in pred_top_k]
)
return predictions
@staticmethod
def get_schema_cls():
return CategoryOutputFeatureConfig
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _CategoryPostprocessing(metadata)
class CategoryDistributionOutputFeature(CategoryDistributionFeatureMixin, CategoryOutputFeature):
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@classmethod
def get_output_dtype(cls):
return torch.float32
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.num_classes])
@staticmethod
def get_schema_cls():
return CategoryDistributionOutputFeatureConfig
================================================
FILE: ludwig/features/date_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from datetime import date, datetime
import numpy as np
import torch
from ludwig.constants import COLUMN, DATE, PROC_COLUMN
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature
from ludwig.schema.features.date_feature import DateInputFeatureConfig
from ludwig.types import (
FeatureConfigDict,
FeatureMetadataDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils.date_utils import create_vector_from_datetime_obj, parse_datetime
from ludwig.utils.types import DataFrame, TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
DATE_VECTOR_LENGTH = 9
class _DatePreprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if torch.jit.isinstance(v, list[torch.Tensor]):
v = torch.stack(v)
if torch.jit.isinstance(v, torch.Tensor):
return v.to(dtype=torch.int)
else:
raise ValueError(f"Unsupported input: {v}")
class DateFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return DATE
@staticmethod
def cast_column(column, backend):
return column
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
return {"preprocessing": preprocessing_parameters}
@staticmethod
def date_to_list(date_value, datetime_format, preprocessing_parameters):
try:
if isinstance(date_value, datetime):
datetime_obj = date_value
elif isinstance(date_value, date):
datetime_obj = datetime.combine(date=date_value, time=datetime.min.time())
elif isinstance(date_value, str) and datetime_format is not None:
try:
datetime_obj = datetime.strptime(date_value, datetime_format)
except ValueError:
datetime_obj = parse_datetime(date_value)
else:
datetime_obj = parse_datetime(date_value)
except Exception as e:
logger.error(
f"Error parsing date: '{date_value}' with error '{e}'. "
"Please provide a datetime format that parses it "
"in the preprocessing section of the date feature "
"in the config. "
"The preprocessing fill value will be used. "
"For more details: "
"https://ludwig-ai.github.io/ludwig-docs/latest/configuration/features/date_features/#date-features-preprocessing" # noqa
)
fill_value = preprocessing_parameters["fill_value"]
if fill_value != "":
datetime_obj = parse_datetime(fill_value)
else:
datetime_obj = datetime.now()
return create_vector_from_datetime_obj(datetime_obj)
@staticmethod
def add_feature_data(
feature_config: FeatureConfigDict,
input_df: DataFrame,
proc_df: dict[str, DataFrame],
metadata: TrainingSetMetadataDict,
preprocessing_parameters: PreprocessingConfigDict,
backend, # Union[Backend, str]
skip_save_processed_input: bool,
) -> dict[str, DataFrame]:
datetime_format = preprocessing_parameters["datetime_format"]
proc_df[feature_config[PROC_COLUMN]] = backend.df_engine.map_objects(
input_df[feature_config[COLUMN]],
lambda x: np.array(
DateFeatureMixin.date_to_list(x, datetime_format, preprocessing_parameters), dtype=np.int32
),
)
return proc_df
class DateInputFeature(DateFeatureMixin, InputFeature):
def __init__(self, input_feature_config: DateInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor), type(inputs)
assert inputs.dtype in [torch.int16, torch.int32, torch.int64, torch.float32], inputs.dtype
inputs_encoded = self.encoder_obj(inputs)
return inputs_encoded
@property
def input_dtype(self):
return torch.int32
@property
def input_shape(self) -> torch.Size:
return torch.Size([DATE_VECTOR_LENGTH])
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
def create_sample_input(self, batch_size: int = 2):
date = [2013, 2, 26, 1, 57, 0, 0, 0, 0]
return torch.Tensor([date for _ in range(batch_size)]).type(torch.int32)
@staticmethod
def get_schema_cls():
return DateInputFeatureConfig
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _DatePreprocessing(metadata)
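# Illustrative sketch of a 9-element date vector matching DATE_VECTOR_LENGTH. The field order
# (year, month, day, weekday, day of year, hour, minute, second, second of day) is an assumption
# inferred from the sample in create_sample_input; date_to_vector is a hypothetical stand-in for
# create_vector_from_datetime_obj, whose implementation is not shown in this file.

```python
from datetime import datetime


def date_to_vector(dt: datetime) -> list[int]:
    """Encode a datetime as a fixed-length list of integers (assumed layout)."""
    day_of_year = dt.timetuple().tm_yday
    second_of_day = dt.hour * 3600 + dt.minute * 60 + dt.second
    return [
        dt.year,
        dt.month,
        dt.day,
        dt.weekday(),  # Monday is 0
        day_of_year,
        dt.hour,
        dt.minute,
        dt.second,
        second_of_day,
    ]


if __name__ == "__main__":
    print(date_to_vector(datetime(2013, 2, 26)))
```

# Under this assumed layout, midnight on 2013-02-26 reproduces the sample vector used by create_sample_input.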
================================================
FILE: ludwig/features/feature_registries.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from typing import Any, TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUDIO,
BAG,
BINARY,
CATEGORY,
CATEGORY_DISTRIBUTION,
DATE,
H3,
IMAGE,
NUMBER,
SEQUENCE,
SET,
TEXT,
TIMESERIES,
VECTOR,
)
from ludwig.features.audio_feature import AudioFeatureMixin, AudioInputFeature
from ludwig.features.bag_feature import BagFeatureMixin, BagInputFeature
from ludwig.features.binary_feature import BinaryFeatureMixin, BinaryInputFeature, BinaryOutputFeature
from ludwig.features.category_feature import (
CategoryDistributionFeatureMixin,
CategoryDistributionOutputFeature,
CategoryFeatureMixin,
CategoryInputFeature,
CategoryOutputFeature,
)
from ludwig.features.date_feature import DateFeatureMixin, DateInputFeature
from ludwig.features.h3_feature import H3FeatureMixin, H3InputFeature
from ludwig.features.image_feature import ImageFeatureMixin, ImageInputFeature, ImageOutputFeature
from ludwig.features.number_feature import NumberFeatureMixin, NumberInputFeature, NumberOutputFeature
from ludwig.features.sequence_feature import SequenceFeatureMixin, SequenceInputFeature, SequenceOutputFeature
from ludwig.features.set_feature import SetFeatureMixin, SetInputFeature, SetOutputFeature
from ludwig.features.text_feature import TextFeatureMixin, TextInputFeature, TextOutputFeature
from ludwig.features.timeseries_feature import TimeseriesFeatureMixin, TimeseriesInputFeature, TimeseriesOutputFeature
from ludwig.features.vector_feature import VectorFeatureMixin, VectorInputFeature, VectorOutputFeature
from ludwig.utils.misc_utils import get_from_registry
if TYPE_CHECKING:
from ludwig.models.base import BaseModel
from ludwig.schema.model_types.base import ModelConfig
@DeveloperAPI
def get_base_type_registry() -> dict:
return {
TEXT: TextFeatureMixin,
CATEGORY: CategoryFeatureMixin,
SET: SetFeatureMixin,
BAG: BagFeatureMixin,
BINARY: BinaryFeatureMixin,
NUMBER: NumberFeatureMixin,
SEQUENCE: SequenceFeatureMixin,
TIMESERIES: TimeseriesFeatureMixin,
IMAGE: ImageFeatureMixin,
AUDIO: AudioFeatureMixin,
H3: H3FeatureMixin,
DATE: DateFeatureMixin,
VECTOR: VectorFeatureMixin,
CATEGORY_DISTRIBUTION: CategoryDistributionFeatureMixin,
}
@DeveloperAPI
def get_input_type_registry() -> dict:
return {
TEXT: TextInputFeature,
NUMBER: NumberInputFeature,
BINARY: BinaryInputFeature,
CATEGORY: CategoryInputFeature,
SET: SetInputFeature,
SEQUENCE: SequenceInputFeature,
IMAGE: ImageInputFeature,
AUDIO: AudioInputFeature,
TIMESERIES: TimeseriesInputFeature,
BAG: BagInputFeature,
H3: H3InputFeature,
DATE: DateInputFeature,
VECTOR: VectorInputFeature,
}
@DeveloperAPI
def get_output_type_registry() -> dict:
return {
CATEGORY: CategoryOutputFeature,
BINARY: BinaryOutputFeature,
NUMBER: NumberOutputFeature,
SEQUENCE: SequenceOutputFeature,
SET: SetOutputFeature,
TEXT: TextOutputFeature,
TIMESERIES: TimeseriesOutputFeature,
VECTOR: VectorOutputFeature,
CATEGORY_DISTRIBUTION: CategoryDistributionOutputFeature,
IMAGE: ImageOutputFeature,
}
def update_config_with_metadata(config_obj: "ModelConfig", training_set_metadata: dict[str, Any]):
# populate input features fields depending on data
for input_feature in config_obj.input_features:
feature = get_from_registry(input_feature.type, get_input_type_registry())
feature.update_config_with_metadata(input_feature, training_set_metadata[input_feature.name])
# populate output features fields depending on data
for output_feature in config_obj.output_features:
feature = get_from_registry(output_feature.type, get_output_type_registry())
feature.update_config_with_metadata(output_feature, training_set_metadata[output_feature.name])
def update_config_with_model(config_obj: "ModelConfig", model: "BaseModel"):
"""Updates the config with the final input feature params given a model.
This function should only be called to update the config after the model is initialized. Currently only implemented
for input features because it is only relevant for HuggingFace text encoders. HuggingFace text encoders only know
their final config after class initialization.
"""
for input_feature in config_obj.input_features:
model_input_feature = model.input_features.get(input_feature.name)
model_input_feature.update_config_after_module_init(input_feature)
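# Illustrative sketch of the registry pattern used in this module: a dict maps a feature type
# string to a class, and a lookup helper resolves it. Every name below (lookup_feature_class,
# DemoCategoryInput, DemoNumberInput, DEMO_INPUT_REGISTRY) is hypothetical; the real lookup goes
# through ludwig.utils.misc_utils.get_from_registry, whose exact error behavior is assumed here.

```python
class DemoCategoryInput:
    """Placeholder standing in for a category input feature class."""


class DemoNumberInput:
    """Placeholder standing in for a number input feature class."""


DEMO_INPUT_REGISTRY = {
    "category": DemoCategoryInput,
    "number": DemoNumberInput,
}


def lookup_feature_class(feature_type: str, registry: dict) -> type:
    """Resolve a feature type to its class, failing loudly on unknown types."""
    if feature_type not in registry:
        raise ValueError(f"Unknown feature type '{feature_type}'. Known types: {sorted(registry)}")
    return registry[feature_type]


if __name__ == "__main__":
    print(lookup_feature_class("category", DEMO_INPUT_REGISTRY).__name__)
```

# Building the registries inside functions, as get_input_type_registry and its siblings do, returns
# a fresh dict on each call, so callers cannot mutate shared state by accident.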
================================================
FILE: ludwig/features/feature_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import re
import numpy as np
import torch
from ludwig.constants import NAME, PREPROCESSING, SEQUENCE, TEXT, TIMESERIES, TYPE
from ludwig.utils.data_utils import hash_dict
from ludwig.utils.strings_utils import get_tokenizer_from_registry, UNKNOWN_SYMBOL
SEQUENCE_TYPES = {SEQUENCE, TEXT, TIMESERIES}
FEATURE_NAME_SUFFIX = "__ludwig"
FEATURE_NAME_SUFFIX_LENGTH = len(FEATURE_NAME_SUFFIX)
def should_regularize(regularize_layers):
regularize = False
if isinstance(regularize_layers, bool) and regularize_layers:
regularize = True
elif isinstance(regularize_layers, (list, tuple)) and regularize_layers and regularize_layers[-1]:
regularize = True
return regularize
def set_str_to_idx(set_string, feature_dict, tokenizer_name):
try:
tokenizer = get_tokenizer_from_registry(tokenizer_name)()
except ValueError:
raise Exception(f"Tokenizer {tokenizer_name} not supported")
out = [feature_dict.get(item, feature_dict[UNKNOWN_SYMBOL]) for item in tokenizer(set_string)]
return np.array(out, dtype=np.int32)
def compute_token_probabilities(
probabilities: list | tuple | np.ndarray,
) -> np.ndarray:
"""Gets the maximum probability per timestep.
Args:
probabilities: An iterable of iterables or np.ndarray with shape (sequence_length, num_classes)
where each inner iterable or np.ndarray is the probability distribution for a single timestep.
Returns:
An np.ndarray with shape (sequence_length,) containing the maximum probability for each timestep.
"""
if isinstance(probabilities, (list, tuple)):
if not hasattr(probabilities[0], "__len__"):
raise ValueError(
"Received token probabilities as a flat 1D list. Expected a list of lists of probabilities "
"(sequence_length, vocab_size)."
)
max_probs = []
for timestep_probs in probabilities:
max_probs.append(np.max(timestep_probs))
max_probs = np.array(max_probs)
elif isinstance(probabilities, np.ndarray):
if len(probabilities.shape) != 2:
raise ValueError(
f"Received token probabilities with non 2D shape: {probabilities.shape}. Expected shape: "
"(sequence_length, vocab_size)."
)
max_probs = np.max(probabilities, axis=-1)
else:
raise ValueError(f"probabilities type must be in [list, tuple, np.ndarray]. Got {type(probabilities)}")
return max_probs
def compute_sequence_probability(
sequence_probabilities: np.ndarray,
max_sequence_length: int | None = None,
return_log_prob: bool = True,
) -> float:
"""Computes the sequence level probability.
Args:
sequence_probabilities: An np.ndarray with shape (sequence_length,) holding one probability per timestep.
max_sequence_length: The maximum sequence length to use. If None, uses the first dim of `sequence_probabilities`.
return_log_prob: Whether to return the log probability. Defaults to True.
Returns:
The sum of the clipped log probabilities if `return_log_prob` is True, otherwise the product of the probabilities.
"""
if max_sequence_length is None:
max_sequence_length = sequence_probabilities.shape[0]
sequence_probabilities = sequence_probabilities[:max_sequence_length]
if return_log_prob:
return np.sum(np.log(np.clip(sequence_probabilities, 1e-10, 1.0)))
else:
return np.prod(sequence_probabilities)
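The two helpers above can be illustrated without NumPy. The following is a minimal pure-Python sketch, not the library code: `max_prob_per_timestep` and `sequence_log_prob` are hypothetical names, and the clipping bound of 1e-10 is copied from `compute_sequence_probability`.

```python
import math


def max_prob_per_timestep(probabilities):
    """Hypothetical stand-in: keep the largest class probability at each timestep."""
    return [max(timestep) for timestep in probabilities]


def sequence_log_prob(token_probs, max_sequence_length=None, eps=1e-10):
    """Hypothetical stand-in: sum of log probabilities after clipping to [eps, 1.0]."""
    if max_sequence_length is None:
        max_sequence_length = len(token_probs)
    clipped = [min(max(p, eps), 1.0) for p in token_probs[:max_sequence_length]]
    return sum(math.log(p) for p in clipped)


# Three timesteps, two classes each.
probs = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
token_probs = max_prob_per_timestep(probs)
log_prob = sequence_log_prob(token_probs)
product = math.prod(token_probs)
```

Exponentiating `log_prob` gives the same value as `product` up to floating-point error. The log form is the default presumably because a product of many small probabilities underflows, while clipping keeps a zero probability from producing `-inf`.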
def sanitize(name):
"""Replaces invalid id characters."""
return re.sub("\\W|^(?=\\d)", "_", name)
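As a quick illustration of what the regular expression in `sanitize` does, the sketch below copies the function and applies it to a few made-up feature names. Every non-word character becomes an underscore, and a name that starts with a digit gets an underscore prepended.

```python
import re


def sanitize(name):
    """Copy of the helper above: replaces invalid id characters."""
    return re.sub("\\W|^(?=\\d)", "_", name)


# The example names are invented for illustration.
examples = {name: sanitize(name) for name in ["my feature-1", "1st_column", "a.b"]}
```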
def compute_feature_hash(feature: dict) -> str:
"""This function computes a hash for each feature based on the preprocessing dictionary associated with each
feature, as well as the feature's type.
Args:
feature: Feature dictionary
Returns: Feature hash name
"""
feature_data = dict(
preprocessing=feature.get(PREPROCESSING, {}),
type=feature[TYPE],
)
return sanitize(feature[NAME]) + "_" + hash_dict(feature_data).decode("ascii")
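The sketch below shows the idea behind `compute_feature_hash` under stated assumptions. `hash_dict_stand_in` is a hypothetical replacement for `ludwig.utils.data_utils.hash_dict`, whose implementation is not shown in this file, and the string keys `"name"`, `"type"` and `"preprocessing"` are assumed values for the `NAME`, `TYPE` and `PREPROCESSING` constants.

```python
import hashlib
import json
import re


def sanitize(name):
    return re.sub("\\W|^(?=\\d)", "_", name)


def hash_dict_stand_in(d):
    """Hypothetical stand-in for hash_dict: a deterministic digest of a JSON dump."""
    serialized = json.dumps(d, sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


def feature_hash(feature):
    # Only the preprocessing parameters and the feature type enter the hash.
    feature_data = {
        "preprocessing": feature.get("preprocessing", {}),
        "type": feature["type"],
    }
    return sanitize(feature["name"]) + "_" + hash_dict_stand_in(feature_data)


base = {"name": "Age (years)", "type": "number"}
same = {"name": "Age (years)", "type": "number", "preprocessing": {}}
changed = {"name": "Age (years)", "type": "number", "preprocessing": {"normalization": "zscore"}}

hash_base = feature_hash(base)
hash_same = feature_hash(same)
hash_changed = feature_hash(changed)
```

Here `hash_base` and `hash_same` are identical because a missing preprocessing section defaults to an empty dict, while `hash_changed` differs because its preprocessing parameters differ.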
def get_input_size_with_dependencies(
combiner_output_size: int, dependencies: list[str], other_output_features # Dict[str, "OutputFeature"]
):
"""Returns the input size for the first layer of this output feature's FC stack, accounting for dependencies on
other output features.
In the forward pass, the hidden states of any dependent output features get concatenated with the combiner's output.
If this output feature depends on other output features, then the input size for this feature's FCStack is the sum
of the output sizes of other output features + the combiner's output size.
"""
input_size_with_dependencies = combiner_output_size
for feature_name in dependencies:
if other_output_features[feature_name].fc_stack.num_layers:
input_size_with_dependencies += other_output_features[feature_name].fc_stack.output_shape[-1]
else:
# 0-layer FCStack. Use the output feature's input size.
input_size_with_dependencies += other_output_features[feature_name].input_size
return input_size_with_dependencies
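A small worked example of the size arithmetic described in the docstring above. `FakeFCStack` and `FakeOutputFeature` are hypothetical stand-ins that expose only the attributes the function reads. They are not Ludwig classes.

```python
from dataclasses import dataclass


@dataclass
class FakeFCStack:
    """Hypothetical stand-in exposing num_layers and output_shape."""

    num_layers: int
    output_size: int

    @property
    def output_shape(self):
        return (self.output_size,)


@dataclass
class FakeOutputFeature:
    """Hypothetical stand-in exposing fc_stack and input_size."""

    fc_stack: FakeFCStack
    input_size: int


def input_size_with_dependencies(combiner_output_size, dependencies, other_output_features):
    total = combiner_output_size
    for name in dependencies:
        dep = other_output_features[name]
        if dep.fc_stack.num_layers:
            total += dep.fc_stack.output_shape[-1]
        else:
            # 0-layer FC stack: fall back to the dependency's own input size.
            total += dep.input_size
    return total


features = {
    "a": FakeOutputFeature(FakeFCStack(num_layers=2, output_size=64), input_size=128),
    "b": FakeOutputFeature(FakeFCStack(num_layers=0, output_size=0), input_size=32),
}
size = input_size_with_dependencies(256, ["a", "b"], features)  # 256 + 64 + 32
```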
def get_module_dict_key_from_name(name: str, feature_name_suffix: str = FEATURE_NAME_SUFFIX) -> str:
"""Returns a key that's guaranteed to be compatible with torch."""
key = name.replace(".", "__ludwig_punct_period__")
return key + feature_name_suffix
def get_name_from_module_dict_key(key: str, feature_name_suffix_length: int = FEATURE_NAME_SUFFIX_LENGTH) -> str:
"""Reverse of get_module_dict_key_from_name."""
name = key.replace("__ludwig_punct_period__", ".")
return name[:-feature_name_suffix_length]
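The key-suffixing round trip can be checked in isolation. The sketch below copies the two helpers under the shorter, hypothetical names `to_key` and `from_key`.

```python
FEATURE_NAME_SUFFIX = "__ludwig"


def to_key(name):
    # Periods are not allowed in torch module names, so they are escaped first.
    return name.replace(".", "__ludwig_punct_period__") + FEATURE_NAME_SUFFIX


def from_key(key):
    name = key.replace("__ludwig_punct_period__", ".")
    return name[: -len(FEATURE_NAME_SUFFIX)]


key_for_type = to_key("type")
key_for_dotted = to_key("a.b")
round_trip = from_key(key_for_dotted)
```

The suffix keeps a feature named `type` from clashing with an attribute of the module container, and the period escape is undone before the suffix is stripped.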
class LudwigFeatureDict(torch.nn.Module):
"""Torch ModuleDict wrapper that permits keys with any name.
Torch's ModuleDict implementation doesn't allow certain keys to be used if they conflict with existing class
attributes, e.g.
> torch.nn.ModuleDict({'type': torch.nn.Module()}) # Raises KeyError.
This class is a simple wrapper around torch's ModuleDict that mitigates possible conflicts by using a key-suffixing
protocol.
This is also tracked in PyTorch: https://github.com/pytorch/pytorch/issues/71203.
"""
def __init__(self):
super().__init__()
self.module_dict = torch.nn.ModuleDict()
self.internal_key_to_original_name_map = {}
def get(self, key) -> torch.nn.Module:
return self.module_dict[get_module_dict_key_from_name(key)]
def set(self, key: str, module: torch.nn.Module) -> None:
module_dict_key_name = get_module_dict_key_from_name(key)
self.internal_key_to_original_name_map[module_dict_key_name] = key
self.module_dict[module_dict_key_name] = module
def __len__(self) -> int:
return len(self.module_dict)
def __next__(self) -> str:
return next(iter(self))
def __iter__(self):
return iter(self.keys())
def keys(self) -> list[str]:
return [
get_name_from_module_dict_key(feature_name)
for feature_name in self.internal_key_to_original_name_map.keys()
]
def values(self) -> list[torch.nn.Module]:
return [module for _, module in self.module_dict.items()]
def items(self) -> list[tuple[str, torch.nn.Module]]:
return [
(get_name_from_module_dict_key(feature_name), module) for feature_name, module in self.module_dict.items()
]
def update(self, modules: dict[str, torch.nn.Module]) -> None:
for feature_name, module in modules.items():
self.set(feature_name, module)
================================================
FILE: ludwig/features/h3_feature.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import numpy as np
import torch
from ludwig.constants import COLUMN, H3, PROC_COLUMN
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature
from ludwig.schema.features.h3_feature import H3InputFeatureConfig
from ludwig.types import FeatureMetadataDict, ModelConfigDict, PreprocessingConfigDict, TrainingSetMetadataDict
from ludwig.utils.h3_util import h3_to_components
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
MAX_H3_RESOLUTION = 15
H3_VECTOR_LENGTH = MAX_H3_RESOLUTION + 4
H3_PADDING_VALUE = 7
class _H3Preprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.max_h3_resolution = MAX_H3_RESOLUTION
self.h3_padding_value = H3_PADDING_VALUE
self.computed_fill_value = float(metadata["preprocessing"]["computed_fill_value"])
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if torch.jit.isinstance(v, list[torch.Tensor]):
v = torch.stack(v)
if not torch.jit.isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")
v = torch.nan_to_num(v, nan=self.computed_fill_value)
v = v.long()
outputs: list[torch.Tensor] = []
for v_i in v:
components = h3_to_components(v_i)
header: list[int] = [
components.mode,
components.edge,
components.resolution,
components.base_cell,
]
cells_padding: list[int] = [self.h3_padding_value] * (self.max_h3_resolution - len(components.cells))
output = torch.tensor(header + components.cells + cells_padding, dtype=torch.uint8, device=v.device)
outputs.append(output)
return torch.stack(outputs)
class H3FeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return H3
@staticmethod
def cast_column(column, backend):
try:
return column.astype(int)
except ValueError:
logger.warning("H3Feature could not be read as int directly. Reading as float and converting to int.")
return column.astype(float).astype(int)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
return {}
@staticmethod
def h3_to_list(h3_int):
components = h3_to_components(h3_int)
header = [components.mode, components.edge, components.resolution, components.base_cell]
cells_padding = [H3_PADDING_VALUE] * (MAX_H3_RESOLUTION - len(components.cells))
return header + components.cells + cells_padding
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
column = input_df[feature_config[COLUMN]]
if column.dtype == object:
column = backend.df_engine.map_objects(column, int)
column = backend.df_engine.map_objects(column, H3FeatureMixin.h3_to_list)
proc_df[feature_config[PROC_COLUMN]] = backend.df_engine.map_objects(
column, lambda x: np.array(x, dtype=np.uint8)
)
return proc_df
class H3InputFeature(H3FeatureMixin, InputFeature):
def __init__(self, input_feature_config: H3InputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.uint8, torch.int64]
assert len(inputs.shape) == 2
inputs_encoded = self.encoder_obj(inputs)
return inputs_encoded
@property
def input_dtype(self):
return torch.uint8
@property
def input_shape(self) -> torch.Size:
return torch.Size([H3_VECTOR_LENGTH])
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _H3Preprocessing(metadata)
@staticmethod
def get_schema_cls():
return H3InputFeatureConfig
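To make the 19-element vector produced by `h3_to_list` concrete, here is a minimal sketch that decodes an H3 index with plain bit operations. It assumes the standard 64-bit H3 layout (4 mode bits, 3 mode-dependent bits, 4 resolution bits, 7 base-cell bits, then fifteen 3-bit digits). `h3_to_components` in `ludwig.utils.h3_util` is not shown in this file, so `h3_to_list_sketch` is a hypothetical equivalent rather than the library's code.

```python
MAX_H3_RESOLUTION = 15
H3_PADDING_VALUE = 7


def h3_to_list_sketch(h3_int):
    """Hypothetical decoder: header (4 values) + cell digits + padding to 15 digits."""
    mode = (h3_int >> 59) & 0xF
    edge = (h3_int >> 56) & 0x7
    resolution = (h3_int >> 52) & 0xF
    base_cell = (h3_int >> 45) & 0x7F
    # Digit i (1-indexed) sits 3 * (15 - i) bits above the least significant bit.
    cells = [(h3_int >> (3 * (MAX_H3_RESOLUTION - i))) & 0x7 for i in range(1, resolution + 1)]
    padding = [H3_PADDING_VALUE] * (MAX_H3_RESOLUTION - len(cells))
    return [mode, edge, resolution, base_cell] + cells + padding


# A resolution-9 cell index, written in hexadecimal.
vector = h3_to_list_sketch(0x8928308280FFFFF)
```

The resulting list has `MAX_H3_RESOLUTION + 4` entries, which matches `H3_VECTOR_LENGTH`. Unused digit slots in an H3 index are already 7, which is consistent with the padding value used above.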
================================================
FILE: ludwig/features/image_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import os
import warnings
from collections import Counter
from collections.abc import Callable
from dataclasses import dataclass
from functools import partial
from typing import Any
import numpy as np
import torch
from torchvision import transforms
from torchvision.transforms import functional as F
from torchvision.transforms.functional import normalize
from ludwig.constants import (
CHECKSUM,
COLUMN,
ENCODER,
HEIGHT,
IMAGE,
IMAGENET1K,
INFER_IMAGE_DIMENSIONS,
INFER_IMAGE_MAX_HEIGHT,
INFER_IMAGE_MAX_WIDTH,
INFER_IMAGE_NUM_CLASSES,
INFER_IMAGE_SAMPLE_SIZE,
LOGITS,
NAME,
NUM_CHANNELS,
PREDICTIONS,
PREPROCESSING,
PROC_COLUMN,
REQUIRES_EQUAL_DIMENSIONS,
SRC,
TRAINING,
TYPE,
WIDTH,
)
from ludwig.data.cache.types import wrap
from ludwig.encoders.base import Encoder
from ludwig.encoders.image.torchvision import TVModelVariant
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature, OutputFeature, PredictModule
from ludwig.schema.features.augmentation.base import BaseAugmentationConfig
from ludwig.schema.features.augmentation.image import (
AutoAugmentationConfig,
RandomBlurConfig,
RandomBrightnessConfig,
RandomContrastConfig,
RandomHorizontalFlipConfig,
RandomRotateConfig,
RandomVerticalFlipConfig,
)
from ludwig.schema.features.image_feature import ImageInputFeatureConfig, ImageOutputFeatureConfig
from ludwig.types import (
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import output_feature_utils
from ludwig.utils.augmentation_utils import get_augmentation_op, register_augmentation_op
from ludwig.utils.data_utils import get_abs_path
from ludwig.utils.dataframe_utils import is_dask_series_or_df
from ludwig.utils.fs_utils import has_remote_protocol, upload_h5
from ludwig.utils.image_utils import (
get_class_mask_from_image,
get_gray_default_image,
get_image_from_class_mask,
get_unique_channels,
grayscale,
num_channels_in_image,
read_image_from_bytes_obj,
read_image_from_path,
resize_image,
ResizeChannels,
torchvision_model_registry,
)
from ludwig.utils.misc_utils import set_default_value
from ludwig.utils.types import Series, TorchscriptPreprocessingInput
# constants used for Ludwig image preprocessing
IMAGENET1K_MEAN = [0.485, 0.456, 0.406]
IMAGENET1K_STD = [0.229, 0.224, 0.225]
logger = logging.getLogger(__name__)
###
# Image specific augmentation operations
###
@register_augmentation_op(name="auto_augmentation", features=IMAGE)
class AutoAugment(torch.nn.Module):
def __init__(self, config: AutoAugmentationConfig):
super().__init__()
self.auto_augmentation_method = config.method
self.augmentation_method = self.get_augmentation_method()
def get_augmentation_method(self):
if self.auto_augmentation_method == "trivial_augment":
return transforms.TrivialAugmentWide()
if self.auto_augmentation_method == "auto_augment":
return transforms.AutoAugment()
if self.auto_augmentation_method == "rand_augment":
return transforms.RandAugment()
raise ValueError(f"Unsupported auto-augmentation method: {self.auto_augmentation_method}")
def forward(self, imgs: torch.Tensor) -> torch.Tensor:
method = self.augmentation_method
uint8imgs = imgs.to(torch.uint8)
augmented_imgs = method(uint8imgs)
return augmented_imgs.to(torch.float32)
@register_augmentation_op(name="random_vertical_flip", features=IMAGE)
class RandomVFlip(torch.nn.Module):
def __init__(
self,
config: RandomVerticalFlipConfig,
):
super().__init__()
def forward(self, imgs):
if torch.rand(1) < 0.5:
imgs = F.vflip(imgs)
return imgs
@register_augmentation_op(name="random_horizontal_flip", features=IMAGE)
class RandomHFlip(torch.nn.Module):
def __init__(
self,
config: RandomHorizontalFlipConfig,
):
super().__init__()
def forward(self, imgs):
if torch.rand(1) < 0.5:
imgs = F.hflip(imgs)
return imgs
@register_augmentation_op(name="random_rotate", features=IMAGE)
class RandomRotate(torch.nn.Module):
def __init__(self, config: RandomRotateConfig):
super().__init__()
self.degree = config.degree
def forward(self, imgs):
if torch.rand(1) < 0.5:
# map angle to interval (-degree, +degree)
angle = (torch.rand(1) * 2 * self.degree - self.degree).item()
return F.rotate(imgs, angle)
else:
return imgs
@register_augmentation_op(name="random_contrast", features=IMAGE)
class RandomContrast(torch.nn.Module):
def __init__(self, config: RandomContrastConfig):
super().__init__()
self.min_contrast = config.min
self.contrast_adjustment_range = config.max - config.min
def forward(self, imgs):
if torch.rand(1) < 0.5:
# random contrast adjustment
adjust_factor = (torch.rand(1) * self.contrast_adjustment_range + self.min_contrast).item()
return F.adjust_contrast(imgs, adjust_factor)
else:
return imgs
@register_augmentation_op(name="random_brightness", features=IMAGE)
class RandomBrightness(torch.nn.Module):
def __init__(self, config: RandomBrightnessConfig):
super().__init__()
self.min_brightness = config.min
self.brightness_adjustment_range = config.max - config.min
def forward(self, imgs):
if torch.rand(1) < 0.5:
# random brightness adjustment
adjust_factor = (torch.rand(1) * self.brightness_adjustment_range + self.min_brightness).item()
return F.adjust_brightness(imgs, adjust_factor)
else:
return imgs
@register_augmentation_op(name="random_blur", features=IMAGE)
class RandomBlur(torch.nn.Module):
def __init__(self, config: RandomBlurConfig):
super().__init__()
self.kernel_size = [config.kernel_size, config.kernel_size]
def forward(self, imgs):
if torch.rand(1) < 0.5:
imgs = F.gaussian_blur(imgs, self.kernel_size)
return imgs
class ImageAugmentation(torch.nn.Module):
def __init__(
self,
augmentation_list: list[BaseAugmentationConfig],
normalize_mean: list[float] | None = None,
normalize_std: list[float] | None = None,
):
super().__init__()
logger.debug(f"Creating augmentation pipeline: {augmentation_list}")
self.normalize_mean = normalize_mean
self.normalize_std = normalize_std
if self.training:
self.augmentation_steps = torch.nn.Sequential()
for aug_config in augmentation_list:
try:
aug_op = get_augmentation_op(IMAGE, aug_config.type)
self.augmentation_steps.append(aug_op(aug_config))
except KeyError:
raise ValueError(f"Invalid augmentation operation specification: {aug_config}")
else:
# TODO: should this raise an exception if not in training mode?
self.augmentation_steps = None
def forward(self, imgs):
if self.augmentation_steps:
# convert from float to uint8 values - this is required for the augmentation
imgs = self._convert_back_to_uint8(imgs)
logger.debug("Executing augmentation pipeline steps: %s", self.augmentation_steps)
imgs = self.augmentation_steps(imgs)
# convert back to float32 values and renormalize if needed
imgs = self._renormalize_image(imgs)
return imgs
# function to partially undo the TorchVision ImageClassification transformation.
# back out the normalization step and convert from float32 to uint8 dtype
# to make the tensor displayable as an image
# crop size remains the same
def _convert_back_to_uint8(self, images):
if self.normalize_mean:
mean = torch.as_tensor(self.normalize_mean, dtype=torch.float32).view(-1, 1, 1)
std = torch.as_tensor(self.normalize_std, dtype=torch.float32).view(-1, 1, 1)
return images.mul(std).add(mean).mul(255.0).type(torch.uint8)
else:
return images.mul(255.0).type(torch.uint8)
# function to redo part of the TorchVision ImageClassification transformation.
# convert uint8 to float32
# apply the imagenet1k normalization
def _renormalize_image(self, images):
if self.normalize_mean:
mean = torch.as_tensor(self.normalize_mean, dtype=torch.float32).view(-1, 1, 1)
std = torch.as_tensor(self.normalize_std, dtype=torch.float32).view(-1, 1, 1)
return images.type(torch.float32).div(255.0).sub(mean).div(std)
else:
return images.type(torch.float32).div(255.0)
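The two helpers above are inverses of each other up to integer truncation. The following scalar sketch reproduces the arithmetic for a single pixel value without torch. The function names are hypothetical, and `int()` is used as a stand-in for the cast to `torch.uint8`, which is assumed to truncate toward zero for in-range values.

```python
IMAGENET1K_MEAN = [0.485, 0.456, 0.406]
IMAGENET1K_STD = [0.229, 0.224, 0.225]


def renormalize(pixel, mean, std):
    """uint8 pixel -> normalized float, mirroring _renormalize_image."""
    return (pixel / 255.0 - mean) / std


def convert_back_to_uint8(value, mean, std):
    """Normalized float -> integer pixel, mirroring _convert_back_to_uint8."""
    return int((value * std + mean) * 255.0)


# Round-trip every pixel value for every channel and record the largest error.
max_error = 0
for mean, std in zip(IMAGENET1K_MEAN, IMAGENET1K_STD):
    for pixel in range(256):
        recovered = convert_back_to_uint8(renormalize(pixel, mean, std), mean, std)
        max_error = max(max_error, abs(recovered - pixel))
```

Because the conversion truncates rather than rounds, a value that lands just below an integer after the floating-point round trip comes back one level darker, so the recovered pixel can differ from the original by at most one.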
@dataclass
class ImageTransformMetadata:
height: int
width: int
num_channels: int
def _get_torchvision_transform(
torchvision_parameters: TVModelVariant,
) -> tuple[torch.nn.Module, ImageTransformMetadata]:
"""Returns a torchvision transform that is compatible with the model variant.
Note that the raw torchvision transform is not returned. Instead, a Sequential module that includes
image resizing is returned. This is because the raw torchvision transform assumes that the input image has
three channels, which is not always the case with images input into Ludwig.
Args:
torchvision_parameters: The parameters for the torchvision model variant.
Returns:
(torchvision_transform, transform_metadata): A torchvision transform and the metadata for the transform.
"""
torchvision_transform_raw = torchvision_parameters.model_weights.DEFAULT.transforms()
torchvision_transform = torch.nn.Sequential(
ResizeChannels(num_channels=3),
torchvision_transform_raw,
)
transform_metadata = ImageTransformMetadata(
height=torchvision_transform_raw.crop_size[0],
width=torchvision_transform_raw.crop_size[0],
num_channels=len(torchvision_transform_raw.mean),
)
return (torchvision_transform, transform_metadata)
def _get_torchvision_parameters(model_type: str, model_variant: str) -> TVModelVariant:
return torchvision_model_registry.get(model_type).get(model_variant)
def is_torchvision_encoder(encoder_obj: Encoder) -> bool:
# TODO(travis): do this through an interface rather than conditional logic
from ludwig.encoders.image.torchvision import TVBaseEncoder
return isinstance(encoder_obj, TVBaseEncoder)
class _ImagePreprocessing(torch.nn.Module):
"""Torchscript-enabled version of preprocessing done by ImageFeatureMixin.add_feature_data."""
def __init__(
self,
metadata: TrainingSetMetadataDict,
torchvision_transform: torch.nn.Module | None = None,
transform_metadata: ImageTransformMetadata | None = None,
):
super().__init__()
self.resize_method = metadata["preprocessing"]["resize_method"]
self.torchvision_transform = torchvision_transform
if transform_metadata is not None:
self.height = transform_metadata.height
self.width = transform_metadata.width
self.num_channels = transform_metadata.num_channels
self.channel_class_map = torch.Tensor([])
else:
self.height = metadata["preprocessing"]["height"]
self.width = metadata["preprocessing"]["width"]
self.num_channels = metadata["preprocessing"]["num_channels"]
self.channel_class_map = torch.ByteTensor(metadata["preprocessing"]["channel_class_map"])
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
"""Takes a list of images and adjusts the size and number of channels as specified in the metadata.
If `v` is already a torch.Tensor, we assume that the images are already preprocessed to be the same size.
"""
# Nested conditional is a workaround to short-circuit boolean evaluation.
if not torch.jit.isinstance(v, list[torch.Tensor]):
if not torch.jit.isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")
if self.torchvision_transform is not None:
# perform pre-processing for torchvision pretrained model encoders
if torch.jit.isinstance(v, list[torch.Tensor]):
imgs = [self.torchvision_transform(img) for img in v]
else:
# convert batch of image tensors to a list and then run torchvision pretrained
# model transforms on each image
imgs = [self.torchvision_transform(img) for img in torch.unbind(v)]
# collect the list of images into a batch
imgs_stacked = torch.stack(imgs)
else:
# perform pre-processing for Ludwig defined image encoders
if torch.jit.isinstance(v, list[torch.Tensor]):
imgs = [resize_image(img, (self.height, self.width), self.resize_method) for img in v]
imgs_stacked = torch.stack(imgs)
else:
imgs_stacked = v
_, num_channels, height, width = imgs_stacked.shape
# Ensure images are the size expected by the model
if height != self.height or width != self.width:
imgs_stacked = resize_image(imgs_stacked, (self.height, self.width), self.resize_method)
# Ensures images have the number of channels expected by the model
if num_channels != self.num_channels:
if self.num_channels == 1:
imgs_stacked = grayscale(imgs_stacked)
elif num_channels < self.num_channels:
extra_channels = self.num_channels - num_channels
imgs_stacked = torch.nn.functional.pad(imgs_stacked, [0, 0, 0, 0, 0, extra_channels])
else:
raise ValueError(
f"Number of channels cannot be reconciled. metadata.num_channels = "
f"{self.num_channels}, but imgs.shape[1] = {num_channels}"
)
# Create class-masked images if required
if self.channel_class_map.shape[0]:
masks = []
for img in imgs_stacked:
mask = get_class_mask_from_image(self.channel_class_map, img)
masks.append(mask)
imgs_stacked = torch.stack(masks)
else:
imgs_stacked = imgs_stacked.type(torch.float32) / 255
return imgs_stacked
class _ImagePostprocessing(torch.nn.Module):
def __init__(self):
super().__init__()
self.logits_key = LOGITS
self.predictions_key = PREDICTIONS
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
logits = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.logits_key)
return {self.predictions_key: predictions, self.logits_key: logits}
class _ImagePredict(PredictModule):
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
predictions = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.predictions_key)
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
return {self.predictions_key: predictions, self.logits_key: logits}
class ImageFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return IMAGE
@staticmethod
def cast_column(column, backend):
return column
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
return {PREPROCESSING: preprocessing_parameters}
@staticmethod
def _read_image_if_bytes_obj_and_resize(
img_entry: bytes | torch.Tensor | np.ndarray | str,
img_width: int,
img_height: int,
should_resize: bool,
num_channels: int,
resize_method: str,
user_specified_num_channels: bool,
standardize_image: str,
channel_class_map: torch.Tensor,
) -> np.ndarray | None:
"""Helper method to read and resize an image according to the model definition.
If the user doesn't specify a number of channels, we use the first image in the dataset as the source of truth. If
any image in the dataset doesn't have the same number of channels as the first image, raise an exception.
If the user specifies a number of channels, we try to convert all the images to the specifications by dropping
channels/padding 0 channels.
:param img_entry: Union[bytes, torch.Tensor, np.ndarray, str]; if str, the file path to the image, otherwise the
image itself
:param img_width: expected width of the image
:param img_height: expected height of the image
:param should_resize: should the image be resized?
:param num_channels: expected number of channels in the first image
:param resize_method: type of resizing method
:param user_specified_num_channels: did the user specify num channels?
:param standardize_image: specifies whether to standardize the image with imagenet1k specifications
:param channel_class_map: a tensor mapping channel values to classes, where dim=0 is the class
:return: image object as a numpy array, or None if the image cannot be read.
"""
if isinstance(img_entry, bytes):
img = read_image_from_bytes_obj(img_entry, num_channels)
elif isinstance(img_entry, str):
img = read_image_from_path(img_entry, num_channels)
elif isinstance(img_entry, np.ndarray):
img = torch.from_numpy(np.array(img_entry, copy=True)).permute(2, 0, 1)
else:
img = img_entry
if not isinstance(img, torch.Tensor):
warnings.warn(f"Image with value {img} cannot be read")
return None
img_num_channels = num_channels_in_image(img)
# Convert to grayscale if needed.
if num_channels == 1 and img_num_channels != 1:
img = grayscale(img)
img_num_channels = 1
if should_resize:
img = resize_image(img, (img_height, img_width), resize_method)
if user_specified_num_channels:
# Number of channels is specified by the user
# img_padded = np.zeros((img_height, img_width, num_channels),
# dtype=np.uint8)
# min_num_channels = min(num_channels, img_num_channels)
# img_padded[:, :, :min_num_channels] = img[:, :, :min_num_channels]
# img = img_padded
if num_channels > img_num_channels:
extra_channels = num_channels - img_num_channels
img = torch.nn.functional.pad(img, [0, 0, 0, 0, 0, extra_channels])
if img_num_channels != num_channels:
logger.warning(
"Image has {} channels, whereas {} "
"channels are expected. Dropping/adding channels "
"with 0s as appropriate".format(img_num_channels, num_channels)
)
else:
# If the image isn't like the first image, raise exception
if img_num_channels != num_channels:
raise ValueError(
"Image has {} channels, unlike the first image, which "
"has {} channels. Make sure all the images have the same "
"number of channels or use the num_channels property in "
"image preprocessing".format(img_num_channels, num_channels)
)
if img.shape[1] != img_height or img.shape[2] != img_width:
raise ValueError(
"Images are not of the same size. "
"Expected size is {}, "
"current image size is {}. "
"Images are expected to be all of the same size "
"or explicit image width and height are expected "
"to be provided. "
"Additional information: "
"https://ludwig-ai.github.io/ludwig-docs/latest/configuration/features/image_features"
"#image-features-preprocessing".format([img_height, img_width, num_channels], img.shape)
)
# Create class-masked image if required
if channel_class_map.shape[0]:
img = get_class_mask_from_image(channel_class_map, img)
else:
# casting and rescaling
img = img.type(torch.float32) / 255
if standardize_image == IMAGENET1K:
img = normalize(img, mean=IMAGENET1K_MEAN, std=IMAGENET1K_STD)
return img.numpy()
@staticmethod
def _read_image_with_pretrained_transform(
img_entry: bytes | torch.Tensor | np.ndarray | str,
transform_fn: Callable,
) -> np.ndarray | None:
if isinstance(img_entry, bytes):
img = read_image_from_bytes_obj(img_entry)
elif isinstance(img_entry, str):
img = read_image_from_path(img_entry)
elif isinstance(img_entry, np.ndarray):
img = torch.from_numpy(img_entry).permute(2, 0, 1)
else:
img = img_entry
if not isinstance(img, torch.Tensor):
warnings.warn(f"Image with value {img} cannot be read")
return None
img = transform_fn(img)
return img.numpy()
@staticmethod
def _set_image_and_height_equal_for_encoder(
width: int, height: int, preprocessing_parameters: dict, encoder_type: str
) -> tuple[int, int]:
"""Some pretrained image encoders require images with equal width and height, or images with specific width
and height values. The returned width and height are set based on compatibility with the downstream encoder
using the encoder parameters for the feature.
Args:
width: Represents the width of the image. This is either specified in the user config, or inferred using
a sample of images.
height: Represents the height of the image. This is either specified in the user config, or inferred using
a sample of images.
preprocessing_parameters: Parameters defining how the image feature should be preprocessed
encoder_type: The name of the encoder
Return:
(width, height) Updated width and height so that they are equal
"""
if preprocessing_parameters[REQUIRES_EQUAL_DIMENSIONS] and height != width:
width = height = min(width, height)
# Update preprocessing parameters dictionary to reflect new height and width values
preprocessing_parameters["width"] = width
preprocessing_parameters["height"] = height
logger.info(f"Set image feature height and width to {width} to be compatible with {encoder_type} encoder.")
return width, height
@staticmethod
def _infer_image_size(
image_sample: list[torch.Tensor],
max_height: int,
max_width: int,
preprocessing_parameters: dict,
encoder_type: str,
) -> tuple[int, int]:
"""Infers the size to use from a group of images. The returned height will be the average height of images
in image_sample rounded to the nearest integer, capped at max_height. Likewise for width.
Args:
image_sample: Sample of images to use to infer image size. Must be formatted as [channels, height, width].
max_height: Maximum height.
max_width: Maximum width.
preprocessing_parameters: Parameters defining how the image feature should be preprocessed
encoder_type: The name of the encoder
Return:
(height, width) The inferred height and width.
"""
height_avg = sum(x.shape[1] for x in image_sample) / len(image_sample)
width_avg = sum(x.shape[2] for x in image_sample) / len(image_sample)
height = min(int(round(height_avg)), max_height)
width = min(int(round(width_avg)), max_width)
# Update height and width if the downstream encoder requires images
# with the same dimension or specific width and height values
width, height = ImageFeatureMixin._set_image_and_height_equal_for_encoder(
width, height, preprocessing_parameters, encoder_type
)
logger.debug(f"Inferring height: {height} and width: {width}")
return height, width
@staticmethod
def _infer_number_of_channels(image_sample: list[torch.Tensor]):
"""Infers the channel depth to use from a group of images.
We make the assumption that the majority of datasets scraped from the web will be RGB, so if we get a mixed bag
of images we should default to that. However, if the majority of the sample images have a specific channel depth
(other than 3) this is probably intentional so we keep it, but log an info message.
"""
n_images = len(image_sample)
channel_frequency = Counter([num_channels_in_image(x) for x in image_sample])
if channel_frequency[1] > n_images / 2:
# If the majority of images in sample are 1 channel, use 1.
num_channels = 1
elif channel_frequency[2] > n_images / 2:
# If the majority of images in sample are 2 channel, use 2.
num_channels = 2
elif channel_frequency[4] > n_images / 2:
# If the majority of images in sample are 4 channel, use 4.
num_channels = 4
else:
# Default case: use 3 channels.
num_channels = 3
logger.info(f"Inferring num_channels from the first {n_images} images.")
logger.info("\n".join([f" images with {k} channels: {v}" for k, v in sorted(channel_frequency.items())]))
if num_channels == max(channel_frequency, key=channel_frequency.get):
logger.info(
f"Using {num_channels} channels because it is the majority in sample. If an image with"
f" a different depth is read, will attempt to convert to {num_channels} channels."
)
else:
logger.info(f"Defaulting to {num_channels} channels.")
logger.info(
"To explicitly set the number of channels, define num_channels in the preprocessing dictionary of "
"the image input feature config."
)
return num_channels
@staticmethod
def _infer_image_num_classes(
image_sample: list[torch.Tensor],
num_channels: int,
num_classes: int,
) -> torch.Tensor:
"""Infers the number of channel classes from a group of images (for image segmentation). The returned
tensor contains the channel value for each class, where dim=0 is the class.
Args:
image_sample: Sample of images to use to infer the channel classes. Must be formatted as [channels, height, width].
num_channels: Expected number of channels
num_classes: Expected number of channel classes or None
Return:
channel_class_map: A tensor mapping channel values to classes, where dim=0 is the class.
"""
n_images = len(image_sample)
logger.info(f"Inferring num_classes from the first {n_images} images.")
channel_class_map = get_unique_channels(image_sample, num_channels, num_classes)
inferred_num_classes = channel_class_map.shape[0]
if num_classes:
if num_classes < inferred_num_classes:
raise ValueError(
f"Images inferred num classes {inferred_num_classes} exceeds `num_classes` {num_classes}."
)
elif num_classes > inferred_num_classes:
logger.warning(
"Images inferred num classes {} does not match `num_classes` {}. "
"Using inferred num classes {}.".format(inferred_num_classes, num_classes, inferred_num_classes)
)
return channel_class_map
@staticmethod
def _finalize_preprocessing_parameters(
preprocessing_parameters: dict,
encoder_type: str,
column: Series,
) -> tuple:
"""Helper method to determine the height, width and number of channels for preprocessing the image data.
This is achieved by looking at the parameters provided by the user. When some parameters are missing, we
fall back on the first image in the dataset, on the assumption that all the images in the data are
expected to be of the same size with the same number of channels.
Args:
preprocessing_parameters: Parameters defining how the image feature should be preprocessed
encoder_type: The name of the encoder
column: The data itself. Can be a Pandas, Modin or Dask series.
"""
explicit_height_width = preprocessing_parameters[HEIGHT] or preprocessing_parameters[WIDTH]
explicit_num_channels = NUM_CHANNELS in preprocessing_parameters and preprocessing_parameters[NUM_CHANNELS]
if preprocessing_parameters[INFER_IMAGE_DIMENSIONS] and not (explicit_height_width and explicit_num_channels):
sample_size = min(len(column), preprocessing_parameters[INFER_IMAGE_SAMPLE_SIZE])
else:
sample_size = 1 # Take first image
sample = []
sample_num_bytes = []
failed_entries = []
for image_entry in column.head(sample_size):
if isinstance(image_entry, bytes):
image = read_image_from_bytes_obj(image_entry)
elif isinstance(image_entry, str):
# Tries to read image as PNG or numpy file from the path.
image, num_bytes = read_image_from_path(image_entry, return_num_bytes=True)
if num_bytes is not None:
sample_num_bytes.append(num_bytes)
else:
image = image_entry
if isinstance(image, torch.Tensor):
sample.append(image)
elif isinstance(image, np.ndarray):
sample.append(torch.from_numpy(image).permute(2, 0, 1))
else:
failed_entries.append(image_entry)
if len(sample) == 0:
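# Entries that failed to load are not necessarily strings, so convert them before joining.
failed_entries_repr = "\n\t- ".join(str(entry) for entry in failed_entries)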
raise ValueError(
f"Images dimensions cannot be inferred. Failed to read {sample_size} images as samples:"
f"\n\t- {failed_entries_repr}."
)
should_resize = False
if explicit_height_width:
should_resize = True
try:
height = int(preprocessing_parameters[HEIGHT])
width = int(preprocessing_parameters[WIDTH])
# Update height and width if the downstream encoder requires images
# with the same dimension or specific width and height values
width, height = ImageFeatureMixin._set_image_and_height_equal_for_encoder(
width, height, preprocessing_parameters, encoder_type
)
except (TypeError, ValueError) as e:
# int(None) raises TypeError when only one of height/width is set.
raise ValueError(f"Image height and width must be set and have positive integer values: {e}") from e
if height <= 0 or width <= 0:
raise ValueError("Image height and width must be positive integers")
else:
# User hasn't specified height and width.
# Default to inferring from sample or first image.
if preprocessing_parameters[INFER_IMAGE_DIMENSIONS]:
should_resize = True
height, width = ImageFeatureMixin._infer_image_size(
sample,
max_height=preprocessing_parameters[INFER_IMAGE_MAX_HEIGHT],
max_width=preprocessing_parameters[INFER_IMAGE_MAX_WIDTH],
preprocessing_parameters=preprocessing_parameters,
encoder_type=encoder_type,
)
else:
raise ValueError(
"Explicit image width/height are not set and infer_image_dimensions is false, "
"so image dimensions are unknown"
)
if explicit_num_channels:
# User specified num_channels in the model/feature config
user_specified_num_channels = True
num_channels = preprocessing_parameters[NUM_CHANNELS]
else:
user_specified_num_channels = False
if preprocessing_parameters[INFER_IMAGE_DIMENSIONS]:
user_specified_num_channels = True
num_channels = ImageFeatureMixin._infer_number_of_channels(sample)
elif len(sample) > 0:
num_channels = num_channels_in_image(sample[0])
else:
raise ValueError(
"Explicit image num channels is not set, infer_image_dimensions is false, "
"and first image cannot be read, so image num channels is unknown"
)
if not isinstance(num_channels, int):
raise ValueError("Number of image channels needs to be an integer")
average_file_size = np.mean(sample_num_bytes) if sample_num_bytes else None
standardize_image = preprocessing_parameters["standardize_image"]
if standardize_image == "imagenet1k" and num_channels != 3:
warnings.warn(
f"'standardize_image=imagenet1k' is defined only for 'num_channels=3' but "
f"detected 'num_channels={num_channels}'. For this situation setting 'standardize_image=None'.",
RuntimeWarning,
)
standardize_image = None
if preprocessing_parameters[INFER_IMAGE_NUM_CLASSES] or preprocessing_parameters["num_classes"]:
channel_class_map = ImageFeatureMixin._infer_image_num_classes(
sample, num_channels, preprocessing_parameters["num_classes"]
)
else:
channel_class_map = torch.Tensor([])
return (
should_resize,
width,
height,
num_channels,
user_specified_num_channels,
average_file_size,
standardize_image,
channel_class_map,
)
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
set_default_value(feature_config[PREPROCESSING], "in_memory", preprocessing_parameters["in_memory"])
name = feature_config[NAME]
column = input_df[feature_config[COLUMN]]
encoder_type = feature_config[ENCODER][TYPE] if ENCODER in feature_config.keys() else None
src_path = None
if SRC in metadata:
src_path = os.path.dirname(os.path.abspath(metadata.get(SRC)))
abs_path_column = backend.df_engine.map_objects(
column,
lambda row: get_abs_path(src_path, row) if isinstance(row, str) and not has_remote_protocol(row) else row,
)
# determine if specified encoder is a torchvision model
model_type = feature_config[ENCODER].get("type", None) if ENCODER in feature_config.keys() else None
model_variant = feature_config[ENCODER].get("model_variant") if ENCODER in feature_config.keys() else None
if model_variant:
torchvision_parameters = _get_torchvision_parameters(model_type, model_variant)
else:
torchvision_parameters = None
if torchvision_parameters:
logger.warning(
f"Using the transforms specified for the torchvision model {model_type} {model_variant}. "
f"This includes setting the number of channels to 3 and resizing the image to the needs of the model."
)
torchvision_transform, transform_metadata = _get_torchvision_transform(torchvision_parameters)
# torchvision_parameters is not None
# perform torchvision model transformations
read_image_if_bytes_obj_and_resize = partial(
ImageFeatureMixin._read_image_with_pretrained_transform,
transform_fn=torchvision_transform,
)
average_file_size = None
# save weight specification in preprocessing section
preprocessing_parameters["torchvision_model_default_weights"] = (
f"{torchvision_parameters.model_weights.DEFAULT}"
)
# add torchvision model id to preprocessing section for torchscript
preprocessing_parameters["torchvision_model_type"] = model_type
preprocessing_parameters["torchvision_model_variant"] = model_variant
# get required setup parameters for in_memory = False processing
height = transform_metadata.height
width = transform_metadata.width
num_channels = transform_metadata.num_channels
channel_class_map = torch.Tensor([])
else:
# torchvision_parameters is None
# perform Ludwig specified transformations
(
should_resize,
width,
height,
num_channels,
user_specified_num_channels,
average_file_size,
standardize_image,
channel_class_map,
) = ImageFeatureMixin._finalize_preprocessing_parameters(
preprocessing_parameters, encoder_type, abs_path_column
)
metadata[name][PREPROCESSING]["height"] = height
metadata[name][PREPROCESSING]["width"] = width
metadata[name][PREPROCESSING]["num_channels"] = num_channels
metadata[name][PREPROCESSING]["num_classes"] = channel_class_map.shape[0]
metadata[name][PREPROCESSING]["channel_class_map"] = channel_class_map.tolist()
read_image_if_bytes_obj_and_resize = partial(
ImageFeatureMixin._read_image_if_bytes_obj_and_resize,
img_width=width,
img_height=height,
should_resize=should_resize,
num_channels=num_channels,
resize_method=preprocessing_parameters["resize_method"],
user_specified_num_channels=user_specified_num_channels,
standardize_image=standardize_image,
channel_class_map=channel_class_map,
)
# TODO: alternatively use get_average_image() for unreachable images
if channel_class_map.shape[0]:
default_image = get_gray_default_image(1, height, width).squeeze(0)
metadata[name]["reshape"] = (height, width)
else:
default_image = get_gray_default_image(num_channels, height, width)
metadata[name]["reshape"] = (num_channels, height, width)
in_memory = feature_config[PREPROCESSING]["in_memory"]
if in_memory or skip_save_processed_input:
proc_col = backend.read_binary_files(
abs_path_column, map_fn=read_image_if_bytes_obj_and_resize, file_size=average_file_size
)
num_failed_image_reads = (
proc_col.isna().sum().compute() if is_dask_series_or_df(proc_col, backend) else proc_col.isna().sum()
)
proc_col = backend.df_engine.map_objects(
proc_col, lambda row: default_image if not isinstance(row, np.ndarray) else row
)
proc_df[feature_config[PROC_COLUMN]] = proc_col
else:
num_images = len(abs_path_column)
num_failed_image_reads = 0
data_fp = backend.cache.get_cache_path(wrap(metadata.get(SRC)), metadata.get(CHECKSUM), TRAINING)
with upload_h5(data_fp) as h5_file:
# todo future add multiprocessing/multithreading
image_dataset = h5_file.create_dataset(
feature_config[PROC_COLUMN] + "_data", (num_images, num_channels, height, width), dtype=np.float32
)
for i, img_entry in enumerate(abs_path_column):
res = read_image_if_bytes_obj_and_resize(img_entry)
if isinstance(res, np.ndarray):
# The dataset layout is (num_images, num_channels, height, width).
image_dataset[i, :, :height, :width] = res
else:
logger.warning(f"Failed to read image {img_entry} while preprocessing feature `{name}`.")
image_dataset[i, :, :height, :width] = default_image
num_failed_image_reads += 1
h5_file.flush()
proc_df[feature_config[PROC_COLUMN]] = np.arange(num_images)
if num_failed_image_reads > 0:
logger.warning(
f"Failed to read {num_failed_image_reads} images while preprocessing feature `{name}`. "
"Using default image for these rows in the dataset."
)
return proc_df
class ImageInputFeature(ImageFeatureMixin, InputFeature):
def __init__(self, input_feature_config: ImageInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
# set up for augmentation if it is enabled
if input_feature_config.augmentation:
# assume no image normalize is required
normalize_mean = normalize_std = None
# determine if specified encoder is a torchvision model
if is_torchvision_encoder(self.encoder_obj):
# encoder is a torchvision model
normalize_mean = self.encoder_obj.normalize_mean
normalize_std = self.encoder_obj.normalize_std
else:
# encoder is a Ludwig encoder, determine if standardize_image is set to IMAGENET1K
if input_feature_config.preprocessing.standardize_image == IMAGENET1K:
normalize_mean = IMAGENET1K_MEAN
normalize_std = IMAGENET1K_STD
# create augmentation pipeline object
self.augmentation_pipeline = ImageAugmentation(
input_feature_config.augmentation,
normalize_mean,
normalize_std,
)
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
assert isinstance(inputs, torch.Tensor), f"inputs to image feature must be a torch tensor, got {type(inputs)}"
assert inputs.dtype in [torch.float32], f"inputs to image feature must be a float32 tensor, got {inputs.dtype}"
inputs_encoded = self.encoder_obj(inputs)
return inputs_encoded
@property
def input_dtype(self):
return torch.float32
@property
def input_shape(self) -> torch.Size:
return torch.Size(self.encoder_obj.input_shape)
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
def update_config_after_module_init(self, feature_config):
if is_torchvision_encoder(self.encoder_obj):
# Update feature preprocessing parameters to reflect those used by the torchvision pretrained model.
# Note: image height and width are determined by the encoder crop_size attribute. This attribute
# comes from the torchvision.transforms._presets.ImageClassification class, which stores
# crop_size as a single-element list. The single element in this list is used to set both the height
# and width of an image.
feature_config.preprocessing.height = self.encoder_obj.crop_size[0]
feature_config.preprocessing.width = self.encoder_obj.crop_size[0]
feature_config.preprocessing.num_channels = self.encoder_obj.num_channels
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
for key in ["height", "width", "num_channels", "standardize_image"]:
if hasattr(feature_config.encoder, key):
setattr(feature_config.encoder, key, feature_metadata[PREPROCESSING][key])
@staticmethod
def get_schema_cls():
return ImageInputFeatureConfig
@staticmethod
def create_preproc_module(metadata: dict[str, Any]) -> torch.nn.Module:
model_type = metadata["preprocessing"].get("torchvision_model_type")
model_variant = metadata["preprocessing"].get("torchvision_model_variant")
if model_variant:
torchvision_parameters = _get_torchvision_parameters(model_type, model_variant)
else:
torchvision_parameters = None
if torchvision_parameters:
torchvision_transform, transform_metadata = _get_torchvision_transform(torchvision_parameters)
else:
torchvision_transform = None
transform_metadata = None
return _ImagePreprocessing(
metadata, torchvision_transform=torchvision_transform, transform_metadata=transform_metadata
)
def get_augmentation_pipeline(self):
return self.augmentation_pipeline
class ImageOutputFeature(ImageFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: ImageOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
super().__init__(output_feature_config, output_features, **kwargs)
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs: dict[str, torch.Tensor], target=None, **kwargs):
return self.decoder_obj(inputs, target=target)
def metric_kwargs(self):
return dict(num_outputs=self.output_shape[0])
def create_predict_module(self) -> PredictModule:
return _ImagePredict()
def get_prediction_set(self):
return self.decoder_obj.get_prediction_set()
@classmethod
def get_output_dtype(cls):
return torch.float32
@property
def output_shape(self) -> torch.Size:
return self.decoder_obj.output_shape
@property
def input_shape(self) -> torch.Size:
return self.decoder_obj.input_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
for key in ["height", "width", "num_channels", "num_classes", "standardize_image"]:
if hasattr(feature_config.decoder, key):
setattr(feature_config.decoder, key, feature_metadata[PREPROCESSING][key])
@staticmethod
def calculate_overall_stats(predictions, targets, metadata):
# no overall stats, just return empty dictionary
return {}
def postprocess_predictions(
self,
result,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in result:
channel_class_map = torch.ByteTensor(metadata[PREPROCESSING]["channel_class_map"])
if channel_class_map.shape[0]:
def class_mask2img(row):
pred = row[predictions_col]
return get_image_from_class_mask(channel_class_map, pred)
result[predictions_col] = result.apply(class_mask2img, axis=1)
return result
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _ImagePostprocessing(metadata)
@staticmethod
def get_schema_cls():
return ImageOutputFeatureConfig
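# Illustrative sketch, not part of Ludwig: the majority rule documented in
# ImageFeatureMixin._infer_number_of_channels, reduced to plain Python. It assumes the channel count of
# each sampled image is already known (the hypothetical `channel_counts` list), whereas the real method
# derives the counts from tensors via num_channels_in_image. The helper name `infer_num_channels` is
# an assumption made for this example only.

```python
from collections import Counter


def infer_num_channels(channel_counts: list[int], default: int = 3) -> int:
    """Return the strict-majority channel depth among 1, 2 and 4, else `default`."""
    n_images = len(channel_counts)
    frequency = Counter(channel_counts)
    # A depth wins only if more than half of the sampled images share it.
    for candidate in (1, 2, 4):
        if frequency[candidate] > n_images / 2:
            return candidate
    # Mixed samples (or an RGB majority) fall back to the default of 3 channels.
    return default


print(infer_num_channels([1, 1, 1, 3]))     # grayscale majority
print(infer_num_channels([1, 1, 3, 3]))     # no strict majority, falls back to the default
print(infer_num_channels([4, 4, 4, 3, 1]))  # 4-channel majority
```

# A tie never selects a non-RGB depth, because the comparison is a strict "more than half".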
================================================
FILE: ludwig/features/number_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import copy
import logging
from abc import ABC, abstractmethod
from typing import Any
import numpy as np
import pandas as pd
import torch
from torch import nn
from ludwig.constants import COLUMN, HIDDEN, LOGITS, NAME, NUMBER, PREDICTIONS, PROC_COLUMN
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature, OutputFeature, PredictModule
from ludwig.schema.features.number_feature import NumberInputFeatureConfig, NumberOutputFeatureConfig
from ludwig.types import (
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import output_feature_utils
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class NumberTransformer(nn.Module, ABC):
@abstractmethod
def transform(self, x: np.ndarray) -> np.ndarray:
pass
@abstractmethod
def inverse_transform(self, x: np.ndarray) -> np.ndarray:
pass
@abstractmethod
def transform_inference(self, x: torch.Tensor) -> torch.Tensor:
pass
@abstractmethod
def inverse_transform_inference(self, x: torch.Tensor) -> torch.Tensor:
pass
@staticmethod
@abstractmethod
def fit_transform_params(column: np.ndarray, backend: Any) -> dict[str, Any]:
pass
class ZScoreTransformer(NumberTransformer):
def __init__(self, mean: float = None, std: float = None, **kwargs: dict):
super().__init__()
self.mu = float(mean) if mean is not None else mean
self.sigma = float(std) if std is not None else std
self.feature_name = kwargs.get(NAME, "")
if self.sigma == 0:
raise RuntimeError(
f"Cannot apply zscore normalization to `{self.feature_name}` since it has a standard deviation of 0. "
f"This is most likely because `{self.feature_name}` has a constant value of {self.mu} for all rows in "
"the dataset. Consider removing this feature from your Ludwig config since it is not useful for "
"your machine learning model."
)
def transform(self, x: np.ndarray) -> np.ndarray:
return (x - self.mu) / self.sigma
def inverse_transform(self, x: np.ndarray) -> np.ndarray:
return x * self.sigma + self.mu
def transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return (x - self.mu) / self.sigma
def inverse_transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return x * self.sigma + self.mu
@staticmethod
def fit_transform_params(column: np.ndarray, backend: "Backend") -> dict[str, Any]: # noqa
compute = backend.df_engine.compute
return {
"mean": compute(column.astype(np.float32).mean()),
"std": compute(column.astype(np.float32).std()),
}
class MinMaxTransformer(NumberTransformer):
def __init__(self, min: float = None, max: float = None, **kwargs: dict):
super().__init__()
self.min_value = float(min) if min is not None else min
self.max_value = float(max) if max is not None else max
if self.min_value is None or self.max_value is None:
self.range = None
else:
self.range = self.max_value - self.min_value
def transform(self, x: np.ndarray) -> np.ndarray:
return (x - self.min_value) / self.range
def inverse_transform(self, x: np.ndarray) -> np.ndarray:
if self.range is None:
raise ValueError("Numeric transformer needs to be instantiated with min and max values.")
return x * self.range + self.min_value
def transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return (x - self.min_value) / self.range
def inverse_transform_inference(self, x: torch.Tensor) -> torch.Tensor:
if self.range is None:
raise ValueError("Numeric transformer needs to be instantiated with min and max values.")
return x * self.range + self.min_value
@staticmethod
def fit_transform_params(column: np.ndarray, backend: "Backend") -> dict[str, Any]: # noqa
compute = backend.df_engine.compute
return {
"min": compute(column.astype(np.float32).min()),
"max": compute(column.astype(np.float32).max()),
}
class InterQuartileTransformer(NumberTransformer):
def __init__(self, q1: float = None, q2: float = None, q3: float = None, **kwargs: dict):
super().__init__()
self.q1 = float(q1) if q1 is not None else q1
self.q2 = float(q2) if q2 is not None else q2
self.q3 = float(q3) if q3 is not None else q3
if self.q1 is None or self.q3 is None:
self.interquartile_range = None
else:
self.interquartile_range = self.q3 - self.q1
self.feature_name = kwargs.get(NAME, "")
if self.interquartile_range == 0:
raise RuntimeError(
f"Cannot apply InterQuartileNormalization to `{self.feature_name}` since "
"the interquartile range is 0, which will result in a ZeroDivisionError."
)
def transform(self, x: np.ndarray) -> np.ndarray:
return (x - self.q2) / self.interquartile_range
def inverse_transform(self, x: np.ndarray) -> np.ndarray:
return x * self.interquartile_range + self.q2
def transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return (x - self.q2) / self.interquartile_range
def inverse_transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return x * self.interquartile_range + self.q2
@staticmethod
def fit_transform_params(column: np.ndarray, backend: "Backend") -> dict[str, Any]: # noqa
# backend.df_engine.compute is not used here because `percentile` is not parallelized in dask.
# We compute the percentile directly.
return {
"q1": np.percentile(column.astype(np.float32), 25),
"q2": np.percentile(column.astype(np.float32), 50),
"q3": np.percentile(column.astype(np.float32), 75),
}
class Log1pTransformer(NumberTransformer):
def __init__(self, **kwargs: dict):
super().__init__()
self.feature_name = kwargs.get(NAME, "")
def transform(self, x: np.ndarray) -> np.ndarray:
if np.any(x <= 0):
raise ValueError(
f"One or more values in the `{self.feature_name}` feature are non-positive. "
"log1p normalization is defined only for positive values."
)
return np.log1p(x)
def inverse_transform(self, x: np.ndarray) -> np.ndarray:
return np.expm1(x)
def transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return torch.log1p(x)
def inverse_transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return torch.expm1(x)
@staticmethod
def fit_transform_params(column: np.ndarray, backend: "Backend") -> dict[str, Any]: # noqa
return {}
class IdentityTransformer(NumberTransformer):
def __init__(self, **kwargs):
super().__init__()
def transform(self, x: np.ndarray) -> np.ndarray:
return x
def inverse_transform(self, x: np.ndarray) -> np.ndarray:
return x
def transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return x
def inverse_transform_inference(self, x: torch.Tensor) -> torch.Tensor:
return x
@staticmethod
def fit_transform_params(column: np.ndarray, backend: "Backend") -> dict[str, Any]: # noqa
return {}
numeric_transformation_registry = {
"minmax": MinMaxTransformer,
"zscore": ZScoreTransformer,
"log1p": Log1pTransformer,
"iq": InterQuartileTransformer,
None: IdentityTransformer,
}
def get_transformer(metadata, preprocessing_parameters) -> NumberTransformer:
return get_from_registry(
preprocessing_parameters.get("normalization", None),
numeric_transformation_registry,
)(**metadata)
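# Illustrative sketch, not part of Ludwig: how a registry lookup such as get_transformer can turn a
# feature metadata dict into a transformer. The classes below (ZScore, MinMax, Identity), the REGISTRY
# dict and build_transformer are simplified stand-ins written for this example; they operate on plain
# floats instead of numpy arrays or torch tensors. Each constructor accepts **kwargs so that unrelated
# metadata keys (for example "preprocessing") are ignored when the whole dict is unpacked.

```python
class ZScore:
    def __init__(self, mean: float, std: float, **kwargs):
        if std == 0:
            raise ValueError("standard deviation must be non-zero")
        self.mean, self.std = float(mean), float(std)

    def transform(self, x: float) -> float:
        return (x - self.mean) / self.std

    def inverse_transform(self, x: float) -> float:
        return x * self.std + self.mean


class MinMax:
    def __init__(self, min: float, max: float, **kwargs):
        self.min_value = float(min)
        self.range = float(max) - float(min)

    def transform(self, x: float) -> float:
        return (x - self.min_value) / self.range

    def inverse_transform(self, x: float) -> float:
        return x * self.range + self.min_value


class Identity:
    def __init__(self, **kwargs):
        pass

    def transform(self, x: float) -> float:
        return x

    def inverse_transform(self, x: float) -> float:
        return x


# The None key plays the role of "no normalization requested".
REGISTRY = {"zscore": ZScore, "minmax": MinMax, None: Identity}


def build_transformer(metadata: dict, preprocessing: dict):
    """Pick the class by normalization name and feed it the fitted parameters."""
    return REGISTRY[preprocessing.get("normalization")](**metadata)


metadata = {"mean": 10.0, "std": 2.0, "preprocessing": {"normalization": "zscore"}}
transformer = build_transformer(metadata, metadata["preprocessing"])
z = transformer.transform(14.0)
print(z, transformer.inverse_transform(z))
```

# The round trip (transform followed by inverse_transform) recovers the original value, which is the
# property the postprocessing step relies on when it maps predictions back to the original scale.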
class _OutlierReplacer(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.zscore_transformer = ZScoreTransformer(**metadata)
self.outlier_threshold = metadata["preprocessing"].get("outlier_threshold")
self.computed_outlier_fill_value = float(metadata["preprocessing"]["computed_outlier_fill_value"])
def forward(self, v: torch.Tensor) -> torch.Tensor:
outliers = self.zscore_transformer.transform_inference(v).abs().gt(self.outlier_threshold)
v_masked = torch.masked_fill(v, outliers, torch.nan)
v = torch.nan_to_num(v_masked, nan=self.computed_outlier_fill_value)
return v.to(dtype=torch.float32)
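# Illustrative sketch, not part of Ludwig: the z-score outlier rule used by _OutlierReplacer, written
# for a plain list of floats. The function name replace_outliers and its parameters are assumptions
# made for this example. Unlike the torch version, it does not touch NaN or infinite inputs; it only
# shows the thresholding step.

```python
def replace_outliers(values, mean, std, threshold, fill_value):
    """Replace values whose absolute z-score exceeds `threshold` with `fill_value`."""
    cleaned = []
    for v in values:
        z_score = (v - mean) / std
        # Strictly greater than, so a value exactly at the threshold is kept.
        cleaned.append(fill_value if abs(z_score) > threshold else float(v))
    return cleaned


print(replace_outliers([1.0, 2.0, 3.0, 100.0], mean=2.0, std=1.0, threshold=3.0, fill_value=2.0))
```

# Here only 100.0 lies more than 3 standard deviations from the mean, so it is the only value replaced.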
class _NumberPreprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.computed_fill_value = float(metadata["preprocessing"]["computed_fill_value"])
self.numeric_transformer = get_transformer(metadata, metadata["preprocessing"])
# Optional outlier replacement
self.outlier_replacer = None
if metadata["preprocessing"].get("outlier_strategy") is not None:
self.outlier_replacer = _OutlierReplacer(metadata)
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if not torch.jit.isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")
v = torch.nan_to_num(v, nan=self.computed_fill_value)
v = v.to(dtype=torch.float32)
# Handle outliers if needed
if self.outlier_replacer is not None:
v = self.outlier_replacer(v)
return self.numeric_transformer.transform_inference(v)
class _NumberPostprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.numeric_transformer = get_transformer(metadata, metadata["preprocessing"])
self.predictions_key = PREDICTIONS
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
return {self.predictions_key: self.numeric_transformer.inverse_transform_inference(predictions)}
class _NumberPredict(PredictModule):
def __init__(self, clip):
super().__init__()
self.clip = clip
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
predictions = logits
if self.clip is not None:
predictions = torch.clamp(logits, self.clip[0], self.clip[1])
logger.debug(f" clipped_predictions: {predictions}")
return {self.predictions_key: predictions, self.logits_key: logits}
class NumberFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return NUMBER
@staticmethod
def cast_column(column, backend):
return backend.df_engine.df_lib.to_numeric(column, errors="coerce").astype(np.float32)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
numeric_transformer: NumberTransformer = get_from_registry(
preprocessing_parameters.get("normalization", None),
numeric_transformation_registry,
)
params = numeric_transformer.fit_transform_params(column, backend)
# Ensure mean and std are computed if we're removing outliers
outlier_strategy = preprocessing_parameters.get("outlier_strategy")
if outlier_strategy is not None and ("mean" not in params or "std" not in params):
params.update(ZScoreTransformer.fit_transform_params(column, backend))
return params
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
# The normalize() function was replaced due to issue #1911.
# The original code is kept here to provide context for the change:
# def normalize(series: pd.Series) -> pd.Series:
# series = series.copy()
# numeric_transformer = get_transformer(metadata[feature_config[NAME]], preprocessing_parameters)
# series.update(numeric_transformer.transform(series.values))
# return series
def normalize(series: pd.Series) -> pd.Series:
_feature_metadata = copy.deepcopy(metadata[feature_config[NAME]])
_feature_metadata.update({NAME: feature_config[NAME]})
# retrieve the requested numeric transformer
numeric_transformer = get_transformer(_feature_metadata, preprocessing_parameters)
# transform input numeric values with specified transformer
transformed_values = numeric_transformer.transform(series.values)
# return transformed values with same index values as original series.
return pd.Series(transformed_values, index=series.index)
input_series = input_df[feature_config[COLUMN]].astype(np.float32)
proc_df[feature_config[PROC_COLUMN]] = backend.df_engine.map_partitions(
input_series, normalize, meta=input_series
)
return proc_df
class NumberInputFeature(NumberFeatureMixin, InputFeature):
def __init__(self, input_feature_config: NumberInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
input_feature_config.encoder.input_size = self.input_shape[-1]
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype == torch.float32 or inputs.dtype == torch.float64
assert len(inputs.shape) == 1 or (len(inputs.shape) == 2 and inputs.shape[1] == 1)
if len(inputs.shape) == 1:
inputs = inputs[:, None]
inputs_encoded = self.encoder_obj(inputs)
return inputs_encoded
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@property
def output_shape(self) -> torch.Size:
return torch.Size(self.encoder_obj.output_shape)
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
def get_schema_cls():
return NumberInputFeatureConfig
def create_sample_input(self, batch_size: int = 2):
return torch.rand([batch_size])
@classmethod
def get_preproc_input_dtype(cls, metadata: TrainingSetMetadataDict) -> str:
return "float32"
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _NumberPreprocessing(metadata)
class NumberOutputFeature(NumberFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: NumberOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
self.clip = output_feature_config.clip
super().__init__(output_feature_config, output_features, **kwargs)
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, **kwargs): # hidden
hidden = inputs[HIDDEN]
return self.decoder_obj(hidden)
def create_predict_module(self) -> PredictModule:
if getattr(self, "clip", None) and not (isinstance(self.clip, (list, tuple)) and len(self.clip) == 2):
raise ValueError(
f"The clip parameter of {self.feature_name} is {self.clip}. "
f"It must be a list or a tuple of length 2."
)
return _NumberPredict(getattr(self, "clip", None))
def get_prediction_set(self):
return {PREDICTIONS, LOGITS}
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.decoder_obj.config.input_size])
@classmethod
def get_output_dtype(cls):
return torch.float32
@property
def output_shape(self) -> torch.Size:
return torch.Size([1])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
pass
@staticmethod
def calculate_overall_stats(predictions, targets, metadata):
# no overall stats, just return empty dictionary
return {}
def postprocess_predictions(
self,
predictions,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in predictions:
# convert predictions back to the original value space, as needed
numeric_transformer = get_from_registry(
metadata["preprocessing"].get("normalization", None),
numeric_transformation_registry,
)(**metadata)
predictions[predictions_col] = predictions[predictions_col].map(
lambda pred: numeric_transformer.inverse_transform(pred)
)
return predictions
@staticmethod
def get_schema_cls():
return NumberOutputFeatureConfig
@classmethod
def get_postproc_output_dtype(cls, metadata: TrainingSetMetadataDict) -> str:
return "float32"
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _NumberPostprocessing(metadata)
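

# Illustrative sketch, not part of Ludwig's API: postprocess_predictions above maps
# each prediction through the transformer's inverse_transform. The two helpers below
# are hypothetical and ASSUME a plain z-score normalization, z = (x - mean) / std;
# the transformers registered in numeric_transformation_registry may differ.
def _sketch_zscore_transform(x, mean, std):
    # Map a raw value into normalized space.
    return (x - mean) / std


def _sketch_zscore_inverse(z, mean, std):
    # Map a normalized prediction back into the original value space.
    return z * std + mean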
================================================
FILE: ludwig/features/sequence_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from functools import partial
import numpy as np
import torch
from ludwig.constants import (
COLUMN,
LAST_PREDICTIONS,
LENGTHS,
NAME,
PREDICTIONS,
PROBABILITIES,
PROBABILITY,
PROC_COLUMN,
SEQUENCE,
)
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature, OutputFeature, PredictModule
from ludwig.features.feature_utils import compute_sequence_probability, compute_token_probabilities
from ludwig.schema.features.sequence_feature import SequenceInputFeatureConfig, SequenceOutputFeatureConfig
from ludwig.types import (
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import output_feature_utils
from ludwig.utils.math_utils import softmax
from ludwig.utils.strings_utils import (
build_sequence_matrix,
create_vocabulary,
SpecialSymbol,
START_SYMBOL,
STOP_SYMBOL,
UNKNOWN_SYMBOL,
)
from ludwig.utils.tokenizers import get_tokenizer_from_registry
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class _SequencePreprocessing(torch.nn.Module):
"""Torchscript-enabled version of preprocessing done by SequenceFeatureMixin.add_feature_data."""
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.lowercase = metadata["preprocessing"]["lowercase"]
self.tokenizer_type = metadata["preprocessing"]["tokenizer"]
self.tokenizer = get_tokenizer_from_registry(self.tokenizer_type)(
pretrained_model_name_or_path=metadata["preprocessing"].get("pretrained_model_name_or_path", None)
)
if not isinstance(self.tokenizer, torch.nn.Module):
raise ValueError(f"tokenizer must be a torch.nn.Module, got {self.tokenizer}")
self.padding_symbol = metadata["preprocessing"]["padding_symbol"]
self.unknown_symbol = metadata["preprocessing"]["unknown_symbol"]
self.start_symbol = START_SYMBOL
self.stop_symbol = STOP_SYMBOL
self.max_sequence_length = int(metadata["max_sequence_length"])
self.unit_to_id = metadata["str2idx"]
self.computed_fill_value = metadata["preprocessing"]["computed_fill_value"]
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
"""Takes a list of strings and returns a tensor of token ids."""
if not torch.jit.isinstance(v, list[str]):
raise ValueError(f"Unsupported input: {v}")
futures: list[torch.jit.Future[torch.Tensor]] = []
for sequence in v:
futures.append(
torch.jit.fork(
self._process_sequence,
sequence,
)
)
sequence_matrix = []
for future in futures:
sequence_matrix.append(torch.jit.wait(future))
return torch.stack(sequence_matrix)
def _process_sequence(self, sequence: str) -> torch.Tensor:
sequence = self.computed_fill_value if sequence == "nan" else sequence
# If tokenizer is HF, we defer lowercase transformation to the tokenizer.
if self.lowercase and self.tokenizer_type != "hf_tokenizer":
sequence_str: str = sequence.lower()
else:
sequence_str: str = sequence
sequence_vector = torch.full([self.max_sequence_length], self.unit_to_id[self.padding_symbol])
if self.tokenizer_type == "hf_tokenizer":
# Handles start, stop, and unknown symbols implicitly
unit_sequence = self.tokenizer(sequence)
assert torch.jit.isinstance(unit_sequence, list[int])
# Ensures that the sequence lengths are aligned between the input and output tensors.
sequence_length = min(len(unit_sequence), self.max_sequence_length)
sequence_vector[:sequence_length] = torch.tensor(unit_sequence)[:sequence_length]
return sequence_vector
# If tokenizer is not HF, we manually convert tokens to IDs and insert start, stop, and unknown symbols.
unit_sequence = self.tokenizer(sequence_str)
assert torch.jit.isinstance(unit_sequence, list[str])
sequence_vector[0] = self.unit_to_id[self.start_symbol]
if len(unit_sequence) + 1 < self.max_sequence_length:
sequence_length = len(unit_sequence)
sequence_vector[len(unit_sequence) + 1] = self.unit_to_id[self.stop_symbol]
else:
sequence_length = self.max_sequence_length - 1
for i in range(sequence_length):
curr_unit = unit_sequence[i]
if curr_unit in self.unit_to_id:
curr_id = self.unit_to_id[curr_unit]
else:
curr_id = self.unit_to_id[self.unknown_symbol]
sequence_vector[i + 1] = curr_id
return sequence_vector
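

# Illustrative sketch, not part of Ludwig's API: a pure-Python rendering of the
# non-HF branch of _SequencePreprocessing._process_sequence above. The helper name
# and the default symbol strings are hypothetical placeholders; it assumes str2idx
# contains an id for every special symbol passed in.
def _sketch_pad_sequence(tokens, str2idx, max_len, start="<SOS>", stop="<EOS>", pad="<PAD>", unk="<UNK>"):
    # Start from a vector filled with the padding id, then write the start symbol.
    vector = [str2idx[pad]] * max_len
    vector[0] = str2idx[start]
    if len(tokens) + 1 < max_len:
        # Everything fits: the stop symbol goes right after the last token.
        length = len(tokens)
        vector[length + 1] = str2idx[stop]
    else:
        # Too long: keep max_len - 1 tokens, leaving no room for the stop symbol.
        length = max_len - 1
    for i in range(length):
        # Tokens missing from the vocabulary fall back to the unknown id.
        vector[i + 1] = str2idx.get(tokens[i], str2idx[unk])
    return vector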
class _SequencePostprocessing(torch.nn.Module):
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.max_sequence_length = int(metadata["max_sequence_length"])
self.idx2str = metadata["idx2str"]
self.unknown_symbol = UNKNOWN_SYMBOL
self.predictions_key = PREDICTIONS
self.probabilities_key = PROBABILITIES
self.probability_key = PROBABILITY
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
pred_predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
pred_probabilities = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.probabilities_key)
predictions: list[list[str]] = []
for sequence in pred_predictions:
sequence_predictions: list[str] = []
for i in range(self.max_sequence_length):
unit_id = int(sequence[i].item())
if unit_id < len(self.idx2str):
unit_prediction = self.idx2str[unit_id]
else:
unit_prediction = self.unknown_symbol
sequence_predictions.append(unit_prediction)
predictions.append(sequence_predictions)
probabilities, _ = torch.max(pred_probabilities, dim=-1)
probability = torch.sum(torch.log(probabilities.clamp(min=1e-10)), dim=-1)
return {
self.predictions_key: predictions,
self.probabilities_key: probabilities,
self.probability_key: probability,
}
class _SequencePredict(PredictModule):
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
probabilities = torch.softmax(logits, -1)
predictions = torch.argmax(logits, -1)
# predictions: [batch_size, sequence_length]
# probabilities: [batch_size, sequence_length, vocab_size]
# logits: [batch_size, sequence_length, vocab_size]
return {self.predictions_key: predictions, self.probabilities_key: probabilities, self.logits_key: logits}
class SequenceFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return SEQUENCE
@staticmethod
def cast_column(column, backend):
return column.astype(str)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
vocabulary = create_vocabulary(
column,
preprocessing_parameters["tokenizer"],
lowercase=preprocessing_parameters["lowercase"],
num_most_frequent=preprocessing_parameters["most_common"],
vocab_file=preprocessing_parameters["vocab_file"],
unknown_symbol=preprocessing_parameters["unknown_symbol"],
padding_symbol=preprocessing_parameters["padding_symbol"],
ngram_size=preprocessing_parameters["ngram_size"],
processor=backend.df_engine,
)
logger.info(
f"Max length of feature '{column.name}': {vocabulary.max_sequence_length} (without start and stop symbols)"
)
# Use sequence_length if provided, otherwise use max length found in dataset.
if preprocessing_parameters["sequence_length"] is not None:
logger.info(
f"Setting max length to sequence_length={preprocessing_parameters['sequence_length']} provided in "
f"preprocessing parameters"
)
max_sequence_length = preprocessing_parameters["sequence_length"]
else:
max_sequence_length = vocabulary.max_sequence_length
logger.info(f"Setting max length using dataset: {max_sequence_length} (including start and stop symbols)")
# If max_sequence_length is set and smaller than the current max length, truncate to it.
if (
preprocessing_parameters["max_sequence_length"] is not None
and preprocessing_parameters["max_sequence_length"] < max_sequence_length
):
logger.info(
f"Truncating max length with max_sequence_length={preprocessing_parameters['max_sequence_length']} "
f"from preprocessing parameters"
)
max_sequence_length = preprocessing_parameters["max_sequence_length"]
logger.info(f"Max sequence length is {max_sequence_length} for feature '{column.name}'")
return {
"idx2str": vocabulary.vocab,
"str2idx": vocabulary.str2idx,
"str2freq": vocabulary.str2freq,
"vocab_size": len(vocabulary.vocab),
"max_sequence_length": max_sequence_length,
}
@staticmethod
def feature_data(column, metadata, preprocessing_parameters: PreprocessingConfigDict, backend):
sequence_data = build_sequence_matrix(
sequences=column,
inverse_vocabulary=metadata["str2idx"],
tokenizer_type=preprocessing_parameters["tokenizer"],
length_limit=metadata["max_sequence_length"],
padding_symbol=preprocessing_parameters["padding_symbol"],
padding=preprocessing_parameters["padding"],
unknown_symbol=preprocessing_parameters["unknown_symbol"],
lowercase=preprocessing_parameters["lowercase"],
tokenizer_vocab_file=preprocessing_parameters["vocab_file"],
processor=backend.df_engine,
)
return sequence_data
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
sequence_data = SequenceFeatureMixin.feature_data(
input_df[feature_config[COLUMN]],
metadata[feature_config[NAME]],
preprocessing_parameters,
backend,
)
proc_df[feature_config[PROC_COLUMN]] = sequence_data
return proc_df
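

# Illustrative sketch, not part of Ludwig's API: the order in which get_feature_meta
# above resolves the final max_sequence_length. The helper name is hypothetical.
def _sketch_resolve_max_sequence_length(dataset_max_len, sequence_length=None, max_sequence_length=None):
    # An explicit sequence_length overrides the length observed in the dataset.
    resolved = sequence_length if sequence_length is not None else dataset_max_len
    # max_sequence_length acts only as an upper bound; it never lengthens sequences.
    if max_sequence_length is not None and max_sequence_length < resolved:
        resolved = max_sequence_length
    return resolved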
class SequenceInputFeature(SequenceFeatureMixin, InputFeature):
def __init__(self, input_feature_config: SequenceInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs: torch.Tensor, mask=None):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64]
assert len(inputs.shape) == 2
inputs_exp = inputs.type(torch.int32)
inputs_mask = torch.not_equal(inputs, SpecialSymbol.PADDING.value)
lengths = torch.sum(inputs_mask.type(torch.int32), dim=1)
encoder_output = self.encoder_obj(inputs_exp, mask=inputs_mask)
encoder_output[LENGTHS] = lengths
return encoder_output
@property
def input_dtype(self):
return torch.int32
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.vocab = feature_metadata["idx2str"]
feature_config.encoder.vocab_size = len(feature_metadata["idx2str"])
feature_config.encoder.max_sequence_length = feature_metadata["max_sequence_length"]
@staticmethod
def get_schema_cls():
return SequenceInputFeatureConfig
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.encoder_obj.config.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SequencePreprocessing(metadata)
class SequenceOutputFeature(SequenceFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: SequenceOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
super().__init__(output_feature_config, output_features, **kwargs)
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs: dict[str, torch.Tensor], target=None):
return self.decoder_obj(inputs, target=target)
def create_predict_module(self) -> PredictModule:
return _SequencePredict()
def get_prediction_set(self):
return self.decoder_obj.get_prediction_set()
@classmethod
def get_output_dtype(cls):
return torch.int32
@property
def input_shape(self) -> torch.Size:
# Dummy implementation.
return torch.Size([1])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.decoder_obj.config.max_sequence_length])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.decoder.vocab_size = feature_metadata["vocab_size"]
feature_config.decoder.max_sequence_length = feature_metadata["max_sequence_length"]
if isinstance(feature_config.loss.class_weights, (list, tuple)):
if len(feature_config.loss.class_weights) != feature_config.decoder.vocab_size:
raise ValueError(
f"The length of class_weights ({len(feature_config.loss.class_weights)}) is not compatible with "
f"the number of classes ({feature_config.decoder.vocab_size}) for feature {feature_config.column}. "
"Check the metadata JSON file to see the classes "
"and their order and consider there needs to be a weight "
"for the and class too."
)
if isinstance(feature_config.loss.class_weights, dict):
if feature_metadata["str2idx"].keys() != feature_config.loss.class_weights.keys():
raise ValueError(
f"The class_weights keys ({feature_config.loss.class_weights.keys()}) are not compatible with "
f'the classes ({feature_metadata["str2idx"].keys()}) of feature {feature_config.column}. '
"Check the metadata JSON file to see the classes "
"and consider there needs to be a weight "
"for the class too."
)
else:
class_weights = feature_config.loss.class_weights
idx2str = feature_metadata["idx2str"]
class_weights_list = [class_weights[s] for s in idx2str]
feature_config.loss.class_weights = class_weights_list
if feature_config.loss.class_similarities_temperature > 0:
if feature_config.loss.class_similarities is not None:
similarities = feature_config.loss.class_similarities
temperature = feature_config.loss.class_similarities_temperature
curr_row = 0
first_row_length = 0
is_first_row = True
for row in similarities:
if is_first_row:
first_row_length = len(row)
is_first_row = False
curr_row += 1
else:
curr_row_length = len(row)
if curr_row_length != first_row_length:
raise ValueError(
"The length of row {} of the class_similarities "
"of {} is {}, different from the length of "
"the first row {}. All rows must have "
"the same length.".format(
curr_row, feature_config.column, curr_row_length, first_row_length
)
)
else:
curr_row += 1
all_rows_length = first_row_length
if all_rows_length != len(similarities):
raise ValueError(
f"The class_similarities matrix of {feature_config.column} has "
f"{len(similarities)} rows and {all_rows_length} columns, "
"their number must be identical."
)
if all_rows_length != feature_config.decoder.vocab_size:
raise ValueError(
f"The size of the class_similarities matrix of {feature_config.column} is "
f"{all_rows_length}, different from the number of classes "
f"({feature_config.decoder.vocab_size}). Check the metadata JSON file to see the classes "
"and their order and "
"consider and class too."
)
similarities = np.array(similarities, dtype=np.float32)
for i in range(len(similarities)):
similarities[i, :] = softmax(similarities[i, :], temperature=temperature)
feature_config.loss.class_similarities = similarities
else:
raise ValueError(
"class_similarities_temperature > 0, "
"but no class_similarities are provided "
f"for feature {feature_config.column}"
)
@staticmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
# TODO(Justin): Add a confusion matrix, see
# https://github.com/ludwig-ai/ludwig/blob/tf-legacy/ludwig/features/sequence_feature.py#L411
return {}
def postprocess_predictions(
self,
result,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
lengths_col = f"{self.feature_name}_{LENGTHS}"
if predictions_col in result:
if "idx2str" in metadata:
def idx2str(row):
pred = row[predictions_col]
length = metadata["max_sequence_length"]
return [
metadata["idx2str"][token] if token < len(metadata["idx2str"]) else UNKNOWN_SYMBOL
for token in [pred[i] for i in range(length)]
]
result[predictions_col] = result.apply(idx2str, axis=1)
last_preds_col = f"{self.feature_name}_{LAST_PREDICTIONS}"
if last_preds_col in result:
if "idx2str" in metadata:
def last_idx2str(last_pred):
if last_pred < len(metadata["idx2str"]):
return metadata["idx2str"][last_pred]
return UNKNOWN_SYMBOL
result[last_preds_col] = result[last_preds_col].map(last_idx2str)
probs_col = f"{self.feature_name}_{PROBABILITIES}"
prob_col = f"{self.feature_name}_{PROBABILITY}"
if probs_col in result:
# currently does not return full probabilities because the result is usually huge:
# dataset x length x classes
# TODO: add a mechanism for letting the user decide to save it
result[probs_col] = result[probs_col].map(compute_token_probabilities)
result[prob_col] = result[probs_col].map(
partial(
compute_sequence_probability,
max_sequence_length=metadata["max_sequence_length"],
return_log_prob=True,
)
)
if lengths_col in result:
del result[lengths_col]
return result
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SequencePostprocessing(metadata)
@staticmethod
def get_schema_cls():
return SequenceOutputFeatureConfig
================================================
FILE: ludwig/features/set_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from typing import Any
import numpy as np
import torch
from ludwig.constants import COLUMN, HIDDEN, LOGITS, NAME, PREDICTIONS, PROBABILITIES, PROC_COLUMN, SET
from ludwig.features.base_feature import BaseFeatureMixin, InputFeature, OutputFeature, PredictModule
from ludwig.features.feature_utils import set_str_to_idx
from ludwig.schema.features.set_feature import SetInputFeatureConfig, SetOutputFeatureConfig
from ludwig.types import (
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import output_feature_utils
from ludwig.utils.strings_utils import create_vocabulary, UNKNOWN_SYMBOL
from ludwig.utils.tokenizers import get_tokenizer_from_registry, TORCHSCRIPT_COMPATIBLE_TOKENIZERS
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class _SetPreprocessing(torch.nn.Module):
"""Torchscript-enabled version of preprocessing done by SetFeatureMixin.add_feature_data.
If is_bag is true, forward returns a vector for each sample indicating counts of each token. Else, forward returns a
multi-hot vector for each sample indicating presence of each token.
"""
def __init__(self, metadata: TrainingSetMetadataDict, is_bag: bool = False):
super().__init__()
if metadata["preprocessing"]["tokenizer"] not in TORCHSCRIPT_COMPATIBLE_TOKENIZERS:
raise ValueError(
f"{metadata['preprocessing']['tokenizer']} is not supported by torchscript. Please use "
f"one of {TORCHSCRIPT_COMPATIBLE_TOKENIZERS}."
)
self.lowercase = metadata["preprocessing"]["lowercase"]
self.tokenizer = get_tokenizer_from_registry(metadata["preprocessing"]["tokenizer"])()
self.vocab_size = metadata["vocab_size"]
self.unknown_symbol = UNKNOWN_SYMBOL
self.unit_to_id = metadata["str2idx"]
self.is_bag = is_bag
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
"""Takes a list of strings and returns a tensor of counts for each token."""
if not torch.jit.isinstance(v, list[str]):
raise ValueError(f"Unsupported input: {v}")
if self.lowercase:
sequences = [sequence.lower() for sequence in v]
else:
sequences = v
unit_sequences = self.tokenizer(sequences)
# refines type of unit_sequences from Any to List[List[str]]
assert torch.jit.isinstance(unit_sequences, list[list[str]]), "unit_sequences is not a list of lists."
set_matrix = torch.zeros(len(unit_sequences), self.vocab_size, dtype=torch.float32)
for sample_idx, unit_sequence in enumerate(unit_sequences):
sequence_length = len(unit_sequence)
for i in range(sequence_length):
curr_unit = unit_sequence[i]
if curr_unit in self.unit_to_id:
curr_id = self.unit_to_id[curr_unit]
else:
curr_id = self.unit_to_id[self.unknown_symbol]
if self.is_bag:
set_matrix[sample_idx][curr_id] += 1
else:
set_matrix[sample_idx][curr_id] = 1
return set_matrix
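

# Illustrative sketch, not part of Ludwig's API: the difference between the bag
# (count) and set (multi-hot) encodings produced by _SetPreprocessing.forward above,
# for a single already-tokenized sample. The helper name is hypothetical, and it
# ASSUMES the unknown symbol is present in str2idx.
def _sketch_encode_set(tokens, str2idx, unk="<UNK>", is_bag=False):
    vector = [0.0] * len(str2idx)
    for token in tokens:
        # Out-of-vocabulary tokens are mapped to the unknown id.
        idx = str2idx.get(token, str2idx[unk])
        if is_bag:
            # Bag: count every occurrence of the token.
            vector[idx] += 1
        else:
            # Set: record presence only, however often the token repeats.
            vector[idx] = 1.0
    return vector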
class _SetPostprocessing(torch.nn.Module):
"""Torchscript-enabled version of postprocessing done by SetFeatureMixin.add_feature_data."""
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
self.idx2str = {i: v for i, v in enumerate(metadata["idx2str"])}
self.predictions_key = PREDICTIONS
self.probabilities_key = PROBABILITIES
self.unk = UNKNOWN_SYMBOL
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
probabilities = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.probabilities_key)
inv_preds: list[list[str]] = []
filtered_probs: list[torch.Tensor] = []
for sample_idx, sample in enumerate(predictions):
sample_preds: list[str] = []
pos_sample_idxs: list[int] = []
pos_class_idxs: list[int] = []
for class_idx, is_positive in enumerate(sample):
if is_positive == 1:
sample_preds.append(self.idx2str.get(class_idx, self.unk))
pos_sample_idxs.append(sample_idx)
pos_class_idxs.append(class_idx)
inv_preds.append(sample_preds)
filtered_probs.append(probabilities[pos_sample_idxs, pos_class_idxs])
return {
self.predictions_key: inv_preds,
self.probabilities_key: filtered_probs,
}
class _SetPredict(PredictModule):
def __init__(self, threshold):
super().__init__()
self.threshold = threshold
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
probabilities = torch.sigmoid(logits)
predictions = torch.greater_equal(probabilities, self.threshold)
predictions = predictions.type(torch.int64)
return {self.predictions_key: predictions, self.probabilities_key: probabilities, self.logits_key: logits}
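

# Illustrative sketch, not part of Ludwig's API: the sigmoid-then-threshold rule used
# by _SetPredict.forward above, written for one sample with plain Python floats. The
# helper name and the 0.5 default are hypothetical; very negative logits would
# overflow math.exp here, which the torch implementation does not suffer from.
import math


def _sketch_set_predict(logits, threshold=0.5):
    # Each class gets an independent probability (no softmax across classes).
    probabilities = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # A class is predicted when its probability reaches the threshold (inclusive).
    predictions = [int(p >= threshold) for p in probabilities]
    return predictions, probabilities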
class SetFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return SET
@staticmethod
def cast_column(column, backend):
return column.astype(str)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
vocabulary = create_vocabulary(
column,
preprocessing_parameters["tokenizer"],
num_most_frequent=preprocessing_parameters["most_common"],
lowercase=preprocessing_parameters["lowercase"],
add_special_symbols=False,
processor=backend.df_engine,
)
return {
"idx2str": vocabulary.vocab,
"str2idx": vocabulary.str2idx,
"str2freq": vocabulary.str2freq,
"vocab_size": len(vocabulary.str2idx),
"max_set_size": vocabulary.max_sequence_length,
}
@staticmethod
def feature_data(column, metadata, preprocessing_parameters: PreprocessingConfigDict, backend):
def to_dense(x):
feature_vector = set_str_to_idx(x, metadata["str2idx"], preprocessing_parameters["tokenizer"])
set_vector = np.zeros((len(metadata["str2idx"]),))
set_vector[feature_vector] = 1
return set_vector.astype(np.bool_)
return backend.df_engine.map_objects(column, to_dense)
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
proc_df[feature_config[PROC_COLUMN]] = SetFeatureMixin.feature_data(
input_df[feature_config[COLUMN]],
metadata[feature_config[NAME]],
preprocessing_parameters,
backend,
)
return proc_df
class SetInputFeature(SetFeatureMixin, InputFeature):
def __init__(self, input_feature_config: SetInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.bool, torch.int64, torch.float32]
encoder_output = self.encoder_obj(inputs)
return encoder_output
@property
def input_dtype(self):
return torch.bool
@property
def input_shape(self) -> torch.Size:
return torch.Size([len(self.encoder_obj.config.vocab)])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.vocab = feature_metadata["idx2str"]
@staticmethod
def get_schema_cls():
return SetInputFeatureConfig
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SetPreprocessing(metadata)
class SetOutputFeature(SetFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: SetOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
self.threshold = output_feature_config.threshold
super().__init__(output_feature_config, output_features, **kwargs)
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, **kwargs): # hidden
hidden = inputs[HIDDEN]
return self.decoder_obj(hidden)
def metric_kwargs(self) -> dict[str, Any]:
return {"threshold": self.threshold}
def create_predict_module(self) -> PredictModule:
return _SetPredict(self.threshold)
def get_prediction_set(self):
return {PREDICTIONS, PROBABILITIES, LOGITS}
@classmethod
def get_output_dtype(cls):
return torch.bool
@property
def input_shape(self) -> torch.Size:
return self.decoder_obj.input_shape
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.decoder_obj.config.num_classes])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.decoder.num_classes = feature_metadata["vocab_size"]
if isinstance(feature_config.loss.class_weights, (list, tuple)):
if len(feature_config.loss.class_weights) != feature_config.decoder.num_classes:
raise ValueError(
f"The length of class_weights ({len(feature_config.loss.class_weights)}) is not compatible with "
f"the number of classes ({feature_config.decoder.num_classes}) for feature {feature_config.name}. "
"Check the metadata JSON file to see the classes "
"and their order and consider there needs to be a weight "
"for the and class too."
)
if isinstance(feature_config.loss.class_weights, dict):
if feature_metadata["str2idx"].keys() != feature_config.loss.class_weights.keys():
raise ValueError(
f"The class_weights keys ({feature_config.loss.class_weights.keys()}) are not compatible with "
f'the classes ({feature_metadata["str2idx"].keys()}) of feature {feature_config.name}. '
"Check the metadata JSON file to see the classes "
"and consider there needs to be a weight "
"for the and class too."
)
else:
class_weights = feature_config.loss.class_weights
idx2str = feature_metadata["idx2str"]
class_weights_list = [class_weights[s] for s in idx2str]
feature_config.loss.class_weights = class_weights_list
@staticmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
# no overall stats, just return empty dictionary
return {}
def postprocess_predictions(
self,
result,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in result:
def idx2str(pred_set):
return [metadata["idx2str"][i] for i, pred in enumerate(pred_set) if pred]
result[predictions_col] = result[predictions_col].map(idx2str)
probabilities_col = f"{self.feature_name}_{PROBABILITIES}"
if probabilities_col in result:
def get_prob(prob_set):
# Cast to float32 because empty np.array objects are np.float64, causing mismatch errors during saving.
return np.array([prob for prob in prob_set if prob >= self.threshold], dtype=np.float32)
result[probabilities_col] = result[probabilities_col].map(get_prob)
return result
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SetPostprocessing(metadata)
@staticmethod
def get_schema_cls():
return SetOutputFeatureConfig
================================================
FILE: ludwig/features/text_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from functools import partial
import numpy as np
import torch
from torch import Tensor
from transformers import PreTrainedTokenizer
from ludwig.constants import (
COLUMN,
IGNORE_INDEX_TOKEN_ID,
LAST_PREDICTIONS,
LENGTHS,
NAME,
PREDICTIONS,
PREPROCESSING,
PROBABILITIES,
PROBABILITY,
PROC_COLUMN,
RESPONSE,
TEXT,
)
from ludwig.features.base_feature import BaseFeatureMixin, OutputFeature
from ludwig.features.feature_utils import compute_sequence_probability, compute_token_probabilities
from ludwig.features.sequence_feature import (
_SequencePostprocessing,
_SequencePreprocessing,
SequenceInputFeature,
SequenceOutputFeature,
)
from ludwig.modules.metric_registry import get_metric_tensor_input
from ludwig.schema.features.text_feature import TextInputFeatureConfig, TextOutputFeatureConfig
from ludwig.types import FeatureMetadataDict, ModelConfigDict, PreprocessingConfigDict, TrainingSetMetadataDict
from ludwig.utils.math_utils import softmax
from ludwig.utils.strings_utils import (
build_sequence_matrix,
create_vocabulary,
get_tokenizer,
SpecialSymbol,
UNKNOWN_SYMBOL,
Vocabulary,
)
logger = logging.getLogger(__name__)
def get_decoded_targets_and_predictions(
targets: Tensor,
predictions: dict[str, Tensor],
tokenizer: PreTrainedTokenizer,
) -> tuple[list[str], list[str]]:
"""Returns the decoded targets and predictions, accounting for IGNORE_INDEX_TOKEN_ID."""
# Ensure targets and predictions are on the same device
pred_tensor = predictions[PREDICTIONS]
if targets.device != pred_tensor.device:
targets = targets.to(pred_tensor.device)
sanitized_targets = torch.where(targets != IGNORE_INDEX_TOKEN_ID, targets, tokenizer.pad_token_id)
sanitized_predictions = torch.where(
targets != IGNORE_INDEX_TOKEN_ID,
pred_tensor,
tokenizer.pad_token_id,
)
decoded_targets = tokenizer.batch_decode(sanitized_targets, skip_special_tokens=True)
decoded_predictions = tokenizer.batch_decode(sanitized_predictions, skip_special_tokens=True)
return decoded_targets, decoded_predictions
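# Illustrative sketch (not part of Ludwig's API): the masking step in get_decoded_targets_and_predictions
# can be mimicked with plain lists. The helper name and the default ids below are assumptions made for
# this example; the real ids come from IGNORE_INDEX_TOKEN_ID and the tokenizer's pad_token_id.
def _example_mask_ignored_positions(targets, predictions, ignore_id=-100, pad_id=0):
    """Replaces positions whose target equals ignore_id with pad_id in both sequences."""
    # Targets: swap the ignore marker for the pad id so that decoding can skip it.
    sanitized_targets = [t if t != ignore_id else pad_id for t in targets]
    # Predictions: masked by the *target* positions, mirroring the torch.where condition above.
    sanitized_predictions = [p if t != ignore_id else pad_id for t, p in zip(targets, predictions)]
    return sanitized_targets, sanitized_predictions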
def _get_metadata_reconciled_max_sequence_length(
preprocessing_parameters: dict, vocabulary: Vocabulary
) -> tuple[int, int]:
"""Reconciles the different ways sequence length can be specified in preprocessing parameters.
If the max sequence length is explicitly specified, we use the minimum of the true maximum sequence length and
the explicitly specified value. If the explicitly specified value is less than the true maximum sequence length, we
log a warning.
If the max sequence length is not specified, we use the true maximum sequence length.
Returns:
Tuple(max_sequence_length, sequence_length_99ptile).
"""
# For sequence features with a fixed length specified by `sequence_length`, use this as the max_sequence_length.
if preprocessing_parameters["sequence_length"] is not None:
return preprocessing_parameters["sequence_length"], preprocessing_parameters["sequence_length"]
# Max sequence length is explicitly set. Use this as the max_sequence_length.
if preprocessing_parameters["max_sequence_length"] is not None:
if preprocessing_parameters["max_sequence_length"] < vocabulary.max_sequence_length:
logger.warning(
f"The max sequence length of the data, {vocabulary.max_sequence_length}, is longer than the max "
f"sequence length set in the config, {preprocessing_parameters['max_sequence_length']}. Note that this "
"will truncate all examples to max_sequence_length="
f"{preprocessing_parameters['max_sequence_length']}."
)
return (
min(vocabulary.max_sequence_length, preprocessing_parameters["max_sequence_length"]),
min(vocabulary.sequence_length_99ptile, preprocessing_parameters["max_sequence_length"]),
)
# Max sequence length is None. Use the max sequence length of the data.
return vocabulary.max_sequence_length, vocabulary.sequence_length_99ptile
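# Illustrative sketch (not part of Ludwig's API): the precedence applied by
# _get_metadata_reconciled_max_sequence_length, restated with plain integers. The helper and its
# parameter names are hypothetical; data_max and data_99ptile stand in for the Vocabulary attributes.
def _example_reconcile_max_sequence_length(sequence_length, max_sequence_length, data_max, data_99ptile):
    """Returns (max_sequence_length, sequence_length_99ptile) following the precedence above."""
    # 1. A fixed sequence_length wins outright.
    if sequence_length is not None:
        return sequence_length, sequence_length
    # 2. An explicit max_sequence_length caps both values observed in the data.
    if max_sequence_length is not None:
        return min(data_max, max_sequence_length), min(data_99ptile, max_sequence_length)
    # 3. Otherwise fall back to what was observed in the data.
    return data_max, data_99ptile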
class TextFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return TEXT
@staticmethod
def cast_column(column, backend):
return column.astype(str)
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
"""Returns all metadata for the given text feature.
Raises:
ValueError: If the tokenized prompt template is longer than the max sequence length.
"""
prompt_template = config.get("prompt", {}).get("template", "")
vocabulary: Vocabulary = create_vocabulary(
column,
tokenizer_type=preprocessing_parameters["tokenizer"],
num_most_frequent=preprocessing_parameters["most_common"],
lowercase=preprocessing_parameters["lowercase"],
vocab_file=preprocessing_parameters["vocab_file"],
unknown_symbol=preprocessing_parameters["unknown_symbol"],
padding_symbol=preprocessing_parameters["padding_symbol"],
pretrained_model_name_or_path=preprocessing_parameters["pretrained_model_name_or_path"],
ngram_size=preprocessing_parameters["ngram_size"],
compute_idf=preprocessing_parameters["compute_idf"],
processor=backend.df_engine,
prompt_template=prompt_template,
)
# Note: The vocabulary's max_sequence_length includes the prompt template, which is merged into the column prior
# to computing feature metadata.
logger.info(
f"Max length of feature '{column.name}': {vocabulary.max_sequence_length} (without start and stop symbols)"
)
max_sequence_length, max_sequence_length_99ptile = _get_metadata_reconciled_max_sequence_length(
preprocessing_parameters, vocabulary
)
if is_input_feature and max_sequence_length < vocabulary.prompt_template_num_tokens:
raise ValueError(
f"The input feature's max sequence length ({max_sequence_length}) is shorter than the prompt template "
f"length ({vocabulary.prompt_template_num_tokens}). This will truncate all unique information. "
"Consider making the template shorter or increasing the input feature's max sequence length to a "
f"value >> {vocabulary.prompt_template_num_tokens}."
)
logger.info(f"Max sequence length is {max_sequence_length} for feature '{column.name}'")
return {
"idx2str": vocabulary.vocab,
"str2idx": vocabulary.str2idx,
"str2freq": vocabulary.str2freq,
"str2idf": vocabulary.str2idf,
"vocab_size": len(vocabulary.vocab),
"max_sequence_length": max_sequence_length,
"max_sequence_length_99ptile": max_sequence_length_99ptile,
"pad_idx": vocabulary.pad_idx,
"padding_symbol": vocabulary.padding_symbol,
"unknown_symbol": vocabulary.unknown_symbol,
"prompt_template_num_tokens": vocabulary.prompt_template_num_tokens,
}
@staticmethod
def feature_data(column, metadata, preprocessing_parameters: PreprocessingConfigDict, backend) -> np.ndarray:
# TODO(1891): Remove backward compatibility hack once all models have been retrained with Ludwig after
# https://github.com/ludwig-ai/ludwig/pull/1859.
prefix = ""
padding_symbol_metadata_key = "padding_symbol"
unknown_symbol_metadata_key = "unknown_symbol"
if "str2idx" not in metadata:
prefix = "word_"
padding_symbol_metadata_key = "word_pad_symbol"
unknown_symbol_metadata_key = "word_unk_symbol"
# ensure preprocessing param values match the metadata determined from dataset
preprocessing_parameters["padding_symbol"] = metadata[padding_symbol_metadata_key]
preprocessing_parameters["unknown_symbol"] = metadata[unknown_symbol_metadata_key]
if preprocessing_parameters["fill_value"] == UNKNOWN_SYMBOL:
preprocessing_parameters["fill_value"] = preprocessing_parameters["unknown_symbol"]
if (
"computed_fill_value" in preprocessing_parameters
and preprocessing_parameters["computed_fill_value"] == UNKNOWN_SYMBOL
):
preprocessing_parameters["computed_fill_value"] = preprocessing_parameters["unknown_symbol"]
sequences = column
return build_sequence_matrix(
sequences=sequences,
inverse_vocabulary=metadata[f"{prefix}str2idx"],
tokenizer_type=preprocessing_parameters[f"{prefix}tokenizer"],
length_limit=metadata[f"{prefix}max_sequence_length"],
padding_symbol=metadata[padding_symbol_metadata_key],
padding=preprocessing_parameters["padding"],
unknown_symbol=metadata[unknown_symbol_metadata_key],
lowercase=preprocessing_parameters["lowercase"],
tokenizer_vocab_file=preprocessing_parameters[f"{prefix}vocab_file"],
pretrained_model_name_or_path=preprocessing_parameters["pretrained_model_name_or_path"],
processor=backend.df_engine,
)
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
proc_df[feature_config[PROC_COLUMN]] = TextFeatureMixin.feature_data(
input_df[feature_config[COLUMN]],
metadata[feature_config[NAME]],
preprocessing_parameters,
backend,
)
return proc_df
class TextInputFeature(TextFeatureMixin, SequenceInputFeature):
def __init__(self, input_feature_config: TextInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, encoder_obj=encoder_obj, **kwargs)
def forward(self, inputs, mask=None):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.int8, torch.int16, torch.int32, torch.int64]
assert len(inputs.shape) == 2
inputs_mask = torch.not_equal(inputs, SpecialSymbol.PADDING.value)
inputs_exp = inputs.type(torch.int32)
lengths = torch.sum(inputs_mask.type(torch.int32), dim=1)
encoder_output = self.encoder_obj(inputs_exp, mask=inputs_mask)
encoder_output[LENGTHS] = lengths
return encoder_output
@property
def input_dtype(self):
return torch.int32
@property
def input_shape(self):
return torch.Size([self.encoder_obj.config.max_sequence_length])
def update_config_after_module_init(self, feature_config):
feature_config.encoder = self.encoder_obj.config
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.vocab = feature_metadata["idx2str"]
feature_config.encoder.vocab_size = len(feature_metadata["idx2str"])
feature_config.encoder.max_sequence_length = feature_metadata["max_sequence_length"]
feature_config.encoder.pad_idx = feature_metadata["pad_idx"]
feature_config.encoder.num_tokens = len(feature_metadata["idx2str"])
feature_config.encoder.str2freq = feature_metadata["str2freq"]
feature_config.encoder.str2idf = feature_metadata["str2idf"]
feature_config.encoder.skip = feature_metadata[PREPROCESSING].get("cache_encoder_embeddings", False)
@staticmethod
def get_schema_cls():
return TextInputFeatureConfig
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SequencePreprocessing(metadata)
class TextOutputFeature(TextFeatureMixin, SequenceOutputFeature):
def __init__(
self,
output_feature_config: TextOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
super().__init__(output_feature_config, output_features, **kwargs)
@classmethod
def get_output_dtype(cls):
return torch.int32
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.decoder_obj.config.max_sequence_length])
def update_metrics(
self,
targets: Tensor,
predictions: dict[str, Tensor],
tokenizer: PreTrainedTokenizer | None = None,
) -> None:
"""Updates metrics with the given targets and predictions.
If a tokenizer is provided, as with the LLM model type, targets and predictions are decoded to text and
additional response-based metrics like BLEU and ROUGE are also computed.
Args:
targets: Tensor with target values for this output feature.
predictions: Dict of tensors returned by predictions().
tokenizer: Optional tokenizer used to decode targets and predictions for response-based metrics.
"""
if tokenizer is not None:
# Decode the targets and predictions to compute response-based metrics using the initialized tokenizer.
decoded_targets, decoded_predictions = get_decoded_targets_and_predictions(targets, predictions, tokenizer)
for metric_name, metric_fn in self._metric_functions.items():
prediction_key = get_metric_tensor_input(metric_name)
try:
if prediction_key == RESPONSE:
if tokenizer is not None:
# RESPONSE metrics cannot be computed if decoded texts are not provided.
# Decoded texts are only provided using the LLM model type.
if decoded_targets is not None and decoded_predictions is not None:
# Move metric function to the device of the predictions.
# For CUDA, it can be computed on any of the GPUs since it uses allgather to collect
# the results from all GPUs and compute the final metric.
# We use 'predictions' as the key since it is always present in the predictions dict.
device = "cuda" if predictions["predictions"].is_cuda else "cpu"
metric_fn = metric_fn.to(device)
if metric_name == "bleu":
# BLEU takes in targets as a list.
metric_fn.update(decoded_predictions, [decoded_targets])
else:
metric_fn.update(decoded_predictions, decoded_targets)
else:
metric_fn = metric_fn.to(predictions[prediction_key].device)
metric_fn.update(predictions[prediction_key].detach(), targets)
except Exception as e:
logger.info(f"Ran into error when calculating metric {metric_name}. Skipping. The error is: {e}")
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.decoder.vocab_size = feature_metadata["vocab_size"]
feature_config.decoder.max_sequence_length = feature_metadata["max_sequence_length"]
if isinstance(feature_config.loss.class_weights, (list, tuple)):
# [0, 0] for UNK and PAD
feature_config.loss.class_weights = [0, 0] + list(feature_config.loss.class_weights)
if len(feature_config.loss.class_weights) != feature_config.decoder.vocab_size:
raise ValueError(
f"The length of class_weights ({len(feature_config.loss.class_weights)}) is not compatible with "
f"the number of classes ({feature_config.decoder.vocab_size})"
)
if isinstance(feature_config.loss.class_weights, dict):
if feature_metadata["str2idx"].keys() != feature_config.loss.class_weights.keys():
raise ValueError(
f"The class_weights keys ({feature_config.loss.class_weights.keys()}) are not compatible with "
f'the classes ({feature_metadata["str2idx"].keys()}) of feature {feature_config.column}. '
"Check the metadata JSON file to see the classes "
"and consider there needs to be a weight "
"for the <UNK> class too."
)
else:
class_weights = feature_config.loss.class_weights
idx2str = feature_metadata["idx2str"]
class_weights_list = [class_weights[s] for s in idx2str]
feature_config.loss.class_weights = class_weights_list
if feature_config.loss.class_similarities_temperature > 0:
if feature_config.class_similarities:
distances = feature_config.class_similarities
temperature = feature_config.loss.class_similarities_temperature
for i in range(len(distances)):
distances[i, :] = softmax(distances[i, :], temperature=temperature)
feature_config.loss.class_similarities = distances
else:
raise ValueError(
"class_similarities_temperature > 0, "
"but no class similarities are provided "
f"for feature {feature_config.column}"
)
@staticmethod
def calculate_overall_stats(
predictions,
targets,
train_set_metadata,
):
return {}
def postprocess_predictions(
self,
result,
metadata,
):
# todo: refactor to reuse SequenceOutputFeature.postprocess_predictions
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
tokenizer = None
if metadata["preprocessing"]["tokenizer"] == "hf_tokenizer":
tokenizer = get_tokenizer(
metadata["preprocessing"]["tokenizer"],
metadata["preprocessing"]["vocab_file"],
metadata["preprocessing"]["pretrained_model_name_or_path"],
)
if predictions_col in result:
token_col = result[predictions_col]
def idx2str(pred):
if tokenizer is None:
return [
metadata["idx2str"][token] if token < len(metadata["idx2str"]) else UNKNOWN_SYMBOL
for token in pred
]
# Decode each token ID individually. In transformers 5.x, batch_decode
# on a 1D array treats it as a single sequence rather than individual tokens.
return [tokenizer.tokenizer.decode([int(token_id)], skip_special_tokens=True) for token_id in pred]
result[predictions_col] = token_col.map(idx2str)
# Add additional response column that represents the predicted text output
# as a single string instead of a list of tokens.
def idx2response(pred):
if tokenizer is None:
# This works because we treat each word as a token.
return " ".join(
[
metadata["idx2str"][token] if token < len(metadata["idx2str"]) else UNKNOWN_SYMBOL
for token in pred
]
)
return tokenizer.tokenizer.decode(pred, skip_special_tokens=True)
result[f"{self.feature_name}_response"] = token_col.map(idx2response)
last_preds_col = f"{self.feature_name}_{LAST_PREDICTIONS}"
if last_preds_col in result:
def last_idx2str(last_pred):
if last_pred < len(metadata["idx2str"]):
return metadata["idx2str"][last_pred]
return UNKNOWN_SYMBOL
result[last_preds_col] = result[last_preds_col].map(last_idx2str)
probs_col = f"{self.feature_name}_{PROBABILITIES}"
prob_col = f"{self.feature_name}_{PROBABILITY}"
# "Summarizes" the `result`'s probability-related output:
# - result[probs_col]:
# Each row is now a list of "max" probabilities. Each element is the probability of the argmax token for
# the given time step.
#
# Note that we intentionally do not return the full list of probabilities for each time step because the output
# of postprocess_predictions is saved to disk and the full probability distribution can be huge,
# especially for large vocab sizes:
# dataset_size x sequence_length x vocab_size
#
# TODO: Add a mechanism that lets the user save the full probability distribution if they want.
# - result[prob_col]:
# Each row is the overall probability of the sequence, i.e. the product of the max probabilities over
# all time steps, requested as a log probability (see return_log_prob=True below).
if probs_col in result:
# result[probs_col]: From PredictModule, each row has a list of size (sequence_length) of a list of
# probabilities of (vocab_size). compute_token_probabilities gets the maximum probability per timestep.
result[probs_col] = result[probs_col].map(compute_token_probabilities)
result[prob_col] = result[probs_col].map(
partial(
compute_sequence_probability,
max_sequence_length=metadata["max_sequence_length"],
return_log_prob=True,
),
)
lengths_col = f"{self.feature_name}_{LENGTHS}"
if lengths_col in result:
del result[lengths_col]
return result
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _SequencePostprocessing(metadata)
@staticmethod
def get_schema_cls():
return TextOutputFeatureConfig
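# Illustrative sketch (not part of Ludwig's API): the dict-to-list class weight conversion performed in
# TextOutputFeature.update_config_with_metadata, shown on plain Python data. The helper name is
# hypothetical, and the symbols in the usage note are made-up stand-ins for real metadata entries.
def _example_class_weights_to_list(class_weights, idx2str):
    """Orders a {symbol: weight} dict by vocabulary index, requiring an exact key match."""
    # Every vocabulary symbol needs a weight, and no extra keys are allowed.
    if set(class_weights) != set(idx2str):
        raise ValueError("class_weights keys must match the vocabulary symbols exactly")
    # idx2str lists symbols by index, so iterating over it yields weights in index order.
    return [class_weights[symbol] for symbol in idx2str]
# Usage: _example_class_weights_to_list({"b": 2.0, "a": 1.0, "<UNK>": 0.0}, ["<UNK>", "a", "b"])
# returns [0.0, 1.0, 2.0].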
================================================
FILE: ludwig/features/timeseries_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from typing import TYPE_CHECKING
import numpy as np
import torch
from ludwig.constants import COLUMN, HIDDEN, LOGITS, NAME, PREDICTIONS, PROC_COLUMN, TIMESERIES
from ludwig.features.base_feature import BaseFeatureMixin, OutputFeature, PredictModule
from ludwig.features.sequence_feature import SequenceInputFeature
from ludwig.features.vector_feature import _VectorPostprocessing, _VectorPredict
from ludwig.schema.features.timeseries_feature import TimeseriesInputFeatureConfig, TimeseriesOutputFeatureConfig
from ludwig.types import FeatureMetadataDict, ModelConfigDict, PreprocessingConfigDict, TrainingSetMetadataDict
from ludwig.utils.tokenizers import get_tokenizer_from_registry, TORCHSCRIPT_COMPATIBLE_TOKENIZERS
from ludwig.utils.types import Series, TorchscriptPreprocessingInput
if TYPE_CHECKING:
from ludwig.backend.base import Backend
logger = logging.getLogger(__name__)
def create_time_delay_embedding(
series: Series, window_size: int, horizon: int, padding_value: int, backend: "Backend"
) -> Series:
"""Time delay embedding from:
https://towardsdatascience.com/machine-learning-for-forecasting-transformations-and-feature-extraction-bbbea9de0ac2
Args:
series: Column-major timeseries data.
window_size: Size of the lookback sliding window for timeseries inputs.
horizon: Size of the forward-looking horizon for timeseries outputs.
padding_value: Value to pad out the window when there is not enough data around the observation.
backend: Backend whose dataframe engine is used to concatenate the shifted series.
Returns:
A column of timeseries window arrays in row-major format for training.
"""
# Replace default fill value of "" with nan as we will be assuming numeric values here
series = series.replace("", np.nan)
# Create the list of shifts we want to perform over the series.
# For backwards looking shifts, we want to include the current element, while for forward looking shifts we do not.
# Example:
# window_size=3, horizon=0 --> shift_offsets=[2, 1, 0]
# window_size=0, horizon=2 --> shift_offsets=[-1, -2]
shift_offsets = list(range(window_size - 1, -(horizon + 1), -1))
shifts = [series.shift(i) for i in shift_offsets]
df = backend.df_engine.df_lib.concat(shifts, axis=1)
df.columns = [f"__tmp_column_{j}" for j in shift_offsets]
return df.apply(lambda x: np.nan_to_num(np.array(x.tolist()).astype(np.float32), nan=padding_value), axis=1)
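# Illustrative sketch (not part of Ludwig's API): the same time delay embedding on a plain list, to make
# the shift offsets concrete. The helper name is hypothetical, and it assumes the pandas-style convention
# that shift(k) places the value from row i - k at row i.
def _example_time_delay_windows(values, window_size, horizon, padding_value=0.0):
    """Returns one window per row, padding positions that fall outside the series."""
    shift_offsets = list(range(window_size - 1, -(horizon + 1), -1))
    n = len(values)
    rows = []
    for i in range(n):
        row = []
        for offset in shift_offsets:
            j = i - offset  # source row for this shift
            row.append(values[j] if 0 <= j < n else padding_value)
        rows.append(row)
    return rows
# Usage: _example_time_delay_windows([1.0, 2.0, 3.0, 4.0], window_size=3, horizon=0) returns
# [[0.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]].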
class _TimeseriesPreprocessing(torch.nn.Module):
"""Torchscript-enabled version of preprocessing done by TimeseriesFeatureMixin.add_feature_data."""
def __init__(self, metadata: TrainingSetMetadataDict):
super().__init__()
if metadata["preprocessing"]["tokenizer"] not in TORCHSCRIPT_COMPATIBLE_TOKENIZERS:
raise ValueError(
f"{metadata['preprocessing']['tokenizer']} is not supported by torchscript. Please use "
f"one of {TORCHSCRIPT_COMPATIBLE_TOKENIZERS}."
)
self.tokenizer = get_tokenizer_from_registry(metadata["preprocessing"]["tokenizer"])()
self.padding = metadata["preprocessing"]["padding"]
self.padding_value = float(metadata["preprocessing"]["padding_value"])
self.max_timeseries_length = int(metadata["max_timeseries_length"])
self.computed_fill_value = metadata["preprocessing"]["computed_fill_value"]
def _process_str_sequence(self, sequence: list[str], limit: int) -> torch.Tensor:
float_sequence = [float(s) for s in sequence[:limit]]
return torch.tensor(float_sequence)
def _nan_to_fill_value(self, v: torch.Tensor) -> torch.Tensor:
if v.isnan().any():
tokenized_fill_value = self.tokenizer(self.computed_fill_value)
# refines type of sequences from Any to List[str]
assert torch.jit.isinstance(tokenized_fill_value, list[str])
return self._process_str_sequence(tokenized_fill_value, self.max_timeseries_length)
return v
def forward_list_of_tensors(self, v: list[torch.Tensor]) -> torch.Tensor:
v = [self._nan_to_fill_value(v_i) for v_i in v]
if self.padding == "right":
timeseries_matrix = torch.nn.utils.rnn.pad_sequence(v, batch_first=True, padding_value=self.padding_value)
timeseries_matrix = timeseries_matrix[:, : self.max_timeseries_length]
else:
reversed_timeseries = [torch.flip(v_i[: self.max_timeseries_length], dims=(0,)) for v_i in v]
reversed_timeseries_padded = torch.nn.utils.rnn.pad_sequence(
reversed_timeseries, batch_first=True, padding_value=self.padding_value
)
timeseries_matrix = torch.flip(reversed_timeseries_padded, dims=(1,))
return timeseries_matrix
def forward_list_of_strs(self, v: list[str]) -> torch.Tensor:
v = [self.computed_fill_value if s == "nan" else s for s in v]
sequences = self.tokenizer(v)
# refines type of sequences from Any to List[List[str]]
assert torch.jit.isinstance(sequences, list[list[str]]), "sequences is not a list of lists."
timeseries_matrix = torch.full(
[len(sequences), self.max_timeseries_length], self.padding_value, dtype=torch.float32
)
for sample_idx, str_sequence in enumerate(sequences):
limit = min(len(str_sequence), self.max_timeseries_length)
float_sequence = self._process_str_sequence(str_sequence, limit)
if self.padding == "right":
timeseries_matrix[sample_idx][:limit] = float_sequence
else: # if self.padding == "left"
timeseries_matrix[sample_idx][self.max_timeseries_length - limit :] = float_sequence
return timeseries_matrix
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
"""Takes a list of tensors or a list of strings and creates a padded torch.Tensor."""
if torch.jit.isinstance(v, list[torch.Tensor]):
return self.forward_list_of_tensors(v)
if torch.jit.isinstance(v, list[str]):
return self.forward_list_of_strs(v)
raise ValueError(f"Unsupported input: {v}")
class TimeseriesFeatureMixin(BaseFeatureMixin):
@staticmethod
def type():
return TIMESERIES
@staticmethod
def cast_column(column, backend):
return column
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
window_size = preprocessing_parameters.get("window_size", 0) or preprocessing_parameters.get("horizon", 0)
if window_size > 0:
# Column-major data
return {"max_timeseries_length": window_size}
column = column.astype(str)
tokenizer = get_tokenizer_from_registry(preprocessing_parameters["tokenizer"])()
max_length = 0
for timeseries in column:
processed_line = tokenizer(timeseries)
max_length = max(max_length, len(processed_line))
max_length = min(preprocessing_parameters["timeseries_length_limit"], max_length)
return {"max_timeseries_length": max_length}
@staticmethod
def build_matrix(timeseries, tokenizer_name, length_limit, padding_value, padding, backend):
tokenizer = get_tokenizer_from_registry(tokenizer_name)()
ts_vectors = backend.df_engine.map_objects(
timeseries, lambda ts: np.nan_to_num(np.array(tokenizer(ts)).astype(np.float32), nan=padding_value)
)
max_length = backend.df_engine.compute(ts_vectors.map(len).max())
if max_length < length_limit:
logger.debug(f"max length of {tokenizer_name}: {max_length} < limit: {length_limit}")
max_length = length_limit
def pad(vector):
padded = np.full((max_length,), padding_value, dtype=np.float32)
limit = min(vector.shape[0], max_length)
if padding == "right":
padded[:limit] = vector[:limit]
else: # if padding == "left"
padded[max_length - limit :] = vector[:limit]
return padded
return backend.df_engine.map_objects(ts_vectors, pad)
@staticmethod
def feature_data(column, metadata, preprocessing_parameters: PreprocessingConfigDict, backend):
padding_value = preprocessing_parameters["padding_value"]
window_size = preprocessing_parameters.get("window_size", 0)
horizon = preprocessing_parameters.get("horizon", 0)
if window_size > 0 or horizon > 0:
# Column-major data. Convert the column into the row-major embedding
return create_time_delay_embedding(column, window_size, horizon, padding_value, backend)
timeseries_data = TimeseriesFeatureMixin.build_matrix(
column,
preprocessing_parameters["tokenizer"],
metadata["max_timeseries_length"],
padding_value,
preprocessing_parameters["padding"],
backend,
)
return timeseries_data
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
proc_df[feature_config[PROC_COLUMN]] = TimeseriesFeatureMixin.feature_data(
input_df[feature_config[COLUMN]].astype(str),
metadata[feature_config[NAME]],
preprocessing_parameters,
backend,
)
return proc_df
class TimeseriesInputFeature(TimeseriesFeatureMixin, SequenceInputFeature):
def __init__(self, input_feature_config: TimeseriesInputFeatureConfig, encoder_obj=None, **kwargs):
# add required sequence encoder parameters for time series
input_feature_config.encoder.embedding_size = 1
input_feature_config.encoder.should_embed = False
# SequenceInputFeature's constructor initializes the encoder.
super().__init__(input_feature_config, encoder_obj=encoder_obj, **kwargs)
def forward(self, inputs, mask=None):
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.float16, torch.float32, torch.float64]
assert len(inputs.shape) == 2
inputs_exp = inputs.type(torch.float32)
encoder_output = self.encoder_obj(inputs_exp, mask=mask)
return encoder_output
@property
def input_shape(self) -> torch.Size:
return self.encoder_obj.input_shape
@property
def input_dtype(self):
return torch.float32
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.input_size = feature_metadata["max_timeseries_length"]
feature_config.encoder.max_sequence_length = feature_metadata["max_timeseries_length"]
@staticmethod
def get_schema_cls():
return TimeseriesInputFeatureConfig
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _TimeseriesPreprocessing(metadata)
class TimeseriesOutputFeature(TimeseriesFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: TimeseriesOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
self.horizon = output_feature_config.horizon
super().__init__(output_feature_config, output_features, **kwargs)
output_feature_config.decoder.output_size = self.horizon
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, **kwargs): # hidden
hidden = inputs[HIDDEN]
return self.decoder_obj(hidden)
def loss_kwargs(self):
return self.loss.to_dict()
def metric_kwargs(self):
return dict(num_outputs=self.output_shape[0])
def create_predict_module(self) -> PredictModule:
return _VectorPredict()
def get_prediction_set(self):
return {PREDICTIONS, LOGITS}
@classmethod
def get_output_dtype(cls):
return torch.float32
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.horizon])
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.horizon = feature_metadata["max_timeseries_length"]
@staticmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
# no overall stats, just return empty dictionary
return {}
def postprocess_predictions(
self,
result,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in result:
result[predictions_col] = result[predictions_col].map(lambda pred: pred.tolist())
return result
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _VectorPostprocessing()
@staticmethod
def get_schema_cls():
return TimeseriesOutputFeatureConfig
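# Illustrative sketch (not part of Ludwig's API): the left/right padding rule used by the pad() helper in
# TimeseriesFeatureMixin.build_matrix, shown on plain lists instead of NumPy arrays. The helper name is
# hypothetical. As in the original, overly long inputs keep only their first max_length values.
def _example_pad_timeseries(vector, max_length, padding_value=0.0, padding="right"):
    """Pads or truncates vector to exactly max_length entries."""
    padded = [padding_value] * max_length
    limit = min(len(vector), max_length)
    if padding == "right":
        # Values first, padding at the end.
        padded[:limit] = vector[:limit]
    else:  # padding == "left"
        # Padding first, values at the end.
        padded[max_length - limit:] = vector[:limit]
    return padded
# Usage: _example_pad_timeseries([1.0, 2.0], 4) returns [1.0, 2.0, 0.0, 0.0], and with padding="left"
# it returns [0.0, 0.0, 1.0, 2.0].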
================================================
FILE: ludwig/features/vector_feature.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import numpy as np
import torch
from ludwig.constants import COLUMN, HIDDEN, LOGITS, NAME, PREDICTIONS, PROC_COLUMN, VECTOR
from ludwig.features.base_feature import InputFeature, OutputFeature, PredictModule
from ludwig.schema.features.vector_feature import VectorInputFeatureConfig, VectorOutputFeatureConfig
from ludwig.types import (
FeatureMetadataDict,
FeaturePostProcessingOutputDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils import output_feature_utils
from ludwig.utils.types import TorchscriptPreprocessingInput
logger = logging.getLogger(__name__)
class _VectorPreprocessing(torch.nn.Module):
def forward(self, v: TorchscriptPreprocessingInput) -> torch.Tensor:
if torch.jit.isinstance(v, torch.Tensor):
out = v
elif torch.jit.isinstance(v, list[torch.Tensor]):
out = torch.stack(v)
elif torch.jit.isinstance(v, list[str]):
vectors = []
for sample in v:
vector = torch.tensor([float(x) for x in sample.split()], dtype=torch.float32)
vectors.append(vector)
out = torch.stack(vectors)
else:
raise ValueError(f"Unsupported input: {v}")
if out.isnan().any():
raise ValueError("Scripted NaN handling not implemented for Vector feature")
return out
class _VectorPostprocessing(torch.nn.Module):
def __init__(self):
super().__init__()
self.predictions_key = PREDICTIONS
self.logits_key = LOGITS
def forward(self, preds: dict[str, torch.Tensor], feature_name: str) -> FeaturePostProcessingOutputDict:
predictions = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.predictions_key)
logits = output_feature_utils.get_output_feature_tensor(preds, feature_name, self.logits_key)
return {self.predictions_key: predictions, self.logits_key: logits}
class _VectorPredict(PredictModule):
def forward(self, inputs: dict[str, torch.Tensor], feature_name: str) -> dict[str, torch.Tensor]:
logits = output_feature_utils.get_output_feature_tensor(inputs, feature_name, self.logits_key)
return {self.predictions_key: logits, self.logits_key: logits}
class VectorFeatureMixin:
@staticmethod
def type():
return VECTOR
@staticmethod
def cast_column(column, backend):
return column
@staticmethod
def get_feature_meta(
config: ModelConfigDict,
column,
preprocessing_parameters: PreprocessingConfigDict,
backend,
is_input_feature: bool,
) -> FeatureMetadataDict:
return {"preprocessing": preprocessing_parameters}
@staticmethod
def add_feature_data(
feature_config,
input_df,
proc_df,
metadata,
preprocessing_parameters: PreprocessingConfigDict,
backend,
skip_save_processed_input,
):
"""Expects all the vectors to be of the same size.
The vectors need to be whitespace delimited strings. Missing values are not handled.
"""
if len(input_df[feature_config[COLUMN]]) == 0:
raise ValueError("There are no vectors in the dataset provided")
# Convert the string of features into a numpy array
try:
proc_df[feature_config[PROC_COLUMN]] = backend.df_engine.map_objects(
input_df[feature_config[COLUMN]], lambda x: np.array(x.split(), dtype=np.float32)
)
except ValueError:
logger.error(
"Unable to read the vector data. Make sure that all the vectors"
" are of the same size and do not have missing/null values."
)
raise
# Determine vector size
vector_size = backend.df_engine.compute(proc_df[feature_config[PROC_COLUMN]].map(len).max())
vector_size_param = preprocessing_parameters.get("vector_size")
if vector_size_param is not None:
# TODO(travis): do we even need a user param for vector size if we're going to auto-infer it in all
# cases? Is this only useful as a sanity check for the user to make sure their data conforms to
# expectations?
if vector_size != vector_size_param:
raise ValueError(
"The user provided value for vector size ({}) does not "
"match the value observed in the data: {}".format(vector_size_param, vector_size)
)
else:
logger.debug(f"Detected vector size: {vector_size}")
metadata[feature_config[NAME]]["vector_size"] = vector_size
return proc_df
class VectorInputFeature(VectorFeatureMixin, InputFeature):
def __init__(self, input_feature_config: VectorInputFeatureConfig, encoder_obj=None, **kwargs):
super().__init__(input_feature_config, **kwargs)
# input_feature_config.encoder.input_size = input_feature_config.encoder.vector_size
if encoder_obj:
self.encoder_obj = encoder_obj
else:
self.encoder_obj = self.initialize_encoder(input_feature_config.encoder)
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
assert isinstance(inputs, torch.Tensor)
assert inputs.dtype in [torch.float32, torch.float64]
assert len(inputs.shape) == 2
inputs_encoded = self.encoder_obj(inputs)
return inputs_encoded
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.encoder_obj.config.input_size])
@property
def output_shape(self) -> torch.Size:
return self.encoder_obj.output_shape
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.encoder.input_size = feature_metadata["vector_size"]
@staticmethod
def create_preproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _VectorPreprocessing()
@staticmethod
def get_schema_cls():
return VectorInputFeatureConfig
class VectorOutputFeature(VectorFeatureMixin, OutputFeature):
def __init__(
self,
output_feature_config: VectorOutputFeatureConfig | dict,
output_features: dict[str, OutputFeature],
**kwargs,
):
self.vector_size = output_feature_config.vector_size
super().__init__(output_feature_config, output_features, **kwargs)
output_feature_config.decoder.output_size = self.vector_size
self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
self._setup_loss()
self._setup_metrics()
def logits(self, inputs, **kwargs): # hidden
hidden = inputs[HIDDEN]
return self.decoder_obj(hidden)
def metric_kwargs(self):
return dict(num_outputs=self.output_shape[0])
def create_predict_module(self) -> PredictModule:
return _VectorPredict()
def get_prediction_set(self):
return {PREDICTIONS, LOGITS}
@classmethod
def get_output_dtype(cls):
return torch.float32
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.vector_size])
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@staticmethod
def update_config_with_metadata(feature_config, feature_metadata, *args, **kwargs):
feature_config.vector_size = feature_metadata["vector_size"]
@staticmethod
def calculate_overall_stats(predictions, targets, train_set_metadata):
# no overall stats, just return empty dictionary
return {}
def postprocess_predictions(
self,
result,
metadata,
):
predictions_col = f"{self.feature_name}_{PREDICTIONS}"
if predictions_col in result:
result[predictions_col] = result[predictions_col].map(lambda pred: pred.tolist())
return result
@staticmethod
def create_postproc_module(metadata: TrainingSetMetadataDict) -> torch.nn.Module:
return _VectorPostprocessing()
@staticmethod
def get_schema_cls():
return VectorOutputFeatureConfig
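# Illustrative sketch (not part of Ludwig): the vector feature code above expects each
# sample to be a whitespace-delimited string of numbers, with every vector the same size.
# The helper below mirrors that parsing and the optional `vector_size` sanity check using
# only the standard library. The name `parse_vectors` and the explicit equal-length check
# are assumptions made for illustration; the real code relies on NumPy and the dataframe
# backend instead.

```python
from __future__ import annotations


def parse_vectors(rows: list[str], expected_size: int | None = None) -> list[list[float]]:
    """Parse whitespace-delimited strings into equal-length float vectors."""
    if len(rows) == 0:
        raise ValueError("There are no vectors in the dataset provided")

    # Split each row on whitespace and convert every token to a float.
    vectors = [[float(token) for token in row.split()] for row in rows]

    # Infer the vector size from the data, as the preprocessing step does.
    vector_size = max(len(vector) for vector in vectors)
    if any(len(vector) != vector_size for vector in vectors):
        raise ValueError("All vectors must be of the same size")

    # Optional sanity check against a user-provided size.
    if expected_size is not None and vector_size != expected_size:
        raise ValueError(
            f"The user provided value for vector size ({expected_size}) does not "
            f"match the value observed in the data: {vector_size}"
        )
    return vectors


if __name__ == "__main__":
    print(parse_vectors(["1.0 2.0 3.0", "4 5 6"], expected_size=3))
```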
================================================
FILE: ludwig/forecast.py
================================================
import argparse
import logging
import sys
import pandas as pd
from ludwig.api import LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
logger = logging.getLogger(__name__)
def forecast_cli(
model_path: str,
dataset: str | dict | pd.DataFrame = None,
data_format: str | None = None,
horizon: int = 1,
output_directory: str | None = None,
output_format: str = "parquet",
callbacks: list[Callback] = None,
backend: Backend | str = None,
logging_level: int = logging.INFO,
**kwargs,
) -> None:
"""Loads pre-trained model to forecast on the provided dataset.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the prediction.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified.
:param horizon: (int, default: `1`) How many samples into the future to forecast.
:param output_directory: (str, default: `None`) the directory that
will contain the forecasted values.
:param output_format: (str) format of the output dataset.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param logging_level: (int) Log level that will be sent to stderr.
# Returns
:return: (None)
"""
model = LudwigModel.load(
model_path,
logging_level=logging_level,
backend=backend,
callbacks=callbacks,
)
model.forecast(
dataset=dataset,
data_format=data_format,
horizon=horizon,
output_directory=output_directory,
output_format=output_format,
)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script loads a pretrained model and uses it to forecast",
prog="ludwig forecast",
usage="%(prog)s [options]",
)
parser.add_argument(
"-n", "--horizon", help="horizon, or number of steps in the future to forecast", type=int, default=1
)
# ---------------
# Data parameters
# ---------------
parser.add_argument("--dataset", help="input data file path", required=True)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"tables",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
# -------------------------
# Output results parameters
# -------------------------
parser.add_argument(
"-od", "--output_directory", type=str, default="results", help="directory that contains the results"
)
parser.add_argument(
"-of",
"--output_format",
help="format to write the output dataset",
default="parquet",
choices=[
"csv",
"parquet",
],
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("forecast", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.forecast")
args.backend = initialize_backend(args.backend)
if args.backend.is_coordinator():
print_ludwig("Forecast", LUDWIG_VERSION)
logger.info(f"Dataset path: {args.dataset}")
logger.info(f"Model path: {args.model_path}")
logger.info("")
forecast_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
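# Illustrative sketch (not part of Ludwig): `cli` above parses flags with argparse, maps
# the logging level name to a numeric level, and forwards everything to `forecast_cli`
# via `**vars(args)`. The stand-alone example below shows the same pattern. `LEVELS`,
# `run_forecast` and `parse_and_dispatch` are hypothetical stand-ins for Ludwig's logging
# registry and `forecast_cli`; they only echo the arguments they receive.

```python
import argparse
import logging

# Hypothetical stand-in for a name -> level registry.
LEVELS = {
    "critical": logging.CRITICAL,
    "error": logging.ERROR,
    "warning": logging.WARNING,
    "info": logging.INFO,
    "debug": logging.DEBUG,
    "notset": logging.NOTSET,
}


def run_forecast(model_path, dataset, horizon=1, logging_level=logging.INFO, **kwargs):
    """Hypothetical target function: returns the arguments it was called with."""
    return {
        "model_path": model_path,
        "dataset": dataset,
        "horizon": horizon,
        "logging_level": logging_level,
    }


def parse_and_dispatch(argv):
    parser = argparse.ArgumentParser(prog="demo forecast")
    parser.add_argument("-n", "--horizon", type=int, default=1)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("-m", "--model_path", required=True)
    parser.add_argument("-l", "--logging_level", default="info", choices=list(LEVELS))
    args = parser.parse_args(argv)

    # Convert the level name into its numeric value before dispatching.
    args.logging_level = LEVELS[args.logging_level]

    # vars(args) turns the Namespace into a dict keyed by argument name.
    return run_forecast(**vars(args))


if __name__ == "__main__":
    print(parse_and_dispatch(["--dataset", "data.csv", "-m", "results/model", "-n", "3"]))
```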
================================================
FILE: ludwig/globals.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
LUDWIG_VERSION = "0.11.2"
MODEL_FILE_NAME = "model"
MODEL_WEIGHTS_FILE_NAME = "model_weights"
MODEL_HYPERPARAMETERS_FILE_NAME = "model_hyperparameters.json"
TRAIN_SET_METADATA_FILE_NAME = "training_set_metadata.json"
TRAINING_PROGRESS_TRACKER_FILE_NAME = "training_progress.json"
TRAINING_CHECKPOINTS_DIR_PATH = "training_checkpoints"
TEST_STATISTICS_FILE_NAME = "test_statistics.json"
DESCRIPTION_FILE_NAME = "description.json"
PREDICTIONS_PARQUET_FILE_NAME = "predictions.parquet"
PREDICTIONS_SHAPES_FILE_NAME = "predictions.shapes.json"
TRAINING_PREPROC_FILE_NAME = "training.hdf5"
HYPEROPT_STATISTICS_FILE_NAME = "hyperopt_statistics.json"
CONFIG_YAML = "config.yaml"
DISABLE_PROGRESSBAR = False
def set_disable_progressbar(value):
global DISABLE_PROGRESSBAR
DISABLE_PROGRESSBAR = value
def is_progressbar_disabled():
return DISABLE_PROGRESSBAR
================================================
FILE: ludwig/hyperopt/__init__.py
================================================
================================================
FILE: ludwig/hyperopt/execution.py
================================================
import contextlib
import copy
import datetime
import glob
import json
import logging
import os
import shutil
import sys
import tempfile
import threading
import time
import traceback
import uuid
from collections.abc import Callable
from functools import lru_cache
from inspect import signature
from pathlib import Path
from typing import Any
import ray
from ray import tune
from ray.tune import ExperimentAnalysis, PlacementGroupFactory, register_trainable, Stopper
from ray.tune.schedulers.resource_changing_scheduler import DistributeResources, ResourceChangingScheduler
from ray.tune.search import BasicVariantGenerator, ConcurrencyLimiter, SEARCH_ALG_IMPORT
from ray.tune.utils import wait_for_gpu
from ray.util.queue import Queue as RayQueue
from ludwig.api import LudwigModel
from ludwig.backend import initialize_backend, RAY
from ludwig.backend.ray import initialize_ray
from ludwig.callbacks import Callback
from ludwig.constants import MAXIMIZE, TEST, TRAINER, TRAINING, TYPE, VALIDATION
from ludwig.hyperopt.results import HyperoptResults, TrialResults
from ludwig.hyperopt.search_algos import get_search_algorithm
from ludwig.hyperopt.utils import load_json_values, substitute_parameters
from ludwig.modules.metric_modules import get_best_function
from ludwig.schema.model_types.utils import merge_with_defaults
from ludwig.utils import metric_utils
from ludwig.utils.data_utils import hash_dict, NumpyEncoder
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import has_remote_protocol, safe_move_file
from ludwig.utils.misc_utils import get_from_registry
logger = logging.getLogger(__name__)
def _patch_bohb_configspace_conversion():
"""Monkey-patch TuneBOHB.convert_search_space for ConfigSpace 1.x compatibility.
ConfigSpace 1.x removed the `q` (quantization) parameter from hyperparameter classes.
Ray Tune's BOHB integration still passes `q=...`, so we patch the converter to drop it.
"""
try:
# Check if ConfigSpace 1.x (no 'q' parameter)
import inspect
import math
import ConfigSpace
from ray.tune.search.bohb.bohb_search import TuneBOHB
from ray.tune.search.sample import Categorical, Float, Integer, LogUniform, Normal, Quantized, Uniform
from ray.tune.search.variant_generator import parse_spec_vars
from ray.tune.utils import flatten_dict
sig = inspect.signature(ConfigSpace.UniformFloatHyperparameter.__init__)
if "q" in sig.parameters:
return # Old ConfigSpace, no patching needed
@staticmethod
def convert_search_space(spec):
resolved_vars, domain_vars, grid_vars = parse_spec_vars(spec)
if grid_vars:
raise ValueError(
"Grid search parameters cannot be automatically converted " "to a TuneBOHB search space."
)
spec = flatten_dict(spec, prevent_delimiter=True)
resolved_vars, domain_vars, grid_vars = parse_spec_vars(spec)
def resolve_value(par, domain):
quantize = None
sampler = domain.get_sampler()
if isinstance(sampler, Quantized):
quantize = sampler.q
sampler = sampler.sampler
if isinstance(domain, Float):
if isinstance(sampler, LogUniform):
lower = domain.lower
upper = domain.upper
if quantize:
lower = math.ceil(domain.lower / quantize) * quantize
upper = math.floor(domain.upper / quantize) * quantize
return ConfigSpace.UniformFloatHyperparameter(par, lower=lower, upper=upper, log=True)
elif isinstance(sampler, Uniform):
lower = domain.lower
upper = domain.upper
if quantize:
lower = math.ceil(domain.lower / quantize) * quantize
upper = math.floor(domain.upper / quantize) * quantize
return ConfigSpace.UniformFloatHyperparameter(par, lower=lower, upper=upper, log=False)
elif isinstance(sampler, Normal):
return ConfigSpace.hyperparameters.NormalFloatHyperparameter(
par, mu=sampler.mean, sigma=sampler.sd, log=False
)
elif isinstance(domain, Integer):
if isinstance(sampler, LogUniform):
lower = domain.lower
upper = domain.upper
if quantize:
lower = math.ceil(domain.lower / quantize) * quantize
upper = math.floor(domain.upper / quantize) * quantize
else:
upper -= 1
return ConfigSpace.UniformIntegerHyperparameter(par, lower=lower, upper=upper, log=True)
elif isinstance(sampler, Uniform):
lower = domain.lower
upper = domain.upper
if quantize:
lower = math.ceil(domain.lower / quantize) * quantize
upper = math.floor(domain.upper / quantize) * quantize
else:
upper -= 1
return ConfigSpace.UniformIntegerHyperparameter(par, lower=lower, upper=upper, log=False)
elif isinstance(domain, Categorical):
if isinstance(sampler, Uniform):
return ConfigSpace.CategoricalHyperparameter(par, choices=domain.categories)
raise ValueError(
"TuneBOHB does not support parameters of type "
"`{}` with samplers of type `{}`".format(type(domain).__name__, type(domain.sampler).__name__)
)
cs = ConfigSpace.ConfigurationSpace()
for path, domain in domain_vars:
par = "/".join(str(p) for p in path)
value = resolve_value(par, domain)
cs.add_hyperparameter(value)
return cs
TuneBOHB.convert_search_space = convert_search_space
logger.info("Patched TuneBOHB.convert_search_space for ConfigSpace 1.x compatibility")
except ImportError:
pass # BOHB not installed
_patch_bohb_configspace_conversion()
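# Illustrative sketch (not part of Ludwig): the patch above decides whether it is needed by
# inspecting a constructor signature for the `q` parameter. The example below shows that
# detection technique in isolation. `old_api`, `new_api`, `accepts_parameter` and `build`
# are hypothetical names; they stand in for the ConfigSpace hyperparameter constructors
# before and after the `q` argument was removed.

```python
import inspect


def accepts_parameter(func, name: str) -> bool:
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


def old_api(par, lower, upper, q=None):
    """Hypothetical older constructor that still accepts quantization."""
    return (par, lower, upper, q)


def new_api(par, lower, upper):
    """Hypothetical newer constructor without the `q` parameter."""
    return (par, lower, upper)


def build(factory, par, lower, upper, q=None):
    """Call `factory`, passing `q` only when the factory supports it."""
    kwargs = {"lower": lower, "upper": upper}
    if q is not None and accepts_parameter(factory, "q"):
        kwargs["q"] = q
    return factory(par, **kwargs)


if __name__ == "__main__":
    print(build(old_api, "lr", 0.0, 1.0, q=0.1))
    print(build(new_api, "lr", 0.0, 1.0, q=0.1))
```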
try:
from ludwig.backend.ray import RayBackend
# TODO: refactor this into an interface
def _is_ray_backend(backend) -> bool:
if isinstance(backend, str):
return backend == RAY
return isinstance(backend, RayBackend)
except ImportError as e:
logger.warning(
f"ImportError (execution.py) failed to import RayBackend with error: \n\t{e}. "
"The LocalBackend will be used instead. If you want to use the RayBackend, please install ludwig[distributed]."
)
class RayBackend:
pass
def _is_ray_backend(backend) -> bool:
return False
def identity(x):
return x
def _get_relative_checkpoints_dir_parts(path: Path):
return path.parts[-2:]
# Following disabled at the moment, expected to be re-enabled pending https://github.com/ludwig-ai/ludwig/issues/2039
def ray_resource_allocation_function(
trial_runner: "trial_runner.TrialRunner", # noqa
trial: "Trial", # noqa
result: dict[str, Any],
scheduler: "ResourceChangingScheduler",
):
"""Determine resources to allocate to running trials."""
pgf = DistributeResources(trial_runner, trial, result, scheduler)
# restore original base trial resources
# create bundles
if scheduler.base_trial_resources.required_resources.get("GPU", 0):
bundles = [{"CPU": 1, "GPU": 1}] * int(pgf.required_resources["GPU"])
else:
bundles = [{"CPU": 1}] * (int(pgf.required_resources["CPU"] - 0.001))
# we can't set Trial actor's CPUs to 0 so we just go very low
bundles = [{"CPU": 0.001}] + bundles
pgf = PlacementGroupFactory(bundles)
return pgf
def _create_tune_checkpoint(save_path):
"""Create a Ray Tune Checkpoint from a model save path."""
def ignore_dot_files(src, files):
return [f for f in files if f.startswith(".")]
tmpdir = tempfile.mkdtemp()
checkpoint_model = os.path.join(tmpdir, "model")
if os.path.exists(save_path):
copy_id = uuid.uuid4()
tmp_dst = f"{checkpoint_model}.{copy_id}.tmp"
shutil.copytree(save_path, tmp_dst, ignore=ignore_dot_files)
try:
os.rename(tmp_dst, checkpoint_model)
except Exception:
shutil.rmtree(tmp_dst)
return tune.Checkpoint.from_directory(tmpdir)
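# Illustrative sketch (not part of Ludwig): `_create_tune_checkpoint` above copies the model
# directory to a uniquely named temporary destination, skipping dot files, and then renames
# it into place so the final directory appears in a single step. The stand-alone version
# below shows the same copy-then-rename pattern using only the standard library. The name
# `stage_directory` is hypothetical, and the Ray Tune `Checkpoint` wrapping is left out.

```python
import os
import shutil
import tempfile
import uuid


def stage_directory(src: str, name: str = "model") -> str:
    """Copy `src` into a fresh temp directory as `name`, skipping dot files."""

    def ignore_dot_files(_directory, files):
        # copytree skips every entry returned from this callback.
        return [f for f in files if f.startswith(".")]

    staging_root = tempfile.mkdtemp()
    final_dst = os.path.join(staging_root, name)
    if os.path.exists(src):
        # Copy to a unique temporary name first, then rename into place.
        tmp_dst = f"{final_dst}.{uuid.uuid4()}.tmp"
        shutil.copytree(src, tmp_dst, ignore=ignore_dot_files)
        try:
            os.rename(tmp_dst, final_dst)
        except OSError:
            # Clean up the partial copy if the rename fails.
            shutil.rmtree(tmp_dst)
    return staging_root


if __name__ == "__main__":
    source = tempfile.mkdtemp()
    open(os.path.join(source, "weights.bin"), "w").close()
    open(os.path.join(source, ".lock"), "w").close()
    print(os.listdir(os.path.join(stage_directory(source), "model")))
```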
class RayTuneExecutor:
def __init__(
self,
parameters: dict,
output_feature: str,
metric: str,
goal: str,
split: str,
search_alg: dict | None = None,
cpu_resources_per_trial: int = None,
gpu_resources_per_trial: int = None,
kubernetes_namespace: str = None,
time_budget_s: int | float | datetime.timedelta = None,
max_concurrent_trials: int | None = None,
num_samples: int = 1,
scheduler: dict | None = None,
**kwargs,
) -> None:
if ray is None:
raise ImportError("ray module is not installed. To install it, try running pip install ray")
self.output_feature = output_feature
self.metric = metric
self.split = split
initialize_ray()
self.search_space, self.decode_ctx = self._get_search_space(parameters)
self.num_samples = num_samples
self.goal = goal
self.search_algorithm = get_search_algorithm(search_alg)
self.scheduler = None if scheduler is None else tune.create_scheduler(scheduler[TYPE], **scheduler)
self.trial_id = 0
self.cpu_resources_per_trial = cpu_resources_per_trial
self.gpu_resources_per_trial = gpu_resources_per_trial
self.kubernetes_namespace = kubernetes_namespace
self.time_budget_s = time_budget_s
self.max_concurrent_trials = max_concurrent_trials
self.sync_config = None
self.sync_client = None
# Head node is the node to which all checkpoints are synced if running on a K8s cluster.
self.head_node_ip = ray.util.get_node_ip_address()
def _get_search_space(self, parameters: dict) -> tuple[dict, dict]:
"""Encode search space parameters as JSON with context for decoding."""
config = {}
ctx = {}
for param, values in parameters.items():
# Encode list and dict types as JSON encoded strings to
# workaround type limitations of the underlying frameworks
values = self.encode_values(param, values, ctx)
param_search_type = values["space"].lower()
if hasattr(tune, param_search_type):
param_search_space = getattr(tune, param_search_type)
else:
raise ValueError(f"'{param_search_type}' is not a supported Ray Tune search space")
param_search_input_args = {}
param_search_space_sig = signature(param_search_space)
for arg in param_search_space_sig.parameters.values():
if arg.name in values:
param_search_input_args[arg.name] = values[arg.name]
else:
if arg.default is arg.empty:
raise ValueError(f"Parameter '{arg}' not defined for {param}")
config[param] = param_search_space(**param_search_input_args)
return config, ctx
@staticmethod
def encode_values(param: str, values: dict, ctx: dict) -> dict:
"""JSON encodes any search spaces whose values are lists / dicts.
Only applies to grid search and choice options. See here for details:
https://docs.ray.io/en/master/tune/api_docs/search_space.html#random-distributions-api
"""
values = values.copy()
for key in ["values", "categories"]:
if key in values and not isinstance(values[key][0], (int, float)):
values[key] = [json.dumps(v) for v in values[key]]
ctx[param] = json.loads
return values
@staticmethod
def decode_values(config: dict, ctx: dict) -> dict:
"""Decode config values with the decode function in the context.
Uses the identity function if no encoding is needed.
"""
return {key: ctx.get(key, identity)(value) for key, value in config.items()}
def _has_metric(self, stats, split):
if not stats:
return False
if split is not None:
if split not in stats:
return False
stats = stats[split]
if self.output_feature not in stats:
return False
stats = stats[self.output_feature]
if self.metric not in stats:
return False
stats = stats[self.metric]
return len(stats) > 0
def _has_eval_metric(self, stats):
if stats is None:
return False
if self.output_feature not in stats:
return False
stats = stats[self.output_feature]
for metric_part in self.metric.split("."):
if not isinstance(stats, dict) or metric_part not in stats:
return False
stats = stats[metric_part]
return isinstance(stats, float)
def get_metric_score(self, train_stats) -> float:
if self._has_metric(train_stats, VALIDATION):
logger.info("Returning metric score from training (validation) statistics")
return self.get_metric_score_from_train_stats(train_stats, VALIDATION)
elif self._has_metric(train_stats, TRAINING):
logger.info("Returning metric score from training split statistics, as no validation was given")
return self.get_metric_score_from_train_stats(train_stats, TRAINING)
else:
raise RuntimeError("Unable to obtain metric score from missing training (validation) statistics")
def get_metric_score_from_eval_stats(self, eval_stats) -> float | list:
stats = eval_stats[self.output_feature]
for metric_part in self.metric.split("."):
if isinstance(stats, dict):
if metric_part in stats:
stats = stats[metric_part]
else:
raise ValueError(f"Evaluation statistics do not contain the metric {self.metric}")
else:
raise ValueError(f"Evaluation statistics do not contain the metric {self.metric}")
if not isinstance(stats, float):
raise ValueError(f"The metric {self.metric} in evaluation statistics is not a numerical value: {stats}")
return stats
def get_metric_score_from_train_stats(self, train_stats, select_split=None) -> float:
select_split = select_split or VALIDATION
# grab the results of the model with highest validation test performance
train_valiset_stats = train_stats[select_split]
validation_field_result = train_valiset_stats[self.output_feature]
best_function = get_best_function(self.metric)
# results of the model with highest validation test performance
epoch_best_validation_metric, best_validation_metric = best_function(
enumerate(validation_field_result[self.metric]), key=lambda pair: pair[1]
)
return best_validation_metric
def sort_hyperopt_results(self, hyperopt_results):
return sorted(
hyperopt_results, key=lambda hp_res: hp_res.metric_score, reverse=self.goal == MAXIMIZE
)
@property
def _cpu_resources_per_trial_non_none(self):
return self.cpu_resources_per_trial if self.cpu_resources_per_trial is not None else 1
@property
def _gpu_resources_per_trial_non_none(self):
return self.gpu_resources_per_trial if self.gpu_resources_per_trial is not None else 0
def _get_remote_checkpoint_dir(self, trial_dir: Path) -> str | tuple[str, str] | None:
"""Get the path to remote checkpoint directory."""
if self.sync_config is None:
return None
if self.sync_config.upload_dir is not None:
# Cloud storage sync config
remote_checkpoint_dir = os.path.join(
self.sync_config.upload_dir, *_get_relative_checkpoints_dir_parts(trial_dir)
)
return remote_checkpoint_dir
elif self.kubernetes_namespace is not None:
# Kubernetes sync config. Returns driver node name and path.
# When running on kubernetes, each trial is rsynced to the node running the main process.
node_name = self._get_kubernetes_node_address_by_ip()(self.head_node_ip)
return (node_name, trial_dir)
else:
logger.warning(
"Checkpoint syncing disabled, as syncing is only supported to remote cloud storage or on Kubernetes "
"clusters. To use syncing, set the kubernetes_namespace in the config or use a cloud URI "
"as the output directory."
)
return None
@lru_cache(maxsize=1)
def _get_kubernetes_node_address_by_ip(self) -> Callable:
"""Returns a method to get the node name by IP address within a K8s cluster."""
assert self.kubernetes_namespace is not None
from ray.tune.integration.kubernetes import KubernetesSyncer
# Initialized with null local and remote directories as we only need to use get_node_address_by_ip.
kubernetes_syncer = KubernetesSyncer(None, None)
return kubernetes_syncer.get_node_address_by_ip
# For specified [stopped] trial, remove checkpoint marker on any partial checkpoints
@staticmethod
def _remove_partial_checkpoints(trial_path: str):
marker_paths = glob.glob(os.path.join(glob.escape(trial_path), "checkpoint_*/.is_checkpoint"))
for marker_path in marker_paths:
chkpt_dir = os.path.dirname(marker_path)
metadata_file = glob.glob(os.path.join(glob.escape(chkpt_dir), "*.tune_metadata"))
# glob.glob: filenames starting with a dot are special cases
# that are not matched by '*' and '?' patterns.
metadata_file += glob.glob(os.path.join(glob.escape(chkpt_dir), ".tune_metadata"))
metadata_file = list(set(metadata_file)) # avoid duplication
if len(metadata_file) < 1:
# Remove checkpoint marker on incomplete directory
os.remove(marker_path)
@contextlib.contextmanager
def _get_best_model_path(self, trial_or_path, analysis: ExperimentAnalysis) -> str:
# Accept either a Trial object or a path string
from ray.tune.experiment.trial import Trial
if isinstance(trial_or_path, str):
trial_path = trial_or_path
else:
trial_path = trial_or_path.local_path
remote_checkpoint_dir = self._get_remote_checkpoint_dir(Path(trial_path))
if remote_checkpoint_dir is not None and self.sync_client is not None:
self.sync_client.sync_down(remote_checkpoint_dir, trial_path)
self.sync_client.wait_or_retry()
self._remove_partial_checkpoints(trial_path) # needed by get_best_checkpoint
# get_best_checkpoint requires a Trial object in Ray 2.x
if isinstance(trial_or_path, Trial):
trial = trial_or_path
else:
# Try to find the trial by matching its path
trial = None
for t in analysis.trials:
if t.local_path and t.local_path.rstrip("/") == trial_path.rstrip("/"):
trial = t
break
try:
if trial is not None:
checkpoint = analysis.get_best_checkpoint(trial)
else:
checkpoint = None
except Exception:
logger.warning(
f"Cannot get best model path for {trial_path} due to exception below:" f"\n{traceback.format_exc()}"
)
yield None
return
if checkpoint is not None:
with checkpoint.as_directory() as path:
yield path
else:
yield checkpoint
@staticmethod
def _evaluate_best_model(
trial,
trial_path,
best_model_path,
dataset,
data_format,
skip_save_unprocessed_output,
skip_save_predictions,
skip_save_eval_stats,
gpus,
gpu_memory_limit,
allow_parallel_threads,
backend,
debug,
):
model_path = os.path.join(best_model_path, "model")
if not os.path.isdir(model_path):
logger.warning(
f"Best model path {model_path} does not exist or is incomplete. "
"This can happen when time budget expires mid-checkpoint. Skipping evaluation."
)
return
best_model = LudwigModel.load(
model_path,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
from_checkpoint=True,
)
if best_model.config[TRAINER]["eval_batch_size"]:
batch_size = best_model.config[TRAINER]["eval_batch_size"]
else:
batch_size = best_model.config[TRAINER]["batch_size"]
try:
eval_stats, _, _ = best_model.evaluate(
dataset=dataset,
data_format=data_format,
batch_size=batch_size,
output_directory=trial_path,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
collect_predictions=False,
collect_overall_stats=True,
return_type="dict",
debug=debug,
)
trial["eval_stats"] = json.dumps(eval_stats, cls=NumpyEncoder)
except NotImplementedError:
logger.warning(
"Skipping evaluation as the necessary methods are not "
"supported. Full exception below:\n"
f"{traceback.format_exc()}"
)
def _run_experiment(
self,
config,
checkpoint_dir,
hyperopt_dict,
decode_ctx,
is_using_ray_backend=False,
):
# Ray Tune redirects stdout/stderr through a Tee object that may not
# implement isatty(), which ray.data's progress bar code requires.
# Patch it to avoid AttributeError.
for stream in (sys.stdout, sys.stderr):
if not hasattr(stream, "isatty"):
stream.isatty = lambda: False
for gpu_id in ray.get_gpu_ids():
# Previous trial may not have freed its memory yet, so wait to avoid OOM
wait_for_gpu(gpu_id)
# Some config values may be JSON encoded as strings, so decode them here
config = self.decode_values(config, decode_ctx)
# Remove mlflow injected config parameters: https://github.com/ludwig-ai/ludwig/issues/2288
if "mlflow" in config:
del config["mlflow"]
trial_id = tune.get_context().get_trial_id()
trial_dir = Path(tune.get_context().get_trial_dir())
modified_config = substitute_parameters(copy.deepcopy(hyperopt_dict["config"]), config)
modified_config = merge_with_defaults(modified_config)
hyperopt_dict["config"] = modified_config
hyperopt_dict["experiment_name"] = f'{hyperopt_dict["experiment_name"]}_{trial_id}'
hyperopt_dict["output_directory"] = str(trial_dir)
tune_executor = self
if is_using_ray_backend:
ray_queue = RayQueue(actor_options={"num_cpus": 0})
else:
ray_queue = None
def report(progress_tracker, save_path=None):
# The progress tracker's metrics are nested dictionaries of TrainerMetrics: feature_name -> metric_name ->
# List[TrainerMetric], with one entry per training checkpoint, according to steps_per_checkpoint.
# We reduce the dictionary of TrainerMetrics to a simple list of floats for interfacing with Ray Tune.
train_stats = {
TRAINING: metric_utils.reduce_trainer_metrics_dict(progress_tracker.train_metrics),
VALIDATION: metric_utils.reduce_trainer_metrics_dict(progress_tracker.validation_metrics),
TEST: metric_utils.reduce_trainer_metrics_dict(progress_tracker.test_metrics),
}
metric_score = tune_executor.get_metric_score(train_stats)
report_kwargs = {
"metrics": {
"parameters": json.dumps(config, cls=NumpyEncoder),
"metric_score": metric_score,
"training_stats": json.dumps(train_stats, cls=NumpyEncoder),
"eval_stats": "{}",
"trial_id": tune.get_context().get_trial_id(),
"trial_dir": str(tune.get_context().get_trial_dir()),
}
}
if save_path is not None:
report_kwargs["checkpoint"] = _create_tune_checkpoint(save_path)
tune.report(**report_kwargs)
class RayTuneReportCallback(Callback):
def __init__(self):
super().__init__()
self.last_steps = 0
self.resume_ckpt_dir = None
def _get_remote_checkpoint_dir(self) -> str | tuple[str, str] | None:
# sync client has to be recreated to avoid issues with serialization
return tune_executor._get_remote_checkpoint_dir(trial_dir)
def _checkpoint_progress(self, trainer, progress_tracker, save_path) -> None:
"""Checkpoints the progress tracker."""
if is_using_ray_backend:
# Pass the save_path directly through the queue. On single-node clusters,
# the trial driver and training workers share the same filesystem.
# For multi-node, the checkpoint should be on shared storage.
ray_queue.put((progress_tracker, save_path))
return
# For non-Ray backend, report metrics + checkpoint together
report(progress_tracker, save_path=save_path)
def on_train_start(self, model, config: dict[str, Any], config_fp: str | None):
if is_using_ray_backend and checkpoint_dir:
# Store the checkpoint directory path for syncing to the trainer worker.
self.resume_ckpt_dir = checkpoint_dir
def on_trainer_train_setup(self, trainer, save_path, is_coordinator):
# Check local rank before manipulating files, as otherwise there will be a race condition
# between multiple workers running on the same node.
if self.resume_ckpt_dir is not None and trainer.local_rank == 0:
# Resume from a previous checkpoint by syncing files from the checkpoint
# directory to the save_path.
ckpt_path = self.resume_ckpt_dir
# Attempt an atomic move from the ckpt_path to the save_path
# This may first require removing the existing save_path
tmp_path = save_path + ".tmp"
if os.path.exists(save_path):
os.rename(save_path, tmp_path)
try:
model_path = os.path.join(ckpt_path, "model")
if os.path.exists(model_path):
safe_move_file(model_path, save_path)
elif os.path.exists(ckpt_path):
safe_move_file(ckpt_path, save_path)
except Exception:
# Rollback from partial changes. Remove the save_path
# and move the original save_path back.
if os.path.exists(save_path):
shutil.rmtree(save_path)
if os.path.exists(tmp_path):
os.rename(tmp_path, save_path)
raise
# Cleanup the backup save_path as it's no longer needed
if os.path.exists(tmp_path):
shutil.rmtree(tmp_path)
# Sync all workers here before continuing to training
trainer.barrier()
def on_eval_end(self, trainer, progress_tracker, save_path):
progress_tracker.tune_checkpoint_num += 1
self.last_steps = progress_tracker.steps
self._checkpoint_progress(trainer, progress_tracker, save_path)
def on_trainer_train_teardown(self, trainer, progress_tracker, save_path, is_coordinator):
if is_coordinator and progress_tracker.steps > self.last_steps:
# Note: Calling tune.report in both on_eval_end() and here can cause multiprocessing issues
# for some Ray samplers if no steps have happened since the last eval.
self._checkpoint_progress(trainer, progress_tracker, save_path)
callbacks = hyperopt_dict.get("callbacks") or []
hyperopt_dict["callbacks"] = callbacks + [RayTuneReportCallback()]
# set tune resources
if is_using_ray_backend:
resources = tune.get_context().get_trial_resources()
# check if we are using at least 1 gpu per trial
use_gpu = bool(self._gpu_resources_per_trial_non_none)
# get the resources assigned to the current trial
num_gpus = resources.required_resources.get("GPU", 0)
num_cpus = resources.required_resources.get("CPU", 1) if num_gpus == 0 else 0
distributed_kwargs = {
"num_workers": int(num_gpus) if use_gpu else 1,
"use_gpu": use_gpu,
"resources_per_worker": {
"CPU": num_cpus,
"GPU": 1 if use_gpu else 0,
},
}
hyperopt_dict["backend"].set_distributed_kwargs(**distributed_kwargs)
logger.debug(f"Trial distributed kwargs: {distributed_kwargs}")
stats = []
thread_error = [None] # Use list to allow mutation from nested function
def _run():
try:
train_stats, eval_stats = run_experiment(
**hyperopt_dict,
model_resume_path=checkpoint_dir,
parameters=config,
)
stats.append((train_stats, eval_stats))
except Exception as e:
thread_error[0] = e
logger.error(f"Error in hyperopt trial thread: {e}")
if is_using_ray_backend:
# We have to pull the results to the trial actor
# from worker actors, as the Tune session is running
# only on the trial actor
thread = threading.Thread(target=_run)
thread.daemon = True
thread.start()
def check_queue():
qsize = ray_queue.qsize()
if qsize:
results = ray_queue.get_nowait_batch(qsize)
for progress_tracker, save_path in results:
report(progress_tracker, save_path=save_path)
while thread.is_alive():
thread.join(timeout=0)
check_queue()
time.sleep(0.1)
thread.join()
check_queue()
else:
# remove threading overhead
_run()
if thread_error[0] is not None:
raise RuntimeError(f"Experiment failed: {thread_error[0]}") from thread_error[0]
if not stats:
raise RuntimeError("Experiment did not complete.")
train_stats, eval_stats = stats.pop()
metric_score = self.get_metric_score(train_stats)
tune.report(
metrics={
"parameters": json.dumps(config, cls=NumpyEncoder),
"metric_score": metric_score,
"training_stats": json.dumps(train_stats, cls=NumpyEncoder),
"eval_stats": json.dumps(eval_stats, cls=NumpyEncoder),
"trial_id": tune.get_context().get_trial_id(),
"trial_dir": str(tune.get_context().get_trial_dir()),
}
)
def execute(
self,
config,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
experiment_name="hyperopt",
model_name="run",
resume=None,
skip_save_training_description=False,
skip_save_training_statistics=False,
skip_save_model=False,
skip_save_progress=False,
skip_save_log=False,
skip_save_processed_input=True,
skip_save_unprocessed_output=False,
skip_save_predictions=False,
skip_save_eval_stats=False,
output_directory="results",
gpus=None,
gpu_memory_limit=None,
allow_parallel_threads=True,
callbacks=None,
tune_callbacks=None,
backend=None,
random_seed=default_random_seed,
debug=False,
hyperopt_log_verbosity=3,
**kwargs,
) -> HyperoptResults:
if isinstance(dataset, str) and not has_remote_protocol(dataset) and not os.path.isabs(dataset):
dataset = os.path.abspath(dataset)
# Ray Tune / PyArrow requires absolute paths or URIs for storage_path
if not has_remote_protocol(output_directory) and not os.path.isabs(output_directory):
output_directory = os.path.abspath(output_directory)
if isinstance(backend, str):
backend = initialize_backend(backend)
if gpus is not None:
raise ValueError(
"Parameter `gpus` is not supported when using Ray Tune. "
"Configure GPU resources with Ray and set `gpu_resources_per_trial` in your "
"hyperopt config."
)
if gpu_memory_limit is None and 0 < self._gpu_resources_per_trial_non_none < 1:
# Enforce fractional GPU utilization
gpu_memory_limit = self.gpu_resources_per_trial
hyperopt_dict = dict(
config=config,
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
eval_split=self.split,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
output_directory=output_directory,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
backend=backend,
random_seed=random_seed,
debug=debug,
)
mode = "min" if self.goal != MAXIMIZE else "max"
metric = "metric_score"
# if random seed not set, use Ludwig seed
self.search_algorithm.check_for_random_seed(random_seed)
if self.search_algorithm.search_alg_dict is not None:
if TYPE not in self.search_algorithm.search_alg_dict:
candidate_search_algs = list(SEARCH_ALG_IMPORT.keys())
logger.warning(
"WARNING: search_alg type parameter missing, using 'variant_generator' as default. "
f"These are possible values for the type parameter: {candidate_search_algs}."
)
search_alg = None
else:
search_alg_type = self.search_algorithm.search_alg_dict[TYPE]
search_alg = tune.create_searcher(
search_alg_type, metric=metric, mode=mode, **self.search_algorithm.search_alg_dict
)
else:
search_alg = None
if self.max_concurrent_trials:
assert (
self.max_concurrent_trials > 0
), f"`max_concurrent_trials` must be greater than 0, got {self.max_concurrent_trials}"
if isinstance(search_alg, BasicVariantGenerator) or search_alg is None:
search_alg = BasicVariantGenerator(max_concurrent=self.max_concurrent_trials)
elif isinstance(search_alg, ConcurrencyLimiter):
raise ValueError(
"You have specified `max_concurrent_trials`, but the search "
"algorithm is already a `ConcurrencyLimiter`. FIX THIS "
"by setting `max_concurrent_trials=None`."
)
else:
search_alg = ConcurrencyLimiter(search_alg, max_concurrent=self.max_concurrent_trials)
resources_per_trial = {
"cpu": self._cpu_resources_per_trial_non_none,
"gpu": self._gpu_resources_per_trial_non_none,
}
def run_experiment_trial(config, local_hyperopt_dict, checkpoint_dir=None):
return self._run_experiment(
config,
checkpoint_dir,
local_hyperopt_dict,
self.decode_ctx,
_is_ray_backend(backend),
)
tune_config = {}
_tune_callbacks = list(tune_callbacks or [])
for callback in callbacks or []:
run_experiment_trial, tune_config = callback.prepare_ray_tune(
run_experiment_trial,
tune_config,
_tune_callbacks,
)
tune_callbacks = _tune_callbacks
if _is_ray_backend(backend):
# for now, we do not do distributed training on cpu (until spread scheduling is implemented for Ray Train)
# but we do want to enable it when GPUs are specified
resources_per_trial = PlacementGroupFactory(
[{}] + ([{"CPU": 0, "GPU": 1}] * self._gpu_resources_per_trial_non_none)
if self._gpu_resources_per_trial_non_none
else [{}] + [{"CPU": self._cpu_resources_per_trial_non_none}]
)
if has_remote_protocol(output_directory):
# In Ray 2.x, remote storage is handled via RunConfig storage_path
self.sync_config = tune.SyncConfig()
self.sync_client = None
# output_directory will be used as storage_path
elif self.kubernetes_namespace:
logger.warning(
"Kubernetes-specific syncing is no longer supported in Ray 2.x. "
"Use cloud storage (S3, GCS) as the output directory instead."
)
run_experiment_trial_params = tune.with_parameters(run_experiment_trial, local_hyperopt_dict=hyperopt_dict)
@ray.remote
def _register(name, trainable):
register_trainable(name, trainable)
ray.get(_register.remote(f"trainable_func_f{hash_dict(config).decode('ascii')}", run_experiment_trial_params))
# Note that resume="AUTO" will attempt to resume the experiment if possible, and
# otherwise will start a new experiment:
# https://docs.ray.io/en/latest/tune/tutorials/tune-stopping.html
should_resume = "AUTO" if resume is None else resume
# If the output directory is an S3 path and AWS_ENDPOINT_URL is set,
# configure a custom S3 filesystem for Ray Tune. We use fsspec's s3fs
# wrapped in PyArrow's FSSpecHandler because PyArrow's native S3 C++
# client doesn't read AWS_ENDPOINT_URL and its chunked transfer encoding
# is incompatible with some S3-compatible stores (e.g. MinIO).
storage_filesystem = None
if output_directory and str(output_directory).startswith("s3://"):
endpoint_url = os.environ.get("AWS_ENDPOINT_URL")
if endpoint_url:
import pyarrow.fs
import s3fs
s3 = s3fs.S3FileSystem(
endpoint_url=endpoint_url,
key=os.environ.get("AWS_ACCESS_KEY_ID"),
secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)
storage_filesystem = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3))
# When storage_filesystem is set, storage_path must be a plain
# path (bucket/key...), not a URI (s3://bucket/key...).
output_directory = str(output_directory).removeprefix("s3://")
try:
analysis = tune.run(
f"trainable_func_f{hash_dict(config).decode('ascii')}",
name=experiment_name,
config={
**self.search_space,
**tune_config,
},
scheduler=self.scheduler,
search_alg=search_alg,
num_samples=self.num_samples,
checkpoint_config=tune.CheckpointConfig(num_to_keep=1),
max_failures=1, # retry a trial failure once
resources_per_trial=resources_per_trial,
time_budget_s=self.time_budget_s,
sync_config=self.sync_config,
storage_path=output_directory,
storage_filesystem=storage_filesystem,
metric=metric,
mode=mode,
trial_name_creator=lambda trial: f"trial_{trial.trial_id}",
trial_dirname_creator=lambda trial: f"trial_{trial.trial_id}",
callbacks=tune_callbacks,
stop=CallbackStopper(callbacks),
verbose=hyperopt_log_verbosity,
resume=should_resume,
log_to_file=True,
)
except Exception as e:
# Explicitly raise a RuntimeError if an error is encountered during a Ray trial.
# NOTE: Cascading the exception with "raise _ from e" still results in hanging.
raise RuntimeError(f"Encountered Ray Tune error: {e}")
if "metric_score" in analysis.results_df.columns:
ordered_trials = analysis.results_df.sort_values("metric_score", ascending=self.goal != MAXIMIZE)
# Catch nans in edge case where the trial doesn't complete
temp_ordered_trials = []
for kwargs in ordered_trials.to_dict(orient="records"):
for key in ["parameters", "training_stats", "eval_stats"]:
if isinstance(kwargs[key], float):
kwargs[key] = {}
temp_ordered_trials.append(kwargs)
# Trials w/empty eval_stats fields & non-empty training_stats fields ran intermediate
# tune.report call(s) but were terminated before reporting eval_stats from post-train
# evaluation (e.g., trial stopped due to time budget or relatively poor performance.)
# For any such trials, run model evaluation for the best model in that trial & record
# results in ordered_trials which is returned & is persisted in hyperopt_statistics.json.
for trial in temp_ordered_trials:
if trial["eval_stats"] == "{}" and trial["training_stats"] != "{}":
# Evaluate the best model on the eval_split, which is validation_set
if validation_set is not None and validation_set.size > 0:
trial_path = trial["trial_dir"]
with self._get_best_model_path(trial_path, analysis) as best_model_path:
if best_model_path is not None:
try:
self._evaluate_best_model(
trial,
trial_path,
best_model_path,
validation_set,
data_format,
skip_save_unprocessed_output,
skip_save_predictions,
skip_save_eval_stats,
gpus,
gpu_memory_limit,
allow_parallel_threads,
backend,
debug,
)
except Exception:
logger.warning(
f"Failed to evaluate best model for trial {trial_path}. "
"This can happen with incomplete checkpoints from early stopping. "
f"Full exception:\n{traceback.format_exc()}"
)
else:
logger.warning("Skipping evaluation as no model checkpoints were available")
else:
logger.warning("Skipping evaluation as no validation set was provided")
ordered_trials = [TrialResults.from_dict(load_json_values(kwargs)) for kwargs in temp_ordered_trials]
else:
logger.warning("No trials reported results; check if time budget lower than epoch latency")
ordered_trials = []
return HyperoptResults(ordered_trials=ordered_trials, experiment_analysis=analysis)
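# Illustrative sketch, not part of Ludwig: the Ray-backend branch of `_run_experiment` above runs
# training in a daemon thread, polls a queue from the trial actor while the thread is alive, and drains
# the queue once more after the thread exits. The standalone version below mirrors that loop using only
# the standard library. `queue.Queue` stands in for `RayQueue`, and `drain`, `run_and_forward`, and
# `example_worker` are hypothetical names introduced for this example.

```python
import queue
import threading
import time


def drain(q, handle):
    """Forward every item currently in the queue to `handle` without blocking."""
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return
        handle(item)


def run_and_forward(worker, handle, poll_interval=0.01):
    """Run `worker(q)` in a background thread and forward queued items from the calling thread."""
    q = queue.Queue()
    error = [None]  # a list so the nested function can record the exception

    def _run():
        try:
            worker(q)
        except Exception as e:
            error[0] = e

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    while thread.is_alive():
        drain(q, handle)
        time.sleep(poll_interval)
    thread.join()
    drain(q, handle)  # final drain: picks up items enqueued just before the thread exited
    if error[0] is not None:
        raise RuntimeError(f"Worker failed: {error[0]}") from error[0]


def example_worker(q):
    # Stand-in for a training run that emits one progress item per checkpoint.
    for step in range(3):
        q.put(step)


received = []
run_and_forward(example_worker, received.append)
```

# The final `drain` call after `thread.join()` matters: without it, items placed on the queue between
# the last poll and thread exit would never be forwarded.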
class CallbackStopper(Stopper):
"""Ray Tune Stopper that triggers the entire job to stop if one callback returns True."""
def __init__(self, callbacks: list[Callback] | None):
self.callbacks = callbacks or []
def __call__(self, trial_id, result):
return False
def stop_all(self):
for callback in self.callbacks:
if callback.should_stop_hyperopt():
return True
return False
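# Illustrative sketch, not part of Ludwig: `CallbackStopper` never stops an individual trial
# (`__call__` always returns False) but ends the whole experiment once any callback asks for it.
# The version below reproduces that contract without importing Ray. `FakeCallback` and
# `SketchStopper` are hypothetical names, and the real class derives from `ray.tune.Stopper`.

```python
class FakeCallback:
    """Hypothetical stand-in for a Ludwig Callback that can request a global stop."""

    def __init__(self, stop):
        self.stop = stop

    def should_stop_hyperopt(self):
        return self.stop


class SketchStopper:
    """Same shape as CallbackStopper, minus the ray.tune.Stopper base class."""

    def __init__(self, callbacks=None):
        self.callbacks = callbacks or []

    def __call__(self, trial_id, result):
        # Per-trial decision: never stop a single trial from here.
        return False

    def stop_all(self):
        # Global decision: stop everything as soon as one callback requests it.
        return any(cb.should_stop_hyperopt() for cb in self.callbacks)


quiet = SketchStopper([FakeCallback(False), FakeCallback(False)])
halting = SketchStopper([FakeCallback(False), FakeCallback(True)])
```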
def get_build_hyperopt_executor(executor_type):
return get_from_registry(executor_type, executor_registry)
executor_registry = {"ray": RayTuneExecutor}
def set_values(params: dict[str, Any], model_dict: dict[str, Any]):
for key, value in params.items():
if isinstance(value, dict):
for sub_key, sub_value in value.items():
if key not in model_dict:
model_dict[key] = dict()
model_dict[key][sub_key] = sub_value
else:
model_dict[key] = value
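# Illustrative usage sketch for `set_values`: dict-valued entries are merged one level deep into
# `model_dict`, while non-dict values overwrite the key directly. The function is repeated here with
# its indentation restored so the example runs on its own; the nesting shown is an assumption, since
# this listing has lost its indentation. The sample keys and values are made up for the example.

```python
from typing import Any


def set_values(params: dict[str, Any], model_dict: dict[str, Any]):
    for key, value in params.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                if key not in model_dict:
                    model_dict[key] = dict()
                model_dict[key][sub_key] = sub_value
        else:
            model_dict[key] = value


model_dict = {"trainer": {"learning_rate": 0.001, "batch_size": 128}}
set_values(
    {"trainer": {"learning_rate": 0.01}, "combiner": {"type": "concat"}, "seed": 7},
    model_dict,
)
```

# After the call, `trainer.learning_rate` is updated, `trainer.batch_size` is left untouched,
# and `combiner` and `seed` are added.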
def run_experiment(
config,
parameters=None,
dataset=None,
training_set=None,
validation_set=None,
test_set=None,
training_set_metadata=None,
data_format=None,
experiment_name="hyperopt",
model_name="run",
model_resume_path=None,
eval_split=VALIDATION,
skip_save_training_description=False,
skip_save_training_statistics=False,
skip_save_model=False,
skip_save_progress=False,
skip_save_log=False,
skip_save_processed_input=False,
skip_save_unprocessed_output=False,
skip_save_predictions=False,
skip_save_eval_stats=False,
output_directory="results",
gpus=None,
gpu_memory_limit=None,
allow_parallel_threads=True,
callbacks=None,
backend=None,
random_seed=default_random_seed,
debug=False,
**kwargs,
):
for callback in callbacks or []:
callback.on_hyperopt_trial_start(parameters)
# Train the model and collect its training and evaluation statistics
model = LudwigModel(
config=config,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
eval_stats, train_stats, _, _ = model.experiment(
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
model_resume_path=model_resume_path,
eval_split=eval_split,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
output_directory=output_directory,
skip_collect_predictions=True,
skip_collect_overall_stats=False,
random_seed=random_seed,
debug=debug,
)
for callback in callbacks or []:
callback.on_hyperopt_trial_end(parameters)
return train_stats, eval_stats
def _run_experiment_unary(kwargs):
"""Unary function is needed by Fiber to map a list of args."""
return run_experiment(**kwargs)
================================================
FILE: ludwig/hyperopt/results.py
================================================
# !/usr/bin/env python
from dataclasses import dataclass
from typing import Any
from dataclasses_json import dataclass_json
try:
from ray.tune import ExperimentAnalysis
except ImportError:
ExperimentAnalysis = Any
@dataclass_json
@dataclass
class TrialResults:
parameters: dict
metric_score: float
training_stats: dict
eval_stats: dict
@dataclass
class HyperoptResults:
ordered_trials: list[TrialResults]
experiment_analysis: ExperimentAnalysis
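# Illustrative sketch, not part of Ludwig: `HyperoptResults.ordered_trials` is meant to list the best
# trial first. The executor sorts on `metric_score`, ascending when minimizing and descending when
# maximizing. The example below shows that ordering with plain dataclasses. `SketchTrial` and
# `order_trials` are hypothetical names, and the goal strings "minimize" / "maximize" are assumed values.

```python
from dataclasses import dataclass


@dataclass
class SketchTrial:
    """Hypothetical, reduced version of TrialResults."""

    parameters: dict
    metric_score: float


def order_trials(trials, goal):
    """Return trials best-first: lowest score first when minimizing, highest first when maximizing."""
    return sorted(trials, key=lambda t: t.metric_score, reverse=(goal == "maximize"))


trials = [
    SketchTrial({"lr": 0.1}, 0.7),
    SketchTrial({"lr": 0.01}, 0.4),
    SketchTrial({"lr": 0.001}, 0.9),
]
by_loss = order_trials(trials, "minimize")
by_accuracy = order_trials(trials, "maximize")
```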
================================================
FILE: ludwig/hyperopt/run.py
================================================
import copy
import logging
import os
from pprint import pformat
import pandas as pd
import yaml
from tabulate import tabulate
from ludwig.api import LudwigModel
from ludwig.backend import Backend, initialize_backend, LocalBackend
from ludwig.callbacks import Callback
from ludwig.constants import (
AUTO,
COMBINED,
EXECUTOR,
GOAL,
HYPEROPT,
LOSS,
MAX_CONCURRENT_TRIALS,
METRIC,
NAME,
OUTPUT_FEATURES,
PARAMETERS,
PREPROCESSING,
SEARCH_ALG,
SPLIT,
TEST,
TRAINING,
TYPE,
VALIDATION,
)
from ludwig.data.split import get_splitter
from ludwig.hyperopt.results import HyperoptResults
from ludwig.hyperopt.utils import (
log_warning_if_all_grid_type_parameters,
print_hyperopt_results,
save_hyperopt_stats,
should_tune_preprocessing,
update_hyperopt_params_with_defaults,
)
from ludwig.schema.model_config import ModelConfig
from ludwig.utils.backward_compatibility import upgrade_config_dict_to_latest_version
from ludwig.utils.dataset_utils import generate_dataset_statistics
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import makedirs, open_file
try:
from ray.tune import Callback as TuneCallback
from ludwig.backend.ray import RayBackend
except ImportError:
TuneCallback = object
class RayBackend:
pass
logger = logging.getLogger(__name__)
def hyperopt(
config: str | dict,
dataset: str | dict | pd.DataFrame = None,
training_set: str | dict | pd.DataFrame = None,
validation_set: str | dict | pd.DataFrame = None,
test_set: str | dict | pd.DataFrame = None,
training_set_metadata: str | dict = None,
data_format: str = None,
experiment_name: str = "hyperopt",
model_name: str = "run",
resume: bool | None = None,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = True,
skip_save_unprocessed_output: bool = False,
skip_save_predictions: bool = False,
skip_save_eval_stats: bool = False,
skip_save_hyperopt_statistics: bool = False,
output_directory: str = "results",
gpus: str | int | list[int] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
tune_callbacks: list[TuneCallback] = None,
backend: Backend | str = None,
random_seed: int = default_random_seed,
hyperopt_log_verbosity: int = 3,
**kwargs,
) -> HyperoptResults:
"""Performs hyperparameter optimization.
# Inputs
:param config: (Union[str, dict]) config which defines
the different parameters of the model, features, preprocessing and
training. If `str`, filepath to yaml configuration file.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the experiment.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'df'`, `'dict'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param experiment_name: (str, default: `'hyperopt'`) name for
the experiment.
:param model_name: (str, default: `'run'`) name of the model that is
being used.
:param resume: (bool) If true, continue hyperopt from the state of the previous
run in the output directory with the same experiment name. If false, will create
new trials, ignoring any previous state, even if they exist in the output_directory.
By default, will attempt to resume if there is already an existing experiment with
the same name, and will create new trials if not.
:param skip_save_training_description: (bool, default: `False`) disables
saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`) disables
saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch to enable resuming of training, but if
the model is really big that can be time consuming and will use
twice as much space. Use this parameter to skip it, but training
cannot be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving
TensorBoard logs. By default Ludwig saves logs for the TensorBoard,
but if it is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `True`) if an input
dataset is provided, it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param skip_save_unprocessed_output: (bool, default: `False`) by default
predictions and their probabilities are saved in both raw
unprocessed numpy files containing tensors and as postprocessed
CSV files (one for each output feature). If this parameter is True,
only the CSV ones are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files.
:param skip_save_eval_stats: (bool, default: `False`) skips saving test
statistics JSON file.
:param skip_save_hyperopt_statistics: (bool, default: `False`) skips saving
hyperopt stats file.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param tune_callbacks: (list, default: `None`) a list of
`ray.tune.Callback` objects that are passed through to Ray Tune.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param random_seed: (int, default: 42) random seed used for weights
initialization, splits and any other random function.
:param hyperopt_log_verbosity: (int, default: 3) controls verbosity of
Ray Tune log messages. Valid values: 0 = silent, 1 = only status updates,
2 = status and brief trial results, 3 = status and detailed trial results.
# Return
:return: (HyperoptResults) results containing the trials ordered from
best to worst on the target metric, and the Ray Tune experiment analysis.
"""
from ludwig.hyperopt.execution import get_build_hyperopt_executor, RayTuneExecutor
# check if config is a path or a dict
if isinstance(config, str): # assume path
with open_file(config, "r") as def_file:
config_dict = yaml.safe_load(def_file)
else:
config_dict = config
if HYPEROPT not in config_dict:
raise ValueError("Hyperopt Section not present in config")
# backwards compatibility
upgraded_config = upgrade_config_dict_to_latest_version(config_dict)
# Initialize config object
config_obj = ModelConfig.from_dict(upgraded_config)
# Retain pre-merged config for hyperopt schema generation
premerged_config = copy.deepcopy(upgraded_config)
# Get full config with defaults
full_config = config_obj.to_dict() # TODO (Connor): Refactor to use config object
hyperopt_config = full_config[HYPEROPT]
# Explicitly default to a local backend to avoid picking up Ray
# backend from the environment.
backend = backend or config_dict.get("backend") or "local"
backend = initialize_backend(backend)
update_hyperopt_params_with_defaults(hyperopt_config)
# Check if all features are grid type parameters and log UserWarning if needed
log_warning_if_all_grid_type_parameters(hyperopt_config)
# Infer max concurrent trials
if hyperopt_config[EXECUTOR].get(MAX_CONCURRENT_TRIALS) == AUTO:
hyperopt_config[EXECUTOR][MAX_CONCURRENT_TRIALS] = backend.max_concurrent_trials(hyperopt_config)
logger.info(f"Setting max_concurrent_trials to {hyperopt_config[EXECUTOR][MAX_CONCURRENT_TRIALS]}")
# Print hyperopt config
logger.info("Hyperopt Config")
logger.info(pformat(hyperopt_config, indent=4))
logger.info("\n")
search_alg = hyperopt_config[SEARCH_ALG]
executor = hyperopt_config[EXECUTOR]
parameters = hyperopt_config[PARAMETERS]
split = hyperopt_config[SPLIT]
output_feature = hyperopt_config["output_feature"]
metric = hyperopt_config[METRIC]
goal = hyperopt_config[GOAL]
######################
# check validity of output_feature / metric/ split combination
######################
splitter = get_splitter(**full_config[PREPROCESSING]["split"])
if split == TRAINING:
if training_set is None and not splitter.has_split(0):
raise ValueError(
'The data for the specified split for hyperopt "{}" '
"was not provided, "
"or the split amount specified in the preprocessing section "
"of the config is not greater than 0".format(split)
)
elif split == VALIDATION:
if validation_set is None and not splitter.has_split(1):
raise ValueError(
'The data for the specified split for hyperopt "{}" '
"was not provided, "
"or the split amount specified in the preprocessing section "
"of the config is not greater than 0".format(split)
)
elif split == TEST:
if test_set is None and not splitter.has_split(2):
raise ValueError(
'The data for the specified split for hyperopt "{}" '
"was not provided, "
"or the split amount specified in the preprocessing section "
"of the config is not greater than 0".format(split)
)
else:
raise ValueError(
'unrecognized hyperopt split "{}". Please provide one of: {}'.format(split, {TRAINING, VALIDATION, TEST})
)
if output_feature == COMBINED:
if metric != LOSS:
raise ValueError('The only valid metric for "combined" output feature is "loss"')
else:
output_feature_names = {of[NAME] for of in full_config[OUTPUT_FEATURES]}
if output_feature not in output_feature_names:
raise ValueError(
'The output feature specified for hyperopt "{}" '
"cannot be found in the config. "
'Available ones are: {} and "combined"'.format(output_feature, output_feature_names)
)
hyperopt_executor = get_build_hyperopt_executor(executor[TYPE])(
parameters, output_feature, metric, goal, split, search_alg=search_alg, **executor
)
# Explicitly default to a local backend to avoid picking up Ray
# backend from the environment.
backend = backend or config_dict.get("backend") or "local"
backend = initialize_backend(backend)
if not (
isinstance(backend, LocalBackend)
or (isinstance(hyperopt_executor, RayTuneExecutor) and isinstance(backend, RayBackend))
):
raise ValueError(
"Hyperopt requires using a `local` backend at this time, or " "`ray` backend with `ray` executor."
)
for callback in callbacks or []:
callback.on_hyperopt_init(experiment_name)
if not should_tune_preprocessing(full_config):
# preprocessing is not being tuned, so generate it once before starting trials
for callback in callbacks or []:
callback.on_hyperopt_preprocessing_start(experiment_name)
model = LudwigModel(
config=full_config,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
training_set, validation_set, test_set, training_set_metadata = model.preprocess(
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
skip_save_processed_input=skip_save_processed_input,
random_seed=random_seed,
)
dataset = None
dataset_statistics = generate_dataset_statistics(training_set, validation_set, test_set)
logger.info("\nDataset Statistics")
logger.info(tabulate(dataset_statistics, headers="firstrow", tablefmt="fancy_grid"))
for callback in callbacks or []:
callback.on_hyperopt_preprocessing_end(experiment_name)
for callback in callbacks or []:
callback.on_hyperopt_start(experiment_name)
hyperopt_results = hyperopt_executor.execute(
premerged_config,
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
resume=resume,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
output_directory=output_directory,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
tune_callbacks=tune_callbacks,
backend=backend,
random_seed=random_seed,
hyperopt_log_verbosity=hyperopt_log_verbosity,
**kwargs,
)
if backend.is_coordinator():
print_hyperopt_results(hyperopt_results)
if not skip_save_hyperopt_statistics:
with backend.storage.artifacts.use_credentials():
results_directory = os.path.join(output_directory, experiment_name)
makedirs(results_directory, exist_ok=True)
hyperopt_stats = {
"hyperopt_config": hyperopt_config,
"hyperopt_results": [t.to_dict() for t in hyperopt_results.ordered_trials],
}
save_hyperopt_stats(hyperopt_stats, results_directory)
logger.info(f"Hyperopt stats saved to: {results_directory}")
for callback in callbacks or []:
callback.on_hyperopt_end(experiment_name)
callback.on_hyperopt_finish(experiment_name)
logger.info("Finished hyperopt")
return hyperopt_results
================================================
FILE: ludwig/hyperopt/search_algos.py
================================================
import logging
from abc import ABC
from importlib import import_module
from ludwig.constants import TYPE
from ludwig.utils.misc_utils import get_from_registry
logger = logging.getLogger(__name__)
def _is_package_installed(package_name: str, search_algo_name: str) -> bool:
try:
import_module(package_name)
return True
except ImportError:
raise ImportError(
f"Search algorithm {search_algo_name} requires package {package_name}, however package is not installed."
" Please refer to Ray Tune documentation for packages required for this search algorithm."
)
class SearchAlgorithm(ABC):
def __init__(self, search_alg_dict: dict) -> None:
self.search_alg_dict = search_alg_dict
self.random_seed_attribute_name = None
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
if self.random_seed_attribute_name not in self.search_alg_dict:
self.search_alg_dict[self.random_seed_attribute_name] = ludwig_random_seed
class BasicVariantSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "random_state"
class HyperoptSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("hyperopt", "hyperopt")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "random_state_seed"
class BOHBSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("hpbandster", "bohb")
_is_package_installed("ConfigSpace", "bohb")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "seed"
class AxSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("sqlalchemy", "ax")
_is_package_installed("ax", "ax")
super().__init__(search_alg_dict)
# override parent method, this search algorithm does not support
# setting random seed
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
pass
class BayesOptSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("bayes_opt", "bayesopt")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "random_state"
class BlendsearchSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("flaml", "blendsearch")
super().__init__(search_alg_dict)
# override parent method, this search algorithm does not support
# setting random seed
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
pass
class CFOSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("flaml", "cfo")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "seed"
# override parent method, this search algorithm does not support
# setting random seed
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
pass
class DragonflySA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("dragonfly", "dragonfly")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "random_state_seed"
class HEBOSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("hebo", "hebo")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "random_state_seed"
class SkoptSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("skopt", "skopt")
super().__init__(search_alg_dict)
# override parent method, this search algorithm does not support
# setting random seed
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
pass
class NevergradSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("nevergrad", "nevergrad")
super().__init__(search_alg_dict)
# override parent method, this search algorithm does not support
# setting random seed
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
pass
class OptunaSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("optuna", "optuna")
super().__init__(search_alg_dict)
self.random_seed_attribute_name = "seed"
class ZooptSA(SearchAlgorithm):
def __init__(self, search_alg_dict: dict) -> None:
_is_package_installed("zoopt", "zoopt")
super().__init__(search_alg_dict)
# override parent method, this search algorithm does not support
# setting random seed
def check_for_random_seed(self, ludwig_random_seed: int) -> None:
pass
def get_search_algorithm(search_algo):
search_algo_name = search_algo.get(TYPE, None)
return get_from_registry(search_algo_name, search_algo_registry)(search_algo)
search_algo_registry = {
None: BasicVariantSA,
"variant_generator": BasicVariantSA,
"random": BasicVariantSA,
"hyperopt": HyperoptSA,
"bohb": BOHBSA,
"ax": AxSA,
"bayesopt": BayesOptSA,
"blendsearch": BlendsearchSA,
"cfo": CFOSA,
"dragonfly": DragonflySA,
"hebo": HEBOSA,
"skopt": SkoptSA,
"nevergrad": NevergradSA,
"optuna": OptunaSA,
"zoopt": ZooptSA,
}
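# Illustrative sketch (not part of Ludwig): how `check_for_random_seed` fills in a seed
# only when the user has not already set one. `SketchSearchAlgorithm` and `fill_seed`
# are hypothetical names that mirror the base-class logic above so the snippet runs
# without Ludwig or Ray Tune installed.

```python
from __future__ import annotations


class SketchSearchAlgorithm:
    """Simplified stand-in for the SearchAlgorithm base class (hypothetical)."""

    def __init__(self, search_alg_dict: dict, seed_key: str | None) -> None:
        self.search_alg_dict = search_alg_dict
        self.random_seed_attribute_name = seed_key

    def check_for_random_seed(self, ludwig_random_seed: int) -> None:
        # Only fill in the seed when the user has not set one explicitly.
        if self.random_seed_attribute_name not in self.search_alg_dict:
            self.search_alg_dict[self.random_seed_attribute_name] = ludwig_random_seed


def fill_seed(search_alg: dict, seed_key: str, ludwig_random_seed: int) -> dict:
    # Work on a copy so the caller's dict is left untouched.
    algo = SketchSearchAlgorithm(dict(search_alg), seed_key)
    algo.check_for_random_seed(ludwig_random_seed)
    return algo.search_alg_dict


# No seed given: the Ludwig-level seed is used as the default.
defaulted = fill_seed({"type": "optuna"}, "seed", 42)
# Seed given by the user: it is preserved.
explicit = fill_seed({"type": "optuna", "seed": 7}, "seed", 42)
```

# Subclasses that override `check_for_random_seed` with `pass` skip this step entirely.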
================================================
FILE: ludwig/hyperopt/utils.py
================================================
import copy
import dataclasses
import json
import logging
import os
import warnings
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUTO,
COMBINED,
EXECUTOR,
GOAL,
GRID_SEARCH,
HYPEROPT,
INPUT_FEATURES,
LOSS,
MAX_CONCURRENT_TRIALS,
METRIC,
MINIMIZE,
NAME,
NUM_SAMPLES,
OUTPUT_FEATURES,
PARAMETERS,
PREPROCESSING,
RAY,
SPACE,
SPLIT,
TYPE,
VALIDATION,
)
from ludwig.globals import HYPEROPT_STATISTICS_FILE_NAME
from ludwig.hyperopt.results import HyperoptResults, TrialResults
from ludwig.types import HyperoptConfigDict, ModelConfigDict
from ludwig.utils.data_utils import save_json
from ludwig.utils.misc_utils import (
get_class_attributes,
get_from_registry,
merge_dict,
set_default_value,
set_default_values,
)
from ludwig.utils.print_utils import print_boxed
logger = logging.getLogger(__name__)
def print_hyperopt_results(hyperopt_results: HyperoptResults):
print_boxed("HYPEROPT RESULTS", print_fun=logger.info)
for trial_results in hyperopt_results.ordered_trials:
if not isinstance(trial_results.metric_score, str):
logger.info(f"score: {trial_results.metric_score:.6f} | parameters: {trial_results.parameters}")
logger.info("")
def save_hyperopt_stats(hyperopt_stats, hyperopt_dir_name):
hyperopt_stats_fn = os.path.join(hyperopt_dir_name, HYPEROPT_STATISTICS_FILE_NAME)
save_json(hyperopt_stats_fn, hyperopt_stats)
def load_json_value(v):
try:
return json.loads(v)
except Exception as e:
logger.warning(f"While loading json, encountered exception: {e}")
return v
# define set containing names to return for TrialResults
TRIAL_RESULTS_NAMES_SET = {f.name for f in dataclasses.fields(TrialResults)}
def load_json_values(d):
# ensure metric_score is a string for the json load to eliminate extraneous exception message
d["metric_score"] = str(d["metric_score"])
# load only data required for TrialResults
return {k: load_json_value(v) for k, v in d.items() if k in TRIAL_RESULTS_NAMES_SET}
def should_tune_preprocessing(config):
parameters = config[HYPEROPT][PARAMETERS]
for param_name in parameters.keys():
if f"{PREPROCESSING}." in param_name:
return True
return False
def parameter_to_dict(name, value):
if name == ".":
# A parameter name of "." refers to the top-level config
return value
parameter_dict = {}
curr_dict = parameter_dict
name_list = name.split(".")
for i, name_elem in enumerate(name_list):
if i == len(name_list) - 1:
curr_dict[name_elem] = value
else:
name_dict = curr_dict.get(name_elem, {})
curr_dict[name_elem] = name_dict
curr_dict = name_dict
return parameter_dict
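# Illustrative sketch: what `parameter_to_dict` produces for a dotted name. The function
# body is restated here only so the snippet runs on its own; the parameter names in the
# calls ("trainer.learning_rate", "combiner") are example values, not requirements.

```python
def parameter_to_dict(name, value):
    if name == ".":
        # "." addresses the top-level config, so the value is returned unchanged.
        return value
    parameter_dict = {}
    curr_dict = parameter_dict
    name_list = name.split(".")
    for i, name_elem in enumerate(name_list):
        if i == len(name_list) - 1:
            # Last path element: store the value itself.
            curr_dict[name_elem] = value
        else:
            # Intermediate path element: descend into a nested dict.
            name_dict = curr_dict.get(name_elem, {})
            curr_dict[name_elem] = name_dict
            curr_dict = name_dict
    return parameter_dict


nested = parameter_to_dict("trainer.learning_rate", 0.001)
top_level = parameter_to_dict(".", {"combiner": {"type": "concat"}})
```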
def feature_list_to_dict(config: ModelConfigDict) -> ModelConfigDict:
input_features_dict = {}
for feature in config[INPUT_FEATURES]:
input_features_dict[feature[NAME]] = feature
output_features_dict = {}
for feature in config[OUTPUT_FEATURES]:
output_features_dict[feature[NAME]] = feature
config = copy.copy(config)
config[INPUT_FEATURES] = input_features_dict
config[OUTPUT_FEATURES] = output_features_dict
return config
def feature_dict_to_list(config: ModelConfigDict) -> ModelConfigDict:
# This works because Python dicts are order-preserving, so we do not need to
# do anything special to map from a key in the dict to an index in a list
input_features_list = []
for feature in config[INPUT_FEATURES].values():
input_features_list.append(feature)
output_features_list = []
for feature in config[OUTPUT_FEATURES].values():
output_features_list.append(feature)
config = copy.copy(config)
config[INPUT_FEATURES] = input_features_list
config[OUTPUT_FEATURES] = output_features_list
return config
def substitute_parameters(
config: ModelConfigDict,
parameters: dict[str, Any],
):
"""Update Ludwig config with parameters sampled from the Hyperopt sampler."""
# Collect the sets of names for each feature grouping so we can map feature names to
# groups
input_feature_names = {feature[NAME] for feature in config[INPUT_FEATURES]}
output_feature_names = {feature[NAME] for feature in config[OUTPUT_FEATURES]}
# Features in the user config are provided as a list, but in hyperopt we reference
# features by name, so convert temporarily to a dict to simplify the merge process.
config = feature_list_to_dict(config)
# Merge parameters into the user configuration in order. As such, if there are conflicting
# params, the later params will take precedence.
for name, value in parameters.items():
# User params are provided as <feature_name>.<param>, but we group input / output features
# together during the merge to make it easier and unambiguous to convert back and forth
# TODO(travis): we should revisit the user format here, as it silently breaks situations
# where the user has a feature named "trainer", "combiner", etc.
prefix = name.split(".")[0]
if prefix in input_feature_names:
name = f"{INPUT_FEATURES}.{name}"
elif prefix in output_feature_names:
name = f"{OUTPUT_FEATURES}.{name}"
param_dict = parameter_to_dict(name, value)
config = merge_dict(config, param_dict)
# Now that all features have been merged, convert back to the original list format.
config = feature_dict_to_list(config)
return config
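# Illustrative sketch of the substitution flow above, written without Ludwig imports.
# `deep_merge` is a hypothetical stand-in for `merge_dict`, `to_nested` for
# `parameter_to_dict`, and the literal keys "input_features" / "output_features" / "name"
# are assumed values of the corresponding constants. The config and sampled parameters
# are made-up examples.

```python
import copy


def deep_merge(base: dict, update: dict) -> dict:
    """Recursively merge `update` into a deep copy of `base` (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def to_nested(name: str, value):
    """Turn "a.b.c" and a value into {"a": {"b": {"c": value}}}."""
    nested = value
    for part in reversed(name.split(".")):
        nested = {part: nested}
    return nested


def substitute_sketch(config: dict, parameters: dict) -> dict:
    input_names = {f["name"] for f in config["input_features"]}
    output_names = {f["name"] for f in config["output_features"]}
    # Key features by name so dotted parameter names can address them.
    merged = dict(config)
    merged["input_features"] = {f["name"]: f for f in config["input_features"]}
    merged["output_features"] = {f["name"]: f for f in config["output_features"]}
    for name, value in parameters.items():
        prefix = name.split(".")[0]
        if prefix in input_names:
            name = f"input_features.{name}"
        elif prefix in output_names:
            name = f"output_features.{name}"
        merged = deep_merge(merged, to_nested(name, value))
    # Convert back to the list format used in the user config.
    merged["input_features"] = list(merged["input_features"].values())
    merged["output_features"] = list(merged["output_features"].values())
    return merged


config = {
    "input_features": [{"name": "title", "type": "text"}],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {"learning_rate": 0.01},
}
sampled = {"title.encoder.type": "rnn", "trainer.learning_rate": 0.001}
result = substitute_sketch(config, sampled)
```

# Because parameters are merged in order, a later entry that targets the same key as an
# earlier one overwrites it.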
@DeveloperAPI
def get_num_duplicate_trials(hyperopt_config: HyperoptConfigDict) -> int:
"""Returns the number of duplicate trials that will be created.
Duplicate trials are only created when there are grid type parameters and num_samples > 1.
"""
num_samples = hyperopt_config[EXECUTOR].get(NUM_SAMPLES, 1)
if num_samples == 1:
return 0
total_grid_search_trials = 1
for _, param_info in hyperopt_config[PARAMETERS].items():
if param_info.get(SPACE, None) == GRID_SEARCH:
total_grid_search_trials *= len(param_info.get("values", []))
num_duplicate_trials = (total_grid_search_trials * num_samples) - total_grid_search_trials
return num_duplicate_trials
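# Worked example of the arithmetic above, as a standalone sketch. `count_duplicate_trials`
# is a hypothetical re-statement of the function, and the literal strings "space" and
# "grid_search" are assumed values of the SPACE and GRID_SEARCH constants. The parameter
# names and values are invented for illustration.

```python
def count_duplicate_trials(parameters: dict, num_samples: int) -> int:
    if num_samples == 1:
        return 0
    grid_trials = 1
    for param_info in parameters.values():
        if param_info.get("space") == "grid_search":
            # Each grid parameter multiplies the number of grid combinations.
            grid_trials *= len(param_info.get("values", []))
    # Every grid combination is repeated num_samples times; all but one copy is a duplicate.
    return grid_trials * num_samples - grid_trials


parameters = {
    "trainer.learning_rate": {"space": "grid_search", "values": [0.001, 0.01, 0.1]},
    "trainer.batch_size": {"space": "grid_search", "values": [64, 128]},
}
# 3 * 2 = 6 grid combinations; with num_samples=2 that is 12 trials, 6 of them duplicates.
duplicates = count_duplicate_trials(parameters, num_samples=2)
no_duplicates = count_duplicate_trials(parameters, num_samples=1)
```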
def log_warning_if_all_grid_type_parameters(hyperopt_config: HyperoptConfigDict) -> None:
"""Logs warning if all parameters have a grid type search space and num_samples > 1 since this will result in
duplicate trials being created."""
num_duplicate_trials = get_num_duplicate_trials(hyperopt_config)
if num_duplicate_trials == 0:
return
num_samples = hyperopt_config[EXECUTOR].get(NUM_SAMPLES, 1)
warnings.warn(
"All hyperopt parameters in Ludwig config are using grid_search space, but number of samples "
f"({num_samples}) is greater than 1. This will result in {num_duplicate_trials} duplicate trials being "
"created. Consider setting `num_samples` to 1 in the hyperopt executor to prevent trial duplication.",
RuntimeWarning,
)
def update_hyperopt_params_with_defaults(hyperopt_params: HyperoptConfigDict) -> None:
"""Updates user's Ludwig config with default hyperopt parameters."""
from ludwig.hyperopt.execution import executor_registry
set_default_value(hyperopt_params, EXECUTOR, {})
set_default_value(hyperopt_params, SPLIT, VALIDATION)
set_default_value(hyperopt_params, "output_feature", COMBINED)
set_default_value(hyperopt_params, METRIC, LOSS)
set_default_value(hyperopt_params, GOAL, MINIMIZE)
set_default_values(
hyperopt_params[EXECUTOR],
{TYPE: RAY, NUM_SAMPLES: 1, MAX_CONCURRENT_TRIALS: AUTO},
)
if hyperopt_params[EXECUTOR].get("trial_driver_resources") is None:
hyperopt_params[EXECUTOR]["trial_driver_resources"] = {"CPU": 1, "GPU": 0}
executor = get_from_registry(hyperopt_params[EXECUTOR][TYPE], executor_registry)
executor_defaults = {k: v for k, v in executor.__dict__.items() if k in get_class_attributes(executor)}
set_default_values(
hyperopt_params[EXECUTOR],
executor_defaults,
)
================================================
FILE: ludwig/hyperopt_cli.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import sys
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.hyperopt.run import hyperopt
from ludwig.utils.data_utils import load_config_from_str, load_yaml
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
logger = logging.getLogger(__name__)
def hyperopt_cli(
config: str | dict,
dataset: str = None,
training_set: str = None,
validation_set: str = None,
test_set: str = None,
training_set_metadata: str = None,
data_format: str = None,
experiment_name: str = "experiment",
model_name: str = "run",
# model_load_path=None,
# model_resume_path=None,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = False,
skip_save_unprocessed_output: bool = False,
skip_save_predictions: bool = False,
skip_save_eval_stats: bool = False,
skip_save_hyperopt_statistics: bool = False,
output_directory: str = "results",
gpus: str | int | list[int] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
backend: Backend | str = None,
random_seed: int = default_random_seed,
hyperopt_log_verbosity: int = 3,
**kwargs,
):
"""Searches for optimal hyperparameters.
# Inputs
:param config: (Union[str, dict]) in-memory representation of
config or string path to a YAML config file.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used for training.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param experiment_name: (str, default: `'experiment'`) name for
the experiment.
:param model_name: (str, default: `'run'`) name of the model that is
being used.
:param skip_save_training_description: (bool, default: `False`) disables
saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`) disables
saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch for enabling resuming of training, but if
the model is really big that can be time consuming and will use
twice as much space, use this parameter to skip it, but training
cannot be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving
TensorBoard logs. By default Ludwig saves logs for the TensorBoard,
but if it is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `False`) if input
dataset is provided it is preprocessed and cached by saving an HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param skip_save_unprocessed_output: (bool, default: `False`) by default
predictions and their probabilities are saved in both raw
unprocessed numpy files containing tensors and as postprocessed
CSV files (one for each output feature). If this parameter is True,
only the CSV ones are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files
:param skip_save_eval_stats: (bool, default: `False`) skips saving test
statistics JSON file
:param skip_save_hyperopt_statistics: (bool, default: `False`) skips saving
hyperopt stats file.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param random_seed: (int, default: `42`) random seed used for weights
initialization, splits and any other random function.
:param hyperopt_log_verbosity: (int, default: `3`) controls verbosity of Ray Tune log messages. Valid values:
0 = silent, 1 = only status updates, 2 = status and brief trial
results, 3 = status and detailed trial results.
# Return
:return: (HyperoptResults) the results returned by `hyperopt`.
"""
return hyperopt(
config=config,
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
# model_load_path=model_load_path,
# model_resume_path=model_resume_path,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
skip_save_eval_stats=skip_save_eval_stats,
skip_save_hyperopt_statistics=skip_save_hyperopt_statistics,
output_directory=output_directory,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
backend=backend,
random_seed=random_seed,
hyperopt_log_verbosity=hyperopt_log_verbosity,
**kwargs,
)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script searches for optimal hyperparameters",
prog="ludwig hyperopt",
usage="%(prog)s [options]",
)
# -------------------
# Hyperopt parameters
# -------------------
parser.add_argument(
"-sshs",
"--skip_save_hyperopt_statistics",
help="skips saving hyperopt statistics file",
action="store_true",
default=False,
)
# ----------------------------
# Experiment naming parameters
# ----------------------------
parser.add_argument(
"--output_directory",
type=str,
default="results",
help="directory that contains the results",
)
parser.add_argument("--experiment_name", type=str, default="hyperopt", help="experiment name")
parser.add_argument("--model_name", type=str, default="run", help="name for the model")
# ---------------
# Data parameters
# ---------------
parser.add_argument(
"--dataset",
help="input data file path. "
"If it has a split column, it will be used for splitting "
"(0: train, 1: validation, 2: test), "
"otherwise the dataset will be randomly split",
)
parser.add_argument("--training_set", help="input train data file path")
parser.add_argument("--validation_set", help="input validation data file path")
parser.add_argument("--test_set", help="input test data file path")
parser.add_argument(
"--training_set_metadata",
help="input metadata JSON file path. An intermediate preprocessed file "
"containing the mappings of the input file created "
"the first time a file is used, in the same directory "
"with the same name and a .meta.json extension",
)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
parser.add_argument(
"-sspi",
"--skip_save_processed_input",
help="skips saving intermediate HDF5 and JSON files",
action="store_true",
default=False,
)
# ----------------
# Model parameters
# ----------------
config = parser.add_mutually_exclusive_group(required=True)
config.add_argument(
"-c",
"--config",
type=load_yaml,
help="Path to the YAML file containing the model configuration",
)
config.add_argument(
"-cs",
"--config_str",
dest="config",
type=load_config_from_str,
help="JSON or YAML serialized string of the model configuration",
)
parser.add_argument(
"-mlp",
"--model_load_path",
help="path of a pretrained model to load as initialization",
)
parser.add_argument(
"-mrp",
"--model_resume_path",
help="path of the model directory to resume training of",
)
parser.add_argument(
"-sstd",
"--skip_save_training_description",
action="store_true",
default=False,
help="disables saving the description JSON file",
)
parser.add_argument(
"-ssts",
"--skip_save_training_statistics",
action="store_true",
default=False,
help="disables saving training statistics JSON file",
)
parser.add_argument(
"-ssm",
"--skip_save_model",
action="store_true",
default=False,
help="disables saving weights each time the model improves. "
"By default Ludwig saves weights after each epoch "
"the validation metric improves, but if the model is really big "
"that can be time consuming. If you do not want to keep "
"the weights and just find out what performance a model can get "
"with a set of hyperparameters, use this parameter to skip it",
)
parser.add_argument(
"-ssp",
"--skip_save_progress",
action="store_true",
default=False,
help="disables saving weights after each epoch. By default Ludwig saves "
"weights after each epoch for enabling resuming of training, but "
"if the model is really big that can be time consuming and will "
"use twice as much space, use this parameter to skip it",
)
parser.add_argument(
"-ssl",
"--skip_save_log",
action="store_true",
default=False,
help="disables saving TensorBoard logs. By default Ludwig saves "
"logs for the TensorBoard, but if it is not needed turning it off "
"can slightly increase the overall speed",
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-rs",
"--random_seed",
type=int,
default=42,
help="a random seed that is going to be used anywhere there is a call "
"to a random number generator: data splitting, parameter "
"initialization and training set shuffling",
)
parser.add_argument(
"-hlv",
"--hyperopt_log_verbosity",
type=int,
default=3,
choices=[0, 1, 2, 3],
help="Controls verbosity of Ray Tune log messages. Valid values: "
"0 = silent, 1 = only status updates, 2 = status and brief trial "
"results, 3 = status and detailed trial results.",
)
parser.add_argument("-g", "--gpus", nargs="+", type=int, default=None, help="list of gpus to use")
parser.add_argument(
"-gml",
"--gpu_memory_limit",
type=float,
default=None,
help="maximum memory fraction [0, 1] allowed to allocate per GPU device",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, " "defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("hyperopt", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.hyperopt")
args.backend = initialize_backend(args.backend or args.config.get("backend"))
if args.backend.is_coordinator():
print_ludwig("Hyperopt", LUDWIG_VERSION)
hyperopt_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
================================================
FILE: ludwig/model_export/base_model_exporter.py
================================================
from abc import ABC, abstractmethod
import torch
class LudwigTorchWrapper(torch.nn.Module):
"""Wraps a Ludwig model so it can be called with a single tensor, which is passed as the `image_path` input."""
def __init__(self, model):
super().__init__()
self.model = model
def forward(self, x):
return self.model({"image_path": x})
class BaseModelExporter(ABC):
@abstractmethod
def export(self, model_path, export_path, export_args_override):
pass
@abstractmethod
def check_model_export(self, path):
pass
================================================
FILE: ludwig/model_export/onnx_exporter.py
================================================
import os
import torch
from ludwig.api import LudwigModel
from ludwig.model_export.base_model_exporter import BaseModelExporter, LudwigTorchWrapper
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
class OnnxExporter(BaseModelExporter):
"""Class that abstracts the conversion of a torch model to ONNX."""
def export(self, model_path, export_path, output_model_name):
ludwig_model = LudwigModel.load(model_path)
model = LudwigTorchWrapper(ludwig_model.model) # Wrap the model
model.eval() # Inference mode. Possibly redundant: torch.onnx.export appears to handle this itself.
width = ludwig_model.config["input_features"][0]["preprocessing"]["width"]
height = ludwig_model.config["input_features"][0]["preprocessing"]["height"]
example_input = torch.randn(1, 3, height, width, requires_grad=True) # NCHW layout: (batch, channels, height, width)
torch.onnx.export(
model,
example_input,
os.path.join(export_path, output_model_name),
opset_version=18,
export_params=True,
do_constant_folding=True,
input_names=["input"],
output_names=["combiner_hidden_1", "output", "combiner_hidden_2"],
)
def check_model_export(self, path):
import onnx
onnx_model = onnx.load(path)
onnx.checker.check_model(onnx_model)
================================================
FILE: ludwig/models/__init__.py
================================================
================================================
FILE: ludwig/models/base.py
================================================
import contextlib
import logging
from abc import ABCMeta, abstractmethod
from collections import OrderedDict
from typing import Any
import numpy as np
import torch
import torchmetrics
from ludwig.combiners.combiners import Combiner
from ludwig.constants import COMBINED, LOSS, NAME
from ludwig.encoders.base import Encoder
from ludwig.features.base_feature import create_passthrough_input_feature, InputFeature, ModuleWrapper, OutputFeature
from ludwig.features.feature_registries import get_input_type_registry, get_output_type_registry
from ludwig.features.feature_utils import LudwigFeatureDict
from ludwig.modules.metric_modules import LudwigMetric
from ludwig.modules.training_hooks import TrainingHook
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig, FeatureCollection
from ludwig.utils.algorithms_utils import topological_sort_feature_dependencies
from ludwig.utils.metric_utils import get_scalar_from_ludwig_metric
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import DEVICE, LudwigModule, reg_loss
from ludwig.utils.types import TorchDevice
logger = logging.getLogger(__name__)
class BaseModel(LudwigModule, metaclass=ABCMeta):
"""Base class for Ludwig models, built on LudwigModule.
Subclasses must implement the following abstract methods:
- type()
- forward()
- save()
- load()
- get_args()
"""
@staticmethod
@abstractmethod
def type() -> str:
"""Returns the model type."""
def __init__(self, random_seed: int = None):
self._random_seed = random_seed
# TODO: with change to misc_utils.set_random_seed() this may be redundant
# seems to be required for test_api.py::test_api_training_determinism
if random_seed is not None:
torch.random.manual_seed(random_seed)
super().__init__()
self.input_features = self.create_feature_dict()
self.output_features = self.create_feature_dict()
# ================ Combined loss metric ================
self._eval_loss_metric = ModuleWrapper(torchmetrics.MeanMetric())
self._eval_additional_losses_metrics = ModuleWrapper(torchmetrics.MeanMetric())
# ================ Training Hook Handles ================
self._forward_hook_handles: list[TrainingHook] = []
def create_feature_dict(self) -> LudwigFeatureDict:
"""Creates and returns a LudwigFeatureDict."""
return LudwigFeatureDict()
def to_device(self, device):
return self.to(device)
def metrics_to_device(self, device: str):
self._eval_loss_metric.module = self._eval_loss_metric.module.to(device)
self._eval_additional_losses_metrics.module = self._eval_additional_losses_metrics.module.to(device)
for feature in self.output_features.values():
feature._eval_loss_metric.module = feature._eval_loss_metric.module.to(device)
@classmethod
def build_inputs(cls, input_feature_configs: FeatureCollection[BaseInputFeatureConfig]) -> dict[str, InputFeature]:
"""Builds and returns input features in topological order."""
input_features = OrderedDict()
input_features_def = topological_sort_feature_dependencies(input_feature_configs.to_list())
for input_feature_def in input_features_def:
input_features[input_feature_def[NAME]] = cls.build_single_input(
getattr(input_feature_configs, input_feature_def[NAME]), input_features
)
return input_features
@staticmethod
def build_single_input(
feature_config: BaseInputFeatureConfig, other_input_features: dict[str, InputFeature] | None
) -> InputFeature:
"""Builds a single input feature from the input feature definition."""
logger.debug(f"Input {feature_config.type} feature {feature_config.name}")
encoder_obj = None
if feature_config.tied is not None:
tied_input_feature_name = feature_config.tied
if tied_input_feature_name in other_input_features:
encoder_obj = other_input_features[tied_input_feature_name].encoder_obj
return create_input_feature(feature_config, encoder_obj)
@classmethod
def build_outputs(
cls, output_feature_configs: FeatureCollection[BaseOutputFeatureConfig], combiner: Combiner
) -> dict[str, OutputFeature]:
"""Builds and returns output features in topological order."""
output_features_def = topological_sort_feature_dependencies(output_feature_configs.to_list())
output_features = {}
for output_feature_def in output_features_def:
# TODO(Justin): Check that the semantics of input_size align with what the combiner's output shape returns
# for seq2seq.
setattr(getattr(output_feature_configs, output_feature_def[NAME]), "input_size", combiner.output_shape[-1])
output_features[output_feature_def[NAME]] = cls.build_single_output(
getattr(output_feature_configs, output_feature_def[NAME]), output_features
)
return output_features
@staticmethod
def build_single_output(
feature_config: BaseOutputFeatureConfig, output_features: dict[str, OutputFeature] | None
) -> OutputFeature:
"""Builds a single output feature from the output feature definition."""
logger.debug(f"Output {feature_config.type} feature {feature_config.name}")
output_feature_class = get_from_registry(feature_config.type, get_output_type_registry())
output_feature_obj = output_feature_class(feature_config, output_features=output_features)
return output_feature_obj
def get_model_inputs(self):
"""Returns a dict of feature name -> sample model input."""
device = next(self.parameters()).device
inputs = {
input_feature_name: input_feature.create_sample_input().to(device)
for input_feature_name, input_feature in self.input_features.items()
}
return inputs
def get_model_size(self) -> int:
"""Returns total number of parameters in model."""
model_tensors = self.collect_weights()
total_size = 0
for tnsr in model_tensors:
total_size += tnsr[1].detach().cpu().numpy().size
return total_size
def to_torchscript(self, device: TorchDevice | None = None):
"""Converts the ECD model to a TorchScript model."""
if device is None:
device = DEVICE
self.eval()
model_inputs = self.get_model_inputs()
model_to_script = self.to(device)
model_inputs_to_script = {k: v.to(device) for k, v in model_inputs.items()}
# We set strict=False to enable dict inputs and outputs.
return torch.jit.trace(model_to_script, model_inputs_to_script, strict=False)
def save_torchscript(self, save_path, device: TorchDevice | None = None):
"""Saves the ECD model as a TorchScript model."""
if device is None:
device = DEVICE
traced = self.to_torchscript(device)
traced.save(save_path)
@property
def input_shape(self):
"""Returns the shape of the model's input."""
# TODO(justin): Remove dummy implementation. Make input_shape and output_shape functions.
return torch.Size([1, 1])
@abstractmethod
def forward(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
mask=None,
) -> dict[str, torch.Tensor]:
"""Forward pass of the model.
Args:
inputs: Inputs to the model. Can be a dictionary of input names to
input tensors or a tuple of (inputs, targets) where inputs is
a dictionary of input names to input tensors and targets is a
dictionary of target names to target tensors.
mask: A mask for the inputs.
Returns:
A dictionary of output {feature name}::{tensor_name} -> output tensor.
"""
def predictions(self, inputs):
"""Returns the model's predictions for the given inputs."""
outputs = self(inputs)
return self.outputs_to_predictions(outputs)
def outputs_to_predictions(self, outputs: dict[str, torch.Tensor]) -> dict[str, dict[str, torch.Tensor]]:
"""Returns the model's predictions given the raw model outputs."""
predictions = {}
for of_name in self.output_features:
predictions[of_name] = self.output_features.get(of_name).predictions(outputs, of_name)
return predictions
def evaluation_step(self, inputs, targets):
"""Predict the inputs and update evaluation metrics."""
predictions = self.predictions(inputs)
self.update_metrics(targets, predictions)
return predictions
def predict_step(self, inputs):
"""Predict the inputs."""
return self.predictions(inputs)
def train_loss(
self,
targets,
predictions,
regularization_type: str | None = None,
regularization_lambda: float | None = None,
) -> tuple[torch.Tensor, dict[str, torch.Tensor]]:
"""Computes the training loss for the model.
Args:
targets: A dictionary of target names to target tensors.
predictions: A dictionary of output names to output tensors.
regularization_type: One of 'l1', 'l2', 'l1_l2', or None.
regularization_lambda: The regularization lambda.
Returns:
A tuple of the loss tensor and a dictionary of loss for every
output feature.
"""
train_loss = 0
of_train_losses = {}
for of_name, of_obj in self.output_features.items():
of_train_loss = of_obj.train_loss(targets[of_name], predictions, of_name)
train_loss += of_obj.loss.weight * of_train_loss
of_train_losses[of_name] = of_train_loss
additional_losses = self.losses()
if additional_losses:
train_loss += torch.sum(torch.stack(additional_losses)) # other losses
# Add regularization loss
if regularization_type is not None and regularization_lambda != 0:
train_loss += reg_loss(self, regularization_type, l1=regularization_lambda, l2=regularization_lambda)
return train_loss, of_train_losses
def eval_loss(self, targets, predictions):
"""Computes all evaluation losses for the model given targets and predictions.
Args:
targets: A dictionary of target names to target tensors.
predictions: A dictionary of output names to output tensors.
Returns:
A tuple of loss values for eval losses and additional losses.
"""
eval_loss = 0
for of_name, of_obj in self.output_features.items():
of_eval_loss = of_obj.eval_loss(targets[of_name], predictions[of_name])
eval_loss += of_obj.loss.weight * of_eval_loss
additional_loss = 0
additional_losses = self.losses()
if additional_losses:
additional_loss = torch.sum(torch.stack(additional_losses)) # other losses
return eval_loss, additional_loss
def update_metrics(self, targets, predictions):
"""Updates the model's metrics given targets and predictions."""
for of_name, of_obj in self.output_features.items():
of_obj.update_metrics(targets[of_name], predictions[of_name])
eval_loss, additional_losses = self.eval_loss(targets, predictions)
self.eval_loss_metric.update(eval_loss)
self.eval_additional_losses_metrics.update(additional_losses)
@property
def eval_loss_metric(self) -> LudwigMetric:
return self._eval_loss_metric.module
@eval_loss_metric.setter
def eval_loss_metric(self, value: LudwigMetric) -> None:
self._eval_loss_metric.module = value
@property
def eval_additional_losses_metrics(self) -> LudwigMetric:
return self._eval_additional_losses_metrics.module
def get_metrics(self) -> dict[str, dict[str, float]]:
"""Returns a dictionary of metrics for each output feature of the model."""
all_of_metrics = {}
for of_name, of_obj in self.output_features.items():
all_of_metrics[of_name] = of_obj.get_metrics()
all_of_metrics[COMBINED] = {
LOSS: get_scalar_from_ludwig_metric(self.eval_loss_metric)
+ get_scalar_from_ludwig_metric(self.eval_additional_losses_metrics)
}
return all_of_metrics
def reset_metrics(self):
"""Resets the model's metrics."""
for of_obj in self.output_features.values():
of_obj.reset_metrics()
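self.eval_loss_metric.reset()
# get_metrics() sums both combined-loss metrics, so reset both.
self.eval_additional_losses_metrics.reset()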
def collect_weights(self, tensor_names=None, **kwargs):
"""Returns named parameters filtered against `tensor_names` if not None."""
if not tensor_names:
return self.named_parameters()
# Check for bad tensor names.
weight_names = {name for name, _ in self.named_parameters()}
for name in tensor_names:
if name not in weight_names:
raise ValueError(f'Requested tensor name filter "{name}" not present in the model graph') # noqa: E713
# Apply filter.
tensor_set = set(tensor_names)
return [named_param for named_param in self.named_parameters() if named_param[0] in tensor_set]
def unskip(self):
"""Converts all skipped features into their fully encoded versions."""
@abstractmethod
def save(self, save_path: str):
"""Saves the model to the given path."""
@abstractmethod
def load(self, save_path: str):
"""Loads the model from the given path."""
@abstractmethod
def get_args(self):
"""Returns init arguments for constructing this model."""
@contextlib.contextmanager
def use_generation_config(self, generation_config: dict[str, Any]):
if generation_config is not None:
raise NotImplementedError(f"{self.__class__.__name__} does not support generation_config.")
yield
def _activate_forward_hooks(self):
"""Activates/registers forward hooks for the model."""
def _deactivate_forward_hooks(self) -> None:
"""Deactivates/de-registers forward hooks for the model (if needed)."""
for handle in self._forward_hook_handles:
handle.deactivate_hook()
def create_input_feature(feature_config: BaseInputFeatureConfig, encoder_obj: Encoder | None) -> InputFeature:
input_feature_cls = get_from_registry(feature_config.type, get_input_type_registry())
input_feature = input_feature_cls(feature_config, encoder_obj=encoder_obj)
if not feature_config.encoder.skip:
return input_feature
return create_passthrough_input_feature(input_feature, feature_config)
================================================
FILE: ludwig/models/calibrator.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import numpy as np
from ludwig.backend import Backend
from ludwig.models.ecd import ECD
class Calibrator:
"""Calibrator calibrates the output probabilities of a model."""
def __init__(self, model: ECD, backend: Backend, batch_size: int = 128):
self.model = model
self.backend = backend
self.batch_size = batch_size
def calibration_enabled(self):
"""Calibration is enabled if the config requests calibration for any output feature.
If no output features have calibration enabled, the calibration phase should be skipped.
"""
return any(o.calibration_module is not None for o in self.model.output_features.values())
def train_calibration(self, dataset, dataset_name: str):
"""Calibrates model output probabilities on validation set after training.
This works well for most datasets, though it may fail for some difficult or extremely imbalanced datasets.
"""
if not self.calibration_enabled():
# Early out if no output features have calibration enabled.
return
with self.backend.create_predictor(self.model, batch_size=self.batch_size) as predictor:
metrics, predictions = predictor.batch_evaluation(
dataset, collect_predictions=True, collect_logits=True, dataset_name=dataset_name
)
dataset_df = dataset.to_df()
for output_feature in self.model.output_features.values():
if output_feature.calibration_module is not None:
feature_logits_key = f"{output_feature.feature_name}_logits"
if feature_logits_key in predictions:
feature_logits = self.backend.df_engine.compute(predictions[feature_logits_key])
feature_labels = self.backend.df_engine.compute(dataset_df[output_feature.proc_column])
output_feature.calibration_module.train_calibration(
np.stack(feature_logits.values, axis=0), np.stack(feature_labels.values, axis=0)
)
================================================
FILE: ludwig/models/ecd.py
================================================
import logging
import os
import numpy as np
import torch
from ludwig.accounting.used_tokens import get_used_tokens_for_ecd
from ludwig.combiners.combiners import create_combiner
from ludwig.constants import MODEL_ECD, MODEL_LLM, USED_TOKENS
from ludwig.globals import MODEL_WEIGHTS_FILE_NAME
from ludwig.models.base import BaseModel
from ludwig.schema.model_types.ecd import ECDModelConfig
from ludwig.utils import output_feature_utils
from ludwig.utils.augmentation_utils import AugmentationPipelines
from ludwig.utils.data_utils import clear_data_cache
from ludwig.utils.fs_utils import open_file
from ludwig.utils.state_dict_backward_compatibility import update_state_dict
from ludwig.utils.torch_utils import get_torch_device
logger = logging.getLogger(__name__)
class ECD(BaseModel):
@staticmethod
def type() -> str:
return MODEL_ECD
def __init__(
self,
config_obj: ECDModelConfig,
random_seed=None,
**_kwargs,
):
self.config_obj = config_obj
self._random_seed = random_seed
super().__init__(random_seed=self._random_seed)
# ================ Inputs ================
try:
self.input_features.update(self.build_inputs(input_feature_configs=self.config_obj.input_features))
except KeyError as e:
raise KeyError(
f"An input feature has a name that conflicts with a class attribute of torch's ModuleDict: {e}"
) from e
# ================ Combiner ================
logger.debug(f"Combiner {self.config_obj.combiner.type}")
self.combiner = create_combiner(self.config_obj.combiner, input_features=self.input_features)
# ================ Outputs ================
self.output_features.update(
self.build_outputs(output_feature_configs=self.config_obj.output_features, combiner=self.combiner)
)
# After constructing all layers, clear the cache to free up memory
clear_data_cache()
def prepare_for_training(self):
# 1/10/23: For parity with how the LLM model type sets up adapters and quantization, LLM encoders should call
# `prepare_for_training` at training time rather than at initialization. This loop searches for input features
# using the LLM encoder and calls `prepare_for_training` on those encoders only. No other changes should be
# made to the ECD model itself or any other encoders.
for feature in self.config_obj.input_features:
encoder_type = feature.encoder.type
if encoder_type == MODEL_LLM:
feature_name = feature.name
encoder = self.input_features.get(feature_name)
encoder.prepare_for_training()
def encode(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
):
# Convert inputs to tensors.
for input_feature_name, input_values in inputs.items():
if not isinstance(input_values, torch.Tensor):
inputs[input_feature_name] = torch.from_numpy(input_values)
else:
inputs[input_feature_name] = input_values
encoder_outputs = {}
for input_feature_name, input_values in inputs.items():
encoder = self.input_features.get(input_feature_name)
encoder_output = encoder(input_values)
encoder_outputs[input_feature_name] = encoder_output
return encoder_outputs
def combine(self, encoder_outputs):
return self.combiner(encoder_outputs)
def decode(self, combiner_outputs, targets, mask):
# Invoke output features.
output_logits = {}
output_last_hidden = {}
for output_feature_name, output_feature in self.output_features.items():
# Use the presence or absence of targets to signal training or prediction.
target = targets[output_feature_name] if targets is not None else None
decoder_outputs = output_feature(combiner_outputs, output_last_hidden, mask=mask, target=target)
# Add decoder outputs to overall output dictionary.
for decoder_output_name, tensor in decoder_outputs.items():
output_feature_utils.set_output_feature_tensor(
output_logits, output_feature_name, decoder_output_name, tensor
)
# Save the hidden state of the output feature (for feature dependencies).
output_last_hidden[output_feature_name] = decoder_outputs["last_hidden"]
return output_logits
def forward(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
mask=None,
) -> dict[str, torch.Tensor]:
"""Forward pass of the model.
Args:
inputs: Inputs to the model. Can be a dictionary of input names to
input tensors or a tuple of (inputs, targets) where inputs is
a dictionary of input names to input tensors and targets is a
dictionary of target names to target tensors.
mask: A mask for the inputs.
Returns:
A dictionary of output {feature name}::{tensor_name} -> output tensor.
"""
if isinstance(inputs, tuple):
inputs, targets = inputs
# Convert targets to tensors.
for target_feature_name, target_value in targets.items():
if not isinstance(target_value, torch.Tensor):
targets[target_feature_name] = torch.from_numpy(target_value)
else:
targets[target_feature_name] = target_value
else:
targets = None
assert list(inputs.keys()) == self.input_features.keys()
encoder_outputs = self.encode(inputs)
combiner_outputs = self.combine(encoder_outputs)
decoder_outputs = self.decode(combiner_outputs, targets, mask)
# Compute the number of used tokens.
decoder_outputs[USED_TOKENS] = get_used_tokens_for_ecd(inputs, targets)
return decoder_outputs
def unskip(self):
for k in self.input_features.keys():
self.input_features.set(k, self.input_features.get(k).unskip())
def save(self, save_path):
"""Saves the model to the given path."""
weights_save_path = os.path.join(save_path, MODEL_WEIGHTS_FILE_NAME)
torch.save(self.state_dict(), weights_save_path)
# Ensure the file is fully flushed to disk before any other process reads it
with open(weights_save_path, "rb") as f:
os.fsync(f.fileno())
def load(self, save_path):
"""Loads the model from the given path."""
weights_save_path = os.path.join(save_path, MODEL_WEIGHTS_FILE_NAME)
device = torch.device(get_torch_device())
with open_file(weights_save_path, "rb") as f:
state_dict = torch.load(f, map_location=device)
self.load_state_dict(update_state_dict(state_dict))
def get_args(self):
"""Returns init arguments for constructing this model."""
return (
self.config_obj.input_features.to_list(),
self.config_obj.combiner.to_dict(),
self.config_obj.output_features.to_list(),
self._random_seed,
)
def get_augmentation_pipelines(self) -> AugmentationPipelines:
"""Returns the augmentation pipeline for this model."""
# dictionary to hold any augmentation pipeline
augmentation_pipelines = {}
# loop through all input features and add their augmentation pipeline to the dictionary
for input_feature in self.config_obj.input_features:
# if augmentation was specified for this input feature, add AugmentationPipeline to dictionary
if input_feature.has_augmentation():
# use input feature proc_column as key because that is what is used in the Batcher
augmentation_pipelines[input_feature.proc_column] = self.input_features.get(
input_feature.name
).get_augmentation_pipeline()
return AugmentationPipelines(augmentation_pipelines)
================================================
FILE: ludwig/models/embedder.py
================================================
from collections.abc import Callable
import numpy as np
import pandas as pd
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT, MODEL_ECD, NAME, PROC_COLUMN, TYPE
from ludwig.features.feature_registries import get_input_type_registry
from ludwig.features.feature_utils import LudwigFeatureDict
from ludwig.models.base import BaseModel
from ludwig.schema.features.base import BaseInputFeatureConfig, FeatureCollection
from ludwig.schema.features.utils import get_input_feature_cls
from ludwig.types import FeatureConfigDict, TrainingSetMetadataDict
from ludwig.utils.batch_size_tuner import BatchSizeEvaluator
from ludwig.utils.dataframe_utils import from_numpy_dataset
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import get_torch_device, LudwigModule
@DeveloperAPI
class Embedder(LudwigModule):
def __init__(self, feature_configs: list[FeatureConfigDict], metadata: TrainingSetMetadataDict):
super().__init__()
self.input_features = LudwigFeatureDict()
input_feature_configs = []
for feature in feature_configs:
feature_cls = get_from_registry(feature[TYPE], get_input_type_registry())
# TODO(travis): this assumes ECD is the selected model type. The best solution is to change the
# input params from FeatureConfigDict types to BaseInputFeatureConfig types, which will require a
# refactor of preprocessing to use the schema, not the dict types.
feature_obj = get_input_feature_cls(MODEL_ECD, feature[TYPE]).from_dict(feature)
feature_cls.update_config_with_metadata(feature_obj, metadata[feature[NAME]])
# When running prediction or eval, we need the preprocessing to use the original pretrained
# weights, which requires unsetting this field. In the future, we could avoid this by plumbing
# through the saved weights and loading them dynamically after building the model.
feature_obj.encoder.saved_weights_in_checkpoint = False
input_feature_configs.append(feature_obj)
feature_collection = FeatureCollection[BaseInputFeatureConfig](input_feature_configs)
try:
self.input_features.update(BaseModel.build_inputs(input_feature_configs=feature_collection))
except KeyError as e:
raise KeyError(
f"An input feature has a name that conflicts with a class attribute of torch's ModuleDict: {e}"
) from e
def forward(self, inputs: dict[str, torch.Tensor]):
encoder_outputs = {}
for input_feature_name, input_values in inputs.items():
encoder = self.input_features.get(input_feature_name)
encoder_output = encoder(input_values)
encoder_outputs[input_feature_name] = encoder_output[ENCODER_OUTPUT]
return encoder_outputs
@DeveloperAPI
def create_embed_batch_size_evaluator(
features_to_encode: list[FeatureConfigDict], metadata: TrainingSetMetadataDict
) -> BatchSizeEvaluator:
class _EmbedBatchSizeEvaluator(BatchSizeEvaluator):
def __init__(self):
embedder = Embedder(features_to_encode, metadata)
self.device = get_torch_device()
self.embedder = embedder.to(self.device)
self.embedder.eval()
def step(self, batch_size: int, global_max_sequence_length: int | None = None):
inputs = {
input_feature_name: input_feature.create_sample_input(batch_size=batch_size).to(self.device)
for input_feature_name, input_feature in self.embedder.input_features.items()
}
with torch.no_grad():
self.embedder(inputs)
return _EmbedBatchSizeEvaluator
@DeveloperAPI
def create_embed_transform_fn(
features_to_encode: list[FeatureConfigDict], metadata: TrainingSetMetadataDict
) -> Callable:
class EmbedTransformFn:
def __init__(self):
embedder = Embedder(features_to_encode, metadata)
self.device = get_torch_device()
self.embedder = embedder.to(self.device)
self.embedder.eval()
def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
batch = _prepare_batch(df, features_to_encode, metadata)
name_to_proc = {i_feat.feature_name: i_feat.proc_column for i_feat in self.embedder.input_features.values()}
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(self.device)
for i_feat in self.embedder.input_features.values()
}
with torch.no_grad():
encoder_outputs = self.embedder(inputs)
encoded = {name_to_proc[k]: v.detach().cpu().float().numpy() for k, v in encoder_outputs.items()}
output_df = from_numpy_dataset(encoded)
for c in output_df.columns:
df[c] = output_df[c]
return df
return EmbedTransformFn
# TODO(travis): consolidate with implementation in data/ray.py
def _prepare_batch(
df: pd.DataFrame, features: list[FeatureConfigDict], metadata: TrainingSetMetadataDict
) -> dict[str, np.ndarray]:
batch = {}
for feature in features:
c = feature[PROC_COLUMN]
if df[c].values.dtype == "object":
# Ensure columns stacked instead of turned into np.array([np.array, ...], dtype=object) objects
batch[c] = np.stack(df[c].values)
else:
batch[c] = df[c].to_numpy()
for feature in features:
c = feature[PROC_COLUMN]
reshape = metadata.get(feature[NAME], {}).get("reshape")
if reshape is not None:
batch[c] = batch[c].reshape((-1, *reshape))
return batch
================================================
FILE: ludwig/models/inference.py
================================================
import logging
import os
from typing import Any, TYPE_CHECKING
import pandas as pd
import torch
from torch import nn
from ludwig.constants import NAME, POSTPROCESSOR, PREDICTOR, PREPROCESSOR, TYPE
from ludwig.data.postprocessing import convert_dict_to_df
from ludwig.data.preprocessing import load_metadata
from ludwig.features.feature_registries import get_input_type_registry
from ludwig.features.feature_utils import get_module_dict_key_from_name, get_name_from_module_dict_key
from ludwig.globals import MODEL_HYPERPARAMETERS_FILE_NAME, TRAIN_SET_METADATA_FILE_NAME
from ludwig.types import ModelConfigDict, TrainingSetMetadataDict
from ludwig.utils import output_feature_utils
from ludwig.utils.data_utils import load_json, save_json
from ludwig.utils.inference_utils import get_filename_from_stage, to_inference_module_input_from_dataframe
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.output_feature_utils import get_feature_name_from_concat_name, get_tensor_name_from_concat_name
from ludwig.utils.torch_utils import DEVICE
from ludwig.utils.types import TorchDevice, TorchscriptPreprocessingInput
# Prevents circular import errors from typing.
if TYPE_CHECKING:
from ludwig.models.base import BaseModel
logger = logging.getLogger(__name__)
class InferenceModule(nn.Module):
"""A nn.Module subclass that wraps the inference preprocessor, predictor, and postprocessor."""
def __init__(
self,
preprocessor: torch.jit.ScriptModule,
predictor: torch.jit.ScriptModule,
postprocessor: torch.jit.ScriptModule,
config: ModelConfigDict | None = None,
training_set_metadata: TrainingSetMetadataDict | None = None,
):
super().__init__()
self.preprocessor = preprocessor
self.predictor = predictor
self.postprocessor = postprocessor
self.config = config
# Do not remove – used by Predibase app
self.training_set_metadata = training_set_metadata
def preprocessor_forward(self, inputs: dict[str, TorchscriptPreprocessingInput]) -> dict[str, torch.Tensor]:
"""Forward pass through the preprocessor."""
return self.preprocessor(inputs)
def predictor_forward(self, preproc_inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
"""Forward pass through the predictor.
Ensures that the inputs are on the correct device. The outputs are on the same device as self.predictor.
"""
for k, v in preproc_inputs.items():
preproc_inputs[k] = v.to(self.predictor.device)
with torch.no_grad(): # Ensure model params do not compute gradients
predictions_flattened = self.predictor(preproc_inputs)
return predictions_flattened
def postprocessor_forward(self, predictions_flattened: dict[str, torch.Tensor]) -> dict[str, dict[str, Any]]:
"""Forward pass through the postprocessor."""
postproc_outputs_flattened: dict[str, Any] = self.postprocessor(predictions_flattened)
# Turn flat inputs into nested predictions per feature name
postproc_outputs: dict[str, dict[str, Any]] = _unflatten_dict_by_feature_name(postproc_outputs_flattened)
return postproc_outputs
def forward(self, inputs: dict[str, TorchscriptPreprocessingInput]) -> dict[str, dict[str, Any]]:
preproc_inputs: dict[str, torch.Tensor] = self.preprocessor_forward(inputs)
predictions_flattened: dict[str, torch.Tensor] = self.predictor_forward(preproc_inputs)
postproc_outputs: dict[str, dict[str, Any]] = self.postprocessor_forward(predictions_flattened)
return postproc_outputs
@torch.jit.unused
def predict(self, dataset: pd.DataFrame, return_type: dict | pd.DataFrame = pd.DataFrame) -> pd.DataFrame | dict:
"""Predict on a batch of data with an interface similar to LudwigModel.predict."""
inputs = to_inference_module_input_from_dataframe(dataset, self.config, load_paths=True)
preds = self(inputs)
if return_type == pd.DataFrame:
preds = convert_dict_to_df(preds)
return preds, None # Second return value is for compatibility with LudwigModel.predict
@torch.jit.unused
@classmethod
def from_ludwig_model(
cls: "InferenceModule",
model: "BaseModel",
config: ModelConfigDict,
training_set_metadata: TrainingSetMetadataDict,
device: TorchDevice | None = None,
):
"""Create an InferenceModule from a trained LudwigModel."""
if device is None:
logger.info(f'No device specified. Loading using device "{DEVICE}".')
device = DEVICE
stage_to_module = _init_inference_stages_from_ludwig_model(
model, config, training_set_metadata, device=device, scripted=True
)
return cls(
stage_to_module[PREPROCESSOR],
stage_to_module[PREDICTOR],
stage_to_module[POSTPROCESSOR],
config=config,
training_set_metadata=training_set_metadata,
)
@torch.jit.unused
@classmethod
def from_directory(
cls: "InferenceModule",
directory: str,
device: TorchDevice | None = None,
):
"""Create an InferenceModule from a directory containing a model, config, and training set metadata."""
if device is None:
logger.info(f'No device specified. Loading using device "{DEVICE}".')
device = DEVICE
stage_to_module = _init_inference_stages_from_directory(directory, device=device)
config_path = os.path.join(directory, MODEL_HYPERPARAMETERS_FILE_NAME)
config = load_json(config_path) if os.path.exists(config_path) else None
metadata_path = os.path.join(directory, TRAIN_SET_METADATA_FILE_NAME)
training_set_metadata = load_metadata(metadata_path) if os.path.exists(metadata_path) else None
return cls(
stage_to_module[PREPROCESSOR],
stage_to_module[PREDICTOR],
stage_to_module[POSTPROCESSOR],
config=config,
training_set_metadata=training_set_metadata,
)
class _InferencePreprocessor(nn.Module):
"""Wraps preprocessing modules into a single nn.Module.
TODO(geoffrey): Implement torchscript-compatible feature_utils.LudwigFeatureDict to replace
get_module_dict_key_from_name and get_name_from_module_dict_key usage.
"""
def __init__(self, config: ModelConfigDict, training_set_metadata: TrainingSetMetadataDict):
super().__init__()
self.preproc_modules = nn.ModuleDict()
for feature_config in config["input_features"]:
feature_name = feature_config[NAME]
feature = get_from_registry(feature_config[TYPE], get_input_type_registry())
# prevents collisions with reserved keywords
module_dict_key = get_module_dict_key_from_name(feature_name)
self.preproc_modules[module_dict_key] = feature.create_preproc_module(training_set_metadata[feature_name])
def forward(self, inputs: dict[str, TorchscriptPreprocessingInput]) -> dict[str, torch.Tensor]:
preproc_inputs = {}
for module_dict_key, preproc in self.preproc_modules.items():
feature_name = get_name_from_module_dict_key(module_dict_key)
preproc_inputs[feature_name] = preproc(inputs[feature_name])
return preproc_inputs
class _InferencePredictor(nn.Module):
"""Wraps model forward pass + predictions into a single nn.Module.
The forward call of this module returns a flattened dictionary in order to support Triton input/output.
TODO(geoffrey): Implement torchscript-compatible feature_utils.LudwigFeatureDict to replace
get_module_dict_key_from_name and get_name_from_module_dict_key usage.
"""
def __init__(self, model: "BaseModel", device: TorchDevice):
super().__init__()
self.device = torch.device(device)
self.model = model.to_torchscript(self.device)
self.predict_modules = nn.ModuleDict()
for feature_name, feature in model.output_features.items():
# prevents collisions with reserved keywords
module_dict_key = get_module_dict_key_from_name(feature_name)
self.predict_modules[module_dict_key] = feature.prediction_module.to(device=self.device)
def forward(self, preproc_inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
model_outputs = self.model(preproc_inputs)
predictions_flattened: dict[str, torch.Tensor] = {}
for module_dict_key, predict in self.predict_modules.items():
feature_name = get_name_from_module_dict_key(module_dict_key)
feature_predictions = predict(model_outputs, feature_name)
# Flatten out the predictions to support Triton input/output
for predict_key, tensor_values in feature_predictions.items():
predict_concat_key = output_feature_utils.get_feature_concat_name(feature_name, predict_key)
predictions_flattened[predict_concat_key] = tensor_values
return predictions_flattened
class _InferencePostprocessor(nn.Module):
"""Wraps postprocessing modules into a single nn.Module.
The forward call of this module returns a flattened dictionary in order to support Triton input/output.
TODO(geoffrey): Implement torchscript-compatible feature_utils.LudwigFeatureDict to replace
get_module_dict_key_from_name and get_name_from_module_dict_key usage.
"""
def __init__(self, model: "BaseModel", training_set_metadata: TrainingSetMetadataDict):
super().__init__()
self.postproc_modules = nn.ModuleDict()
for feature_name, feature in model.output_features.items():
# prevents collisions with reserved keywords
module_dict_key = get_module_dict_key_from_name(feature_name)
self.postproc_modules[module_dict_key] = feature.create_postproc_module(training_set_metadata[feature_name])
def forward(self, predictions_flattened: dict[str, torch.Tensor]) -> dict[str, Any]:
postproc_outputs_flattened: dict[str, Any] = {}
for module_dict_key, postproc in self.postproc_modules.items():
feature_name = get_name_from_module_dict_key(module_dict_key)
feature_postproc_outputs = postproc(predictions_flattened, feature_name)
# Flatten out the predictions to support Triton input/output
for postproc_key, tensor_values in feature_postproc_outputs.items():
postproc_concat_key = output_feature_utils.get_feature_concat_name(feature_name, postproc_key)
postproc_outputs_flattened[postproc_concat_key] = tensor_values
return postproc_outputs_flattened
def save_ludwig_model_for_inference(
save_path: str,
model: "BaseModel",
config: ModelConfigDict,
training_set_metadata: TrainingSetMetadataDict,
device: TorchDevice | None = None,
model_only: bool = False,
) -> None:
"""Saves a LudwigModel (a BaseModel model, config, and training_set_metadata) for inference."""
if device is None:
logger.info(f'No device specified. Saving using device "{DEVICE}".')
device = DEVICE
stage_to_filenames = {
stage: get_filename_from_stage(stage, device) for stage in [PREPROCESSOR, PREDICTOR, POSTPROCESSOR]
}
stage_to_module = _init_inference_stages_from_ludwig_model(
model, config, training_set_metadata, device, scripted=True
)
if model_only:
stage_to_module[PREDICTOR].save(os.path.join(save_path, stage_to_filenames[PREDICTOR]))
else:
config_path = os.path.join(save_path, MODEL_HYPERPARAMETERS_FILE_NAME)
if not os.path.exists(config_path):
save_json(config_path, config)
logger.info(f"Saved model config to {config_path}")
training_set_metadata_path = os.path.join(save_path, TRAIN_SET_METADATA_FILE_NAME)
if not os.path.exists(training_set_metadata_path):
save_json(training_set_metadata_path, training_set_metadata)
logger.info(f"Saved training set metadata to {training_set_metadata_path}")
for stage, module in stage_to_module.items():
module.save(os.path.join(save_path, stage_to_filenames[stage]))
logger.info(f"Saved torchscript module for {stage} to {stage_to_filenames[stage]}.")
def _init_inference_stages_from_directory(
directory: str,
device: TorchDevice,
) -> dict[str, torch.nn.Module]:
"""Initializes inference stage modules from directory."""
stage_to_filenames = {
stage: get_filename_from_stage(stage, device) for stage in [PREPROCESSOR, PREDICTOR, POSTPROCESSOR]
}
stage_to_module = {}
for stage in [PREPROCESSOR, PREDICTOR, POSTPROCESSOR]:
stage_to_module[stage] = torch.jit.load(os.path.join(directory, stage_to_filenames[stage]))
logger.info(f"Loaded torchscript module for {stage} from {stage_to_filenames[stage]}.")
return stage_to_module
def _init_inference_stages_from_ludwig_model(
model: "BaseModel",
config: ModelConfigDict,
training_set_metadata: TrainingSetMetadataDict,
device: TorchDevice,
scripted: bool = True,
) -> dict[str, torch.nn.Module]:
"""Initializes inference stage modules from a LudwigModel (a BaseModel model, config, and
training_set_metadata)."""
preprocessor = _InferencePreprocessor(config, training_set_metadata)
predictor = _InferencePredictor(model, device=device)
postprocessor = _InferencePostprocessor(model, training_set_metadata)
stage_to_module = {
PREPROCESSOR: preprocessor,
PREDICTOR: predictor,
POSTPROCESSOR: postprocessor,
}
if scripted:
stage_to_module = {stage: torch.jit.script(module) for stage, module in stage_to_module.items()}
return stage_to_module
def _unflatten_dict_by_feature_name(flattened_dict: dict[str, Any]) -> dict[str, dict[str, Any]]:
"""Convert a flattened dictionary of objects to a nested dictionary of outputs per feature name."""
outputs: dict[str, dict[str, Any]] = {}
for concat_key, tensor_values in flattened_dict.items():
feature_name = get_feature_name_from_concat_name(concat_key)
tensor_name = get_tensor_name_from_concat_name(concat_key)
feature_outputs: dict[str, Any] = {}
if feature_name not in outputs:
outputs[feature_name] = feature_outputs
else:
feature_outputs = outputs[feature_name]
feature_outputs[tensor_name] = tensor_values
return outputs
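# Illustrative sketch only (not part of Ludwig): the flatten/unflatten round trip that the
# predictor, postprocessor, and _unflatten_dict_by_feature_name rely on. It assumes concat
# keys have the form "{feature_name}::{tensor_name}". The "::" separator constant and both
# helper names below are hypothetical and exist only for this example; the real helpers
# live in ludwig.utils.output_feature_utils and may differ in detail.
from typing import Any

_EXAMPLE_SEPARATOR = "::"  # assumed separator between feature name and tensor name


def _example_flatten(nested: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Flatten {feature: {tensor: value}} into {"feature::tensor": value}."""
    return {
        f"{feature_name}{_EXAMPLE_SEPARATOR}{tensor_name}": value
        for feature_name, tensors in nested.items()
        for tensor_name, value in tensors.items()
    }


def _example_unflatten(flat: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Inverse of _example_flatten: group values under their feature-name prefix."""
    nested: dict[str, dict[str, Any]] = {}
    for concat_key, value in flat.items():
        # Split on the last separator so the tensor name is always the final component.
        feature_name, tensor_name = concat_key.rsplit(_EXAMPLE_SEPARATOR, 1)
        nested.setdefault(feature_name, {})[tensor_name] = value
    return nested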
================================================
FILE: ludwig/models/llm.py
================================================
import contextlib
import logging
import os
from typing import Any
import numpy as np
import torch
from transformers import AutoConfig, GenerationConfig
from ludwig.accounting.used_tokens import get_used_tokens_for_llm
from ludwig.constants import IGNORE_INDEX_TOKEN_ID, LOGITS, MODEL_LLM, PREDICTIONS, TEXT, USED_TOKENS
from ludwig.features.base_feature import ModuleWrapper, OutputFeature
from ludwig.features.feature_utils import LudwigFeatureDict
from ludwig.features.text_feature import TextOutputFeature
from ludwig.globals import MODEL_WEIGHTS_FILE_NAME
from ludwig.models.base import BaseModel
from ludwig.modules.training_hooks import NEFTuneHook
from ludwig.schema.features.base import BaseOutputFeatureConfig, FeatureCollection
from ludwig.schema.model_types.llm import LLMModelConfig
from ludwig.utils.augmentation_utils import AugmentationPipelines
from ludwig.utils.data_utils import clear_data_cache
from ludwig.utils.llm_quantization_utils import convert_quantized_linear_to_linear
from ludwig.utils.llm_utils import (
add_left_padding,
generate_merged_ids,
get_context_len,
get_realigned_target_and_prediction_tensors_for_inference,
initialize_adapter,
load_pretrained_from_config,
pad_target_tensor_for_fine_tuning,
remove_left_padding,
to_device,
)
from ludwig.utils.logging_utils import log_once
from ludwig.utils.output_feature_utils import set_output_feature_tensor
from ludwig.utils.tokenizers import HFTokenizer
from ludwig.utils.torch_utils import reg_loss
logger = logging.getLogger(__name__)
class DictWrapper:
"""Wrapper for a LudwigFeatureDict module that allows for iteration over keys.
The purpose of this class is to avoid exposing input and output features as modules of the LLM. This is because we
only wish to train the underlying model, and having these additional modules can confuse systems like DeepSpeed.
"""
def __init__(self, obj: LudwigFeatureDict):
self.obj = obj
def get(self, key) -> torch.nn.Module:
return self.obj.get(key)
def set(self, key: str, module: torch.nn.Module) -> None:
self.obj.set(key, module)
def __len__(self) -> int:
return len(self.obj)
def __next__(self) -> str:
return next(iter(self.obj))
def __iter__(self) -> None:
return iter(self.obj.keys())
def keys(self) -> list[str]:
return self.obj.keys()
def values(self) -> list[torch.nn.Module]:
return self.obj.values()
def items(self) -> list[tuple[str, torch.nn.Module]]:
return self.obj.items()
def update(self, modules: dict[str, torch.nn.Module]) -> None:
self.obj.update(modules)
class LLM(BaseModel):
@staticmethod
def type() -> str:
return MODEL_LLM
def __init__(
self,
config_obj: LLMModelConfig,
random_seed=None,
_device=None,
**_kwargs,
):
super().__init__(random_seed=random_seed)
self.config_obj = config_obj
self._random_seed = random_seed
self.model_name = self.config_obj.base_model
self.model_config = AutoConfig.from_pretrained(
self.config_obj.base_model,
trust_remote_code=self.config_obj.trust_remote_code,
)
self.model = load_pretrained_from_config(self.config_obj, model_config=self.model_config)
self.curr_device = next(self.model.parameters()).device
logger.info("Done.")
self.context_len = get_context_len(self.model_config)
# TODO(Arnav): This needs to be more flexible to account for RoPE Scaling
# When merging input IDs and target IDs for LLM fine-tuning, we want to make sure that the merged tensor is
# not longer than the global maximum sequence length. This is provided in the preprocessing config. We never
# want to exceed the maximum possible context length so we also check for that.
if self.config_obj.preprocessing.global_max_sequence_length:
global_max_sequence_length = self.config_obj.preprocessing.global_max_sequence_length
self.global_max_sequence_length = (
global_max_sequence_length if global_max_sequence_length <= self.context_len else self.context_len
)
else:
self.global_max_sequence_length = self.context_len
# Initialize tokenizer
self.tokenizer = HFTokenizer(
self.config_obj.base_model,
trust_remote_code=self.config_obj.trust_remote_code,
).tokenizer
self._set_generation_config(self.config_obj.generation.to_dict())
# ================ Inputs ================
try:
self.input_features.update(self.build_inputs(input_feature_configs=self.config_obj.input_features))
except KeyError as e:
raise KeyError(
f"An input feature has a name that conflicts with a class attribute of torch's ModuleDict: {e}"
) from e
# This is used to store the model inputs during the forward pass when fine-tuning LLMs. This allows us to have
# access to the joint model inputs (input_ids and target_ids) when computing metrics. In particular, the target
# ids are needed to correctly compute next token softmax cross entropy loss.
self.model_inputs = None
# ================ Outputs ================
self.output_feature_type = self.config_obj.output_features[0].type
self.output_features.update(
self.build_outputs(
output_feature_configs=self.config_obj.output_features,
# Set the input size to the model vocab size instead of the tokenizer vocab size
# because the model has additional "head" layers that are used to predict the next
# token in the sequence. These head layers can add additional dimensions to the
# logits tensor, beyond the vocab_size dimension.
input_size=self.input_shape[-1] if self.output_feature_type == TEXT else self.model_config.vocab_size,
)
)
# Extract the decoder object for the forward pass
self._output_feature_decoder = ModuleWrapper(self.output_features.items()[0][1])
self.attention_masks = None
clear_data_cache()
def create_feature_dict(self) -> DictWrapper:
return DictWrapper(LudwigFeatureDict())
@contextlib.contextmanager
def use_generation_config(self, generation_config_dict: dict[str, Any] | None = None):
"""Sets the generation config for the model."""
# Save the original generation config so that we can restore it after self.generation is
# dynamically mutated during one-off predict calls after fine-tuning.
original_generation_config_dict = self.generation.to_dict()
try:
# no-op if generation_config_dict is None
if generation_config_dict is not None:
# unwrap the original generation config, update it with the new generation config
new_generation_config_dict = {**original_generation_config_dict, **generation_config_dict}
self._set_generation_config(new_generation_config_dict)
yield
finally:
self._set_generation_config(original_generation_config_dict)
def _set_generation_config(self, new_generation_config_dict: dict[str, Any]):
self.generation = GenerationConfig(**new_generation_config_dict)
# We need to manually set the pad_token_id to the tokenizer's pad_token_id for certain models like GPT and
# CodeLlama to avoid getting an error. This workaround can be found here:
# (https://github.com/huggingface/transformers/issues/25353#issuecomment-1669339754)
self.generation.pad_token_id = self.tokenizer.pad_token_id
self.max_new_tokens = self.generation.max_new_tokens
# max input length value copied from FastChat
# https://github.com/lm-sys/FastChat/blob/0e958b852a14f4bef5f0e9d7a5e7373477329cf2/fastchat/serve/inference.py#L183 # noqa E501
self.max_input_length = self.context_len - self.max_new_tokens - 8
@property
def output_feature_decoder(self) -> OutputFeature:
return self._output_feature_decoder.module
def initialize_adapter(self):
"""If an adapter config is provided, we want to wrap the model with a PEFT model for fine-tuning."""
if self.config_obj.adapter:
if self.config_obj.trainer.type != "finetune" and not self.config_obj.adapter.pretrained_adapter_weights:
raise ValueError(
"Adapter config was provided, but trainer type is not set to `finetune`. Either set the trainer to "
"`finetune` or remove the adapter config."
)
self.model = initialize_adapter(self.model, self.config_obj)
logger.info("==================================================")
logger.info("Trainable Parameter Summary For Fine-Tuning")
logger.info(f"Fine-tuning with adapter: {self.config_obj.adapter.type}")
self.model.print_trainable_parameters()
logger.info("==================================================")
def prepare_for_training(self):
# TODO: this implementation will not work if resuming from a previous checkpoint. Need to fix this.
if self.config_obj.quantization:
self.prepare_for_quantized_training()
self.initialize_adapter()
def prepare_for_quantized_training(self):
from peft import prepare_model_for_kbit_training
self.model = prepare_model_for_kbit_training(self.model, use_gradient_checkpointing=False)
def to_device(self, device):
# Always refresh curr_device from actual parameter location, since
# nn.Module.to() can move parameters without updating curr_device.
self.curr_device = next(self.model.parameters()).device
self.model, device = to_device(self.model, device, self.config_obj, self.curr_device)
self.curr_device = device
return self
@classmethod
def build_outputs(
cls, output_feature_configs: FeatureCollection[BaseOutputFeatureConfig], input_size: int
) -> dict[str, OutputFeature]:
"""Builds and returns output feature."""
# TODO: only single task currently
if len(output_feature_configs) > 1:
raise ValueError("The LLM model type only supports a single output feature.")
output_feature_config = output_feature_configs[0]
output_feature_config.input_size = input_size
output_features = {}
output_feature = cls.build_single_output(output_feature_config, output_features)
output_features[output_feature_config.name] = output_feature
return output_features
def forward(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
mask=None,
) -> dict[str, torch.Tensor]:
"""Produces logits tensor for finetuning the model.
Args:
inputs: Inputs to the model. Can be a dictionary of input names to
input tensors or a tuple of (inputs, targets) where inputs is
a dictionary of input names to input tensors and targets is a
dictionary of target names to target tensors.
mask: A mask for the inputs.
Returns:
A dictionary of output {feature name}::{tensor_name} -> output tensor.
"""
input_ids, target_ids = self._unpack_inputs(inputs)
# Generate merged input_id, target_id pairs for the model, and create corresponding attention masks
# We save them as instance attributes so that we can use them when realigning target and prediction tensors
self.model_inputs, self.attention_masks = generate_merged_ids(
input_ids, target_ids, self.tokenizer, self.global_max_sequence_length
)
# TODO (jeffkinnison): Determine why the 8-bit `SCB` and `CB` matrices are deleted in the forward pass
model_outputs = self.model(input_ids=self.model_inputs, attention_mask=self.attention_masks).get(LOGITS)
if self.output_feature_type != TEXT:
# Pass generated tokens through decoder after averaging the token probabilities
# This is required for the classification head for the classifier decoder
model_outputs = torch.mean(model_outputs, dim=1)
if self.output_feature_type == TEXT:
decoder_outputs = model_outputs
else:
decoder_outputs = self.output_feature_decoder.decoder_obj(model_outputs)
# Set the output feature tensor to the decoder outputs (logits)
outputs = {}
of_name = self.config_obj.output_features[0].name
set_output_feature_tensor(outputs, of_name, LOGITS, decoder_outputs)
# Get predictions, probabilities and logits tensor from the output feature's predictions function
outputs = self.output_features.get(of_name).predictions(outputs, of_name)
# Cast to float32 for metric computation in case we're using DeepSpeed with
# reduced precision such as bfloat16.
for prediction_key, prediction_tensor in outputs.items():
if prediction_key != PREDICTIONS:
# Skipping casting it to float32 since the predictions are tokens and they should be int64
# (which is already the case)
outputs[prediction_key] = prediction_tensor.type(torch.float32)
# Add token usage.
outputs[USED_TOKENS] = get_used_tokens_for_llm(self.model_inputs, self.tokenizer)
return outputs
def generate(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
mask=None,
) -> dict[str, torch.Tensor]:
"""Generates tokens using the model."""
log_once(f"For generating text, using: {self.generation}")
input_ids, _ = self._unpack_inputs(inputs)
with torch.no_grad():
input_lengths = []
sequences_list = []
for input_ids_sample in input_ids:
input_ids_sample_no_padding = remove_left_padding(input_ids_sample, self.tokenizer)
if input_ids_sample_no_padding.shape[1] > self.max_input_length:
logger.warning(
f"Input length {input_ids_sample_no_padding.shape[1]} is "
f"greater than max input length {self.max_input_length}. Truncating."
)
input_ids_sample_no_padding = input_ids_sample_no_padding[:, -self.max_input_length :] # noqa E203
input_lengths.append(input_ids_sample_no_padding.shape[1])
# Ensure input_ids are on the same device as the model
model_device = next(self.model.parameters()).device
input_ids_sample_no_padding = input_ids_sample_no_padding.to(model_device)
# Generate text using the model
model_outputs = self.model.generate(
input_ids=input_ids_sample_no_padding,
attention_mask=mask,
generation_config=self.generation,
return_dict_in_generate=True,
output_scores=True,
)
sequences_list.append(model_outputs.sequences[0])
# Extract the predictions, probabilities and logits from the model outputs
# through the forward pass of the output feature
outputs = self.output_feature_decoder.decoder_obj.forward(
sequences_list,
input_lengths,
self.max_new_tokens,
)
return outputs
def is_merge_and_unload_set(self) -> bool:
"""Check whether the "adapter" configuration section exists and, if so, whether it contains the
"postprocessor" subsection with the "merge_adapter_into_base_model" directive enabled.
# Return
:return (bool): whether merge_and_unload should be done.
"""
return (
self.config_obj.adapter is not None
and self.config_obj.adapter.postprocessor is not None
and self.config_obj.adapter.postprocessor.merge_adapter_into_base_model
)
def merge_and_unload(self, progressbar: bool = False) -> None:
"""This method merges the LoRA layers into the base model. This is needed if someone wants to use the base
model as a standalone model. The implementation calls merge_and_unload() of the underlying LoraModel class
(in peft).
Args:
progressbar (bool): whether to show a progressbar indicating the unload and merge process
"""
from peft import LoraModel
if isinstance(self.model.base_model, LoraModel):
self.model.base_model.merge_and_unload(progressbar=progressbar)
else:
raise ValueError("This operation requires an LLM model trained with a LoRA adapter.")
def _unpack_inputs(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
) -> tuple[torch.Tensor, torch.Tensor | None]:
"""Converts input tensors to input ids."""
if isinstance(inputs, tuple):
inputs, targets = inputs
# Convert targets to tensors.
for target_feature_name, target_value in targets.items():
if not isinstance(target_value, torch.Tensor):
targets[target_feature_name] = torch.from_numpy(target_value)
else:
targets[target_feature_name] = target_value
else:
targets = None
assert list(inputs.keys()) == self.input_features.keys()
input_ids = self.get_input_ids(inputs)
target_ids = self.get_target_ids(targets) if targets else None
return input_ids, target_ids
def get_input_ids(
self,
inputs: (
dict[str, torch.Tensor] | dict[str, np.ndarray] | tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]
),
) -> torch.Tensor:
"""Returns the input ids for the text feature input."""
return inputs[self.config_obj.input_features[0].name].type(torch.int32)
def get_target_ids(self, outputs: dict[str, torch.Tensor]) -> torch.Tensor:
"""Returns the output ids for the text feature output."""
return outputs[self.config_obj.output_features[0].name].type(torch.int32)
def update_metrics(self, targets, predictions):
"""Updates the model's metrics given targets and predictions for zero-shot/few-shot."""
for of_name, of_obj in self.output_features.items():
if isinstance(of_obj, TextOutputFeature):
# Align the target length with the predictions length to enable text metric evaluation.
_targets, _predictions = get_realigned_target_and_prediction_tensors_for_inference(
targets, predictions, of_name, self.tokenizer
)
of_obj.update_metrics(_targets[of_name], _predictions[of_name], self.tokenizer)
else:
of_obj.update_metrics(targets[of_name], predictions[of_name])
# HACK (Tim): get the device of the targets to transfer self.eval_loss_metric to the same device
target_device = list(targets.values())[0].device
eval_loss, additional_losses = self.eval_loss(targets, predictions)
self.eval_loss_metric = self.eval_loss_metric.to(target_device)
self.eval_loss_metric.update(eval_loss)
self.eval_additional_losses_metrics.update(additional_losses)
def update_metrics_finetune_llm(self, targets, predictions):
"""Updates the model's metrics given targets and predictions for fine-tuning."""
_targets, _predictions = targets, predictions
for of_name, of_obj in self.output_features.items():
if isinstance(of_obj, TextOutputFeature):
# Update the target tensor to enable text metric evaluation. This pads the target tensor with -100s
# to match the prediction length and depends on how much of the target tensor was included in the
# forward pass.
_targets = self._update_target_tensor_for_finetuning(_targets, _predictions, of_name)
if isinstance(of_obj, TextOutputFeature):
of_obj.update_metrics(_targets[of_name], _predictions[of_name], self.tokenizer)
else:
of_obj.update_metrics(_targets[of_name], _predictions[of_name])
continue
of_obj.update_metrics(_targets[of_name], _predictions[of_name])
eval_loss, additional_losses = self.eval_loss(_targets, _predictions)
self.eval_loss_metric.update(eval_loss)
self.eval_additional_losses_metrics.update(additional_losses)
def train_loss(
self,
targets,
predictions,
regularization_type: str | None = None,
regularization_lambda: float | None = None,
) -> tuple[torch.Tensor, dict[str, torch.Tensor]]:
"""Computes the training loss for the model.
Args:
targets: A dictionary of target names to target tensors.
predictions: A dictionary of output names to output tensors.
regularization_type: One of 'l1', 'l2', 'l1_l2', or None.
regularization_lambda: The regularization lambda.
Returns:
A tuple of the loss tensor and a dictionary of loss for every
output feature.
"""
train_loss = 0
of_train_losses = {}
for of_name, of_obj in self.output_features.items():
_targets, _predictions = targets, predictions
if isinstance(of_obj, TextOutputFeature):
_predictions = {of_name: _predictions}
# Update the target tensor to enable text metric evaluation. This pads the target tensor with -100s
# to match the prediction length and depends on how much of the target tensor was included in the
# forward pass.
_targets = self._update_target_tensor_for_finetuning(_targets, _predictions, of_name)
# TODO(Arnav): Seems like doing this again and going between these format types is unnecessary, but
# refactor so that we don't have to do this at a later point.
predictions = {}
for key, _ in _predictions[of_name].items():
set_output_feature_tensor(predictions, of_name, key, _predictions[of_name][key])
_predictions = predictions
of_train_loss = of_obj.train_loss(_targets[of_name], _predictions, of_name)
train_loss += of_obj.loss.weight * of_train_loss
of_train_losses[of_name] = of_train_loss
additional_losses = self.losses()
if additional_losses:
train_loss += torch.sum(torch.stack(additional_losses)) # other losses
# Add regularization loss
if regularization_type is not None and regularization_lambda != 0:
train_loss += reg_loss(self, regularization_type, l1=regularization_lambda, l2=regularization_lambda)
return train_loss, of_train_losses
def eval_loss(self, targets, predictions):
"""Computes all evaluation losses for the model given targets and predictions.
Args:
targets: A dictionary of target names to target tensors.
predictions: A dictionary of output names to output tensors.
Returns:
A tuple of loss values for eval losses and additional losses.
"""
eval_loss = 0
for of_name, of_obj in self.output_features.items():
if isinstance(of_obj, TextOutputFeature):
# Align the target length with the predictions length to enable text metric evaluation.
_targets, _predictions = get_realigned_target_and_prediction_tensors_for_inference(
targets, predictions, of_name, self.tokenizer
)
of_eval_loss = of_obj.eval_loss(_targets[of_name], _predictions[of_name])
else:
# HACK(geoffrey): we need a non-empty loss, so we just fill it with zeros
of_eval_loss = torch.tensor(0.0).to(predictions[of_name][LOGITS].device)
eval_loss += of_obj.loss.weight * of_eval_loss
additional_loss = 0
additional_losses = self.losses()
if additional_losses:
additional_loss = torch.sum(torch.stack(additional_losses)) # other losses
return eval_loss, additional_loss
def outputs_to_predictions(self, outputs: dict[str, torch.Tensor]) -> dict[str, dict[str, torch.Tensor]]:
"""Returns the model's predictions for each output feature."""
predictions = {}
for of_name in self.output_features:
# TODO(travis): this will need to change when we support multiple output features
predictions[of_name] = outputs
return predictions
def save(self, save_path):
"""Saves the model to the given path."""
# TODO(travis): use the implementation of trainer itself to decide whether to save the model, to
# avoid this hack
if self.config_obj.trainer.type != "none":
weights_save_path = os.path.join(save_path, MODEL_WEIGHTS_FILE_NAME)
# We initialize the model's generation configuration; otherwise, we get a validation error.
self.model.generation_config = self.generation
self.model.save_pretrained(weights_save_path)
else:
logger.info("Skipped saving LLM without weight adjustments.")
def save_base_model(self, save_path):
"""Saves the base LLM model to the given path."""
# TODO: see the "TODO" statement from "LLM.save()" in this module.
if self.config_obj.trainer.type != "none":
weights_save_path = os.path.join(save_path, MODEL_WEIGHTS_FILE_NAME)
self.model.base_model.save_pretrained(weights_save_path)
# This class initializes the tokenizer from the base model automatically, so the tokenizer does not need to
# be saved for inference through LudwigModel.predict(). We save it anyway so that models fine-tuned with
# Ludwig and pushed to HuggingFace Hub can later be used for inference independently of Ludwig.
self.tokenizer.save_pretrained(weights_save_path)
else:
logger.info("Skipped saving LLM without weight adjustments.")
def save_dequantized_base_model(self, save_path: str) -> None:
"""Upscales quantized weights of a model to fp16 and saves the result in a folder specified by save_path.
Args:
save_path (str): The path to the folder where the upscaled model weights will be saved.
Returns:
None
"""
from peft import PeftModel
if isinstance(self.model, PeftModel):
# Get the base model back by removing all the adapter modules without merging.
logger.warning(
"LLM model is currently wrapped in a PeftModel. Removing the adapter layers and saving the base model. "
"Reload the model via LudwigModel.load() to use your trained adapter layers for inference."
)
self.model = self.model.unload()
# Dequantize the model weights and cast them to fp16 - replace quantized layers with appropriate
# linear layers in-place.
logger.info("Upscaling quantized weights to fp16...")
convert_quantized_linear_to_linear(self.model)
logger.info("Done.")
# Reset the quantization configuration on the model.
# We assign an empty dict instead of deleting the attribute: the quantization config is a property of the
# model, and HF's config serialization raises an error when `self.model.config` is accessed after a key has
# been deleted from it: TypeError: Object of type dtype is not JSON serializable.
self.model.config.quantization_config = {}
# Override properties of the model to indicate that it is no longer quantized.
# This is also necessary to ensure that the model can be saved, otherwise it will raise an error like
# "You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported"
# See: https://github.com/huggingface/transformers/blob/0ad4e7e6dad670a7151aaceb1af3c272a3bf73a8/src/transformers/modeling_utils.py#L2054 # noqa
self.model.is_loaded_in_4bit = False
self.model.is_loaded_in_8bit = False
# Save the model
logger.info(f"Saving upscaled model to {save_path}")
self.model.save_pretrained(save_path)
logger.info("Done.")
# Save the tokenizer
logger.info(f"Saving tokenizer to {save_path}")
self.tokenizer.save_pretrained(save_path)
logger.info("Done.")
def load(self, save_path):
"""Loads the model from the given path."""
weights_save_path = os.path.join(save_path, MODEL_WEIGHTS_FILE_NAME)
if self.config_obj.adapter:
# Check if the saved weights are merged (no adapter_config.json) or adapter-only
adapter_config_path = os.path.join(weights_save_path, "adapter_config.json")
if os.path.exists(adapter_config_path):
from peft import PeftModel # noqa
if isinstance(self.model, PeftModel):
# Unwrap and reload PeftModel
self.model = self.model.base_model
self.model = PeftModel.from_pretrained(self.model, weights_save_path)
else:
# Weights were already merged (merge_and_unload was done before save),
# so load as a regular pretrained model.
logger.info("Loading merged LoRA weights (no adapter_config.json found).")
self.model = load_pretrained_from_config(
self.config_obj, model_config=self.model_config, weights_save_path=weights_save_path
)
elif self.config_obj.trainer.type != "none":
self.model = load_pretrained_from_config(
self.config_obj, model_config=self.model_config, weights_save_path=weights_save_path
)
else:
logger.info("Skipped loading LLM without weight adjustments.")
def get_args(self):
"""Returns init arguments for constructing this model."""
return (
self.config_obj.input_features.to_list(),
self.config_obj.output_features.to_list(),
self._random_seed,
)
def _update_target_tensor_for_finetuning(
self, targets: dict[str, torch.Tensor], predictions: dict[str, torch.Tensor], of_name: str
) -> dict[str, torch.Tensor]:
"""Update target tensor for fine-tuning.
This method removes left padding from target tensors, appends an EOS token to the end of the target tensors,
and pads the target tensors with -100 to ensure equal length for loss computation. It then realigns the
target tensors with the prediction tensors.
Args:
targets (Dict[str, torch.Tensor]): A dictionary containing the target tensors.
predictions (Dict[str, torch.Tensor]): A dictionary containing the predicted tensors.
of_name (str): The name of the target tensor.
Returns:
Dict[str, torch.Tensor]: A dictionary containing the updated target tensors aligned with predictions.
"""
# Remove left padding from target tensors since we also do this for the model's forward pass when we
# concatenate the input_ids with the target_ids. We also need to append the EOS token to the end of the
# target tensors.
targets_without_padding = []
lengths = []
eos_token_tensor = torch.tensor([self.tokenizer.eos_token_id])
for target in targets[of_name]:
target = remove_left_padding(target, self.tokenizer)[0]
target = torch.cat([target, eos_token_tensor.to(device=target.device)], dim=-1).unsqueeze(0)
targets_without_padding.append(target)
lengths.append(target.shape[1])
# We need all target tensors to have the same length for the loss computation. We left-pad the target
# tensors with -100 (IGNORE_INDEX_TOKEN_ID) so that padded positions are ignored during the softmax
# cross entropy loss computation. This ensures that the loss is computed only for the target tokens.
max_length = max(lengths)
for i, target in enumerate(targets_without_padding):
targets_without_padding[i] = add_left_padding(
targets_without_padding[i][0],
max_length,
IGNORE_INDEX_TOKEN_ID,
)
targets[of_name] = torch.stack(targets_without_padding, dim=0).to(
dtype=targets[of_name].dtype,
device=targets[of_name].device,
)
# Re-align target tensors without padding to have equal length before realigning with the prediction
# tensors. Padding left with -100 to match the length of the target tensor masks the input ids during
# softmax cross entropy loss computation. This ensures that the loss is computed only for the target
# token IDs. Examples:
# BERTLMHead: https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/models/bert/modeling_bert.py#L1216-L1219 # noqa
# GPTNeoForCausalLM: https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/models/gpt_neo/modeling_gpt_neo.py#L736 # noqa
_targets = pad_target_tensor_for_fine_tuning(targets, predictions, self.model_inputs, of_name)
return _targets
def _activate_forward_hooks(self):
"""Activates/registers forward hooks for the model."""
if not self.config_obj.model_parameters:
return
# Initialize forward hook handles
if self.config_obj.model_parameters.neftune_noise_alpha:
self._forward_hook_handles.append(
NEFTuneHook(neftune_noise_alpha=self.config_obj.model_parameters.neftune_noise_alpha)
)
# Activate forward hooks iteratively
for hook in self._forward_hook_handles:
# Update the model with the forward hooks in place
self.model = hook.activate_hook(self.model)
@staticmethod
def get_augmentation_pipelines() -> AugmentationPipelines:
"""Returns the augmentation pipeline for this model."""
return AugmentationPipelines({})
================================================
FILE: ludwig/models/predictor.py
================================================
import logging
import os
import sys
from abc import ABC, abstractmethod
from collections import defaultdict, OrderedDict
from pprint import pformat
import numpy as np
import pandas as pd
import psutil
import torch
from torch import nn
from ludwig.constants import COMBINED, LAST_HIDDEN, LOGITS, MODEL_ECD, MODEL_LLM
from ludwig.data.dataset.base import Dataset
from ludwig.data.utils import convert_to_dict
from ludwig.distributed.base import DistributedStrategy, LocalStrategy
from ludwig.globals import is_progressbar_disabled, PREDICTIONS_PARQUET_FILE_NAME, TEST_STATISTICS_FILE_NAME
from ludwig.models.base import BaseModel
from ludwig.progress_bar import LudwigProgressBar
from ludwig.utils.data_utils import save_csv, save_json
from ludwig.utils.dataframe_utils import from_numpy_dataset
from ludwig.utils.print_utils import repr_ordered_dict
from ludwig.utils.registry import Registry
from ludwig.utils.strings_utils import make_safe_filename
from ludwig.utils.torch_utils import get_torch_device
EXCLUDE_PRED_SET = {LOGITS, LAST_HIDDEN}
SKIP_EVAL_METRICS = {"confusion_matrix", "roc_curve"}
STATS_SAMPLE_SIZE = 10000
logger = logging.getLogger(__name__)
class BasePredictor(ABC):
@abstractmethod
def batch_predict(self, dataset, dataset_name=None):
raise NotImplementedError()
@abstractmethod
def predict_single(self, batch):
raise NotImplementedError()
@abstractmethod
def batch_evaluation(self, dataset, collect_predictions=False, collect_logits=False, dataset_name=None):
raise NotImplementedError()
@abstractmethod
def batch_collect_activations(self, layer_names, dataset, bucketing_field=None):
raise NotImplementedError()
# Remote implementations may override this
def shutdown(self):
pass
# Functions needed to treat the predictor as a context manager
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.shutdown()
_predictor_registry = Registry[BasePredictor]()
def register_predictor(model_types: list[str]):
def wrap(cls):
for model_type in model_types:
_predictor_registry[model_type] = cls
return cls
return wrap
def get_predictor_cls(model_type: str) -> type[BasePredictor]:
return _predictor_registry[model_type]
@register_predictor([MODEL_ECD])
class Predictor(BasePredictor):
"""Predictor is a class that uses a model to predict and evaluate."""
def __init__(
self,
dist_model: nn.Module,
batch_size: int = 128,
distributed: DistributedStrategy = None,
report_tqdm_to_ray: bool = False,
model: BaseModel | None = None,
remote: bool = False,
**kwargs,
):
"""
:param dist_model: model to use for prediction, post-wrap for distributed training
:param batch_size: batch size to use for prediction
:param distributed: distributed strategy to use for prediction
:param report_tqdm_to_ray: whether to report tqdm progress to Ray
:param model: Ludwig BaseModel before being wrapped for distributed training.
Used to call Ludwig helper functions.
"""
model = model or dist_model
assert isinstance(model, BaseModel)
self._batch_size = batch_size
self._distributed = distributed if distributed is not None else LocalStrategy()
self.report_tqdm_to_ray = report_tqdm_to_ray
device = get_torch_device()
self.device = device
self.dist_model = dist_model
self.model = model
self.model.metrics_to_device(device)
if remote:
# Only return results from rank 0 to reduce network overhead
self.batch_predict = self._distributed.return_first(self.batch_predict)
self.batch_evaluation = self._distributed.return_first(self.batch_evaluation)
def batch_predict(self, dataset: Dataset, dataset_name: str = None, collect_logits: bool = False):
self.dist_model = self._distributed.to_device(self.dist_model)
prev_model_training_mode = self.dist_model.training # store previous model training mode
self.dist_model.eval() # set model to eval mode
with torch.no_grad():
with dataset.initialize_batcher(self._batch_size, should_shuffle=False) as batcher:
progress_bar_config = {
"desc": "Prediction" if dataset_name is None else f"Prediction {dataset_name: <5.5}",
"total": batcher.steps_per_epoch,
"file": sys.stdout,
"disable": is_progressbar_disabled(),
}
progress_bar = LudwigProgressBar(self.report_tqdm_to_ray, progress_bar_config, self.is_coordinator())
predictions = defaultdict(list)
while not batcher.last_batch():
batch = batcher.next_batch()
preds = self._predict(batch)
self._accumulate_preds(
preds, predictions, exclude_pred_set={LAST_HIDDEN} if collect_logits else EXCLUDE_PRED_SET
)
progress_bar.update(1)
progress_bar.close()
# consolidate predictions from each batch to a single tensor
self._concat_preds(predictions)
self.dist_model.train(prev_model_training_mode)
return from_numpy_dataset(predictions)
def predict_single(self, batch, collect_logits: bool = False):
prev_model_training_mode = self.dist_model.training # store previous model training mode
self.dist_model.eval() # set model to eval mode
with torch.no_grad():
predictions = defaultdict(list)
preds = self._predict(batch)
self._accumulate_preds(
preds, predictions, exclude_pred_set={LAST_HIDDEN} if collect_logits else EXCLUDE_PRED_SET
)
self._concat_preds(predictions)
# reset model to its original training mode
self.dist_model.train(prev_model_training_mode)
return from_numpy_dataset(predictions)
def _predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
"""Predict a batch of data.
Params:
batch: batch of data
Returns:
predictions: dictionary of predictions
"""
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(self.device)
for i_feat in self.model.input_features.values()
}
outputs = self._predict_on_inputs(inputs)
return self.model.outputs_to_predictions(outputs)
def _accumulate_preds(self, preds, predictions, exclude_pred_set=EXCLUDE_PRED_SET):
# accumulate predictions from batch for each output feature
for of_name, of_preds in preds.items():
for pred_name, pred_values in of_preds.items():
if pred_name not in exclude_pred_set:
key = f"{of_name}_{pred_name}"
predictions[key].append(pred_values.detach().cpu())
def _concat_preds(self, predictions):
for key, pred_value_list in predictions.items():
# The tensors were already detached and moved to CPU in _accumulate_preds, so they can be
# concatenated and converted to numpy directly.
predictions[key] = torch.cat(pred_value_list, dim=0).numpy()
def batch_evaluation(self, dataset, collect_predictions=False, collect_logits=False, dataset_name=None):
"""Batch evaluate model on dataset.
Params:
dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
collect_predictions: Return model predictions.
collect_logits: Return model logits and final layer activations.
Returns:
Tuple of dictionaries of (metrics, predictions). The keys of metrics are determined by the metrics in the
model config. The keys of the predictions dictionary depend on which values are requested by the caller:
collect_predictions, collect_logits.
"""
self.dist_model = self._distributed.to_device(self.dist_model)
prev_model_training_mode = self.dist_model.training # store previous model training mode
self.dist_model.eval() # set model to eval mode
with torch.no_grad():
with dataset.initialize_batcher(
self._batch_size, should_shuffle=False, distributed=self._distributed
) as batcher:
progress_bar_config = {
"desc": "Evaluation" if dataset_name is None else f"Evaluation {dataset_name: <5.5}",
"total": batcher.steps_per_epoch,
"file": sys.stdout,
"disable": is_progressbar_disabled(),
"position": 0, # Necessary to disable extra new line artifacts in training logs.
}
progress_bar = LudwigProgressBar(self.report_tqdm_to_ray, progress_bar_config, self.is_coordinator())
predictions = defaultdict(list)
eval_steps = (
self.dist_model.config_obj.trainer.eval_steps
if hasattr(self.dist_model, "config_obj")
and hasattr(self.dist_model.config_obj.trainer, "eval_steps")
else None
)
eval_steps_counter = 0
while not batcher.last_batch():
if eval_steps and eval_steps_counter >= eval_steps:
logger.info(f"Reached evaluation step {eval_steps}. Ending evaluation.")
break
batch = batcher.next_batch()
logger.debug(
f"evaluation for {dataset_name}: obtained next batch "
f"memory used: {psutil.Process(os.getpid()).memory_info()[0] / 1e6:0.2f}MB"
)
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(
self.device
)
for i_feat in self.model.input_features.values()
}
targets = {
o_feat.feature_name: torch.from_numpy(np.array(batch[o_feat.proc_column], copy=True)).to(
self.device
)
for o_feat in self.model.output_features.values()
}
outputs = self._predict_on_inputs(inputs)
preds = self.model.outputs_to_predictions(outputs)
self.model.update_metrics(targets, preds)
# accumulate predictions from batch for each output feature
if collect_predictions:
self._accumulate_preds(
preds, predictions, exclude_pred_set={LAST_HIDDEN} if collect_logits else EXCLUDE_PRED_SET
)
progress_bar.update(1)
eval_steps_counter += 1
if self.is_coordinator():
logger.debug(
f"evaluation for {dataset_name}: completed batch {progress_bar.total_steps} "
f"memory used: {psutil.Process(os.getpid()).memory_info()[0] / 1e6:0.2f}MB"
)
progress_bar.close()
# consolidate predictions from each batch to a single tensor
if collect_predictions:
self._concat_preds(predictions)
metrics = self.model.get_metrics()
self.model.reset_metrics()
self.dist_model.train(prev_model_training_mode) # Restores previous model training mode.
return metrics, from_numpy_dataset(predictions)
def batch_collect_activations(self, layer_names, dataset, bucketing_field=None):
if bucketing_field:
raise ValueError("BucketedBatcher is not supported yet")
self.dist_model = self._distributed.to_device(self.dist_model)
prev_model_training_mode = self.dist_model.training # store previous model training mode
self.dist_model.eval() # set model to eval mode
with torch.no_grad():
with dataset.initialize_batcher(
self._batch_size, should_shuffle=False, distributed=self._distributed
) as batcher:
progress_bar_config = {
"desc": "Collecting Tensors",
"total": batcher.steps_per_epoch,
"file": sys.stdout,
"disable": is_progressbar_disabled(),
}
progress_bar = LudwigProgressBar(self.report_tqdm_to_ray, progress_bar_config, self.is_coordinator())
collected_tensors = []
while not batcher.last_batch():
batch = batcher.next_batch()
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(
self.device
)
for i_feat in self.model.input_features.values()
}
outputs = self._predict_on_inputs(inputs)
collected_tensors = [(concat_name, tensor) for concat_name, tensor in outputs.items()]
progress_bar.update(1)
progress_bar.close()
self.dist_model.train(prev_model_training_mode) # Restores previous model training mode.
return collected_tensors
def _predict_on_inputs(self, inputs: dict) -> dict:
return self.dist_model(inputs)
def is_coordinator(self):
return self._distributed.rank() == 0
@register_predictor([MODEL_LLM])
class LlmPredictor(Predictor):
def _predict_on_inputs(self, inputs: dict) -> dict:
return self.dist_model.generate(inputs)
class LlmFineTunePredictor(Predictor):
def batch_evaluation(self, dataset, collect_predictions=False, collect_logits=False, dataset_name=None):
"""Batch evaluate model on dataset.
Params:
dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
collect_predictions: Return model predictions.
collect_logits: Return model logits and final layer activations.
Returns:
Tuple of dictionaries of (metrics, predictions, input/target/output dictionary). The keys of metrics are
determined by the metrics in the model config. The keys of the predictions dictionary depend on which values
are requested by the caller: collect_predictions, collect_logits. The keys of the input/target/output
dictionary are "inputs", "targets", and "outputs". The values of each of these keys are dictionaries of
feature names to lists of tensors. The tensors are the inputs, targets, and outputs for each batch.
"""
prev_model_training_mode = self.dist_model.training # store previous model training mode
self.dist_model.eval() # set model to eval mode
example_inputs = defaultdict(list)
example_targets = defaultdict(list)
example_outputs = defaultdict(list)
with torch.no_grad():
with dataset.initialize_batcher(
self._batch_size, should_shuffle=False, distributed=self._distributed
) as batcher:
progress_bar_config = {
"desc": "Evaluation" if dataset_name is None else f"Evaluation {dataset_name: <5.5}",
"total": batcher.steps_per_epoch,
"file": sys.stdout,
"disable": is_progressbar_disabled(),
"position": 0, # Necessary to disable extra new line artifacts in training logs.
}
progress_bar = LudwigProgressBar(self.report_tqdm_to_ray, progress_bar_config, self.is_coordinator())
predictions = defaultdict(list)
eval_steps = (
self.dist_model.config_obj.trainer.eval_steps
if hasattr(self.dist_model, "config_obj")
and hasattr(self.dist_model.config_obj.trainer, "eval_steps")
else None
)
eval_steps_counter = 0
while not batcher.last_batch():
if eval_steps and eval_steps_counter >= eval_steps:
logger.info(f"Reached evaluation step {eval_steps}. Ending evaluation.")
break
batch = batcher.next_batch()
logger.debug(
f"evaluation for {dataset_name}: obtained next batch "
f"memory used: {psutil.Process(os.getpid()).memory_info()[0] / 1e6:0.2f}MB"
)
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(
self.device
)
for i_feat in self.model.input_features.values()
}
targets = {
o_feat.feature_name: torch.from_numpy(np.array(batch[o_feat.proc_column], copy=True)).to(
self.device
)
for o_feat in self.model.output_features.values()
}
outputs = self._predict_on_inputs((inputs, targets))
preds = self.model.outputs_to_predictions(outputs)
for key in inputs:
example_inputs[key].extend(inputs[key])
for key in targets:
example_targets[key].extend(targets[key])
for key in preds:
example_outputs[key].extend(preds[key]["predictions"])
# Need to pass through a custom fine-tune metric function because we need to transform
# the targets into the right format for loss calculation (requires padding with -100s to the left)
# and other tensor alignment.
self.model.update_metrics_finetune_llm(targets, preds)
# accumulate predictions from batch for each output feature
if collect_predictions:
self._accumulate_preds(
preds, predictions, exclude_pred_set={LAST_HIDDEN} if collect_logits else EXCLUDE_PRED_SET
)
progress_bar.update(1)
eval_steps_counter += 1
if self.is_coordinator():
logger.debug(
f"evaluation for {dataset_name}: completed batch {progress_bar.total_steps} "
f"memory used: {psutil.Process(os.getpid()).memory_info()[0] / 1e6:0.2f}MB"
)
progress_bar.close()
# consolidate predictions from each batch to a single tensor
if collect_predictions:
for key, pred_value_list in predictions.items():
predictions[key] = torch.cat(pred_value_list, dim=0).detach().cpu().numpy()
metrics = self.model.get_metrics()
self.model.reset_metrics()
input_target_output_dict = {
"inputs": example_inputs,
"targets": example_targets,
"outputs": example_outputs,
}
self.dist_model.train(prev_model_training_mode) # Restores previous model training mode.
return metrics, from_numpy_dataset(predictions), input_target_output_dict
def calculate_overall_stats(output_features, predictions, dataset, training_set_metadata):
overall_stats = {}
for of_name, output_feature in output_features.items():
feature_metadata = training_set_metadata[output_feature.feature_name]
feature_metadata.update(training_set_metadata[output_feature.feature_name])
feature_df = predictions.loc[:, predictions.columns.str.startswith(of_name)]
feature_df = feature_df.rename(columns=lambda c: c[len(of_name) + 1 :])
target = dataset.loc[:, output_feature.proc_column]
if not isinstance(feature_df, pd.DataFrame):
logger.warning(
"Full computation of stats only supported for pandas dataframes. "
f"Sampling the first {STATS_SAMPLE_SIZE} rows of the feature and target dataframes for computing overall stats."
)
feature_df = feature_df.head(n=STATS_SAMPLE_SIZE, npartitions=-1, compute=True)
target = target.head(n=STATS_SAMPLE_SIZE, npartitions=-1, compute=True)
overall_stats[of_name] = output_feature.calculate_overall_stats(
feature_df, # predictions
target,
feature_metadata, # output feature metadata
)
return overall_stats
def save_prediction_outputs(
postprocessed_output,
output_features,
output_directory,
backend,
):
backend.df_engine.write_predictions(
postprocessed_output, os.path.join(output_directory, PREDICTIONS_PARQUET_FILE_NAME)
)
if not backend.df_engine.partitioned:
# csv can only be written out for unpartitioned df format (i.e., pandas)
postprocessed_dict = convert_to_dict(postprocessed_output, output_features)
csv_filename = os.path.join(output_directory, "{}_{}.csv")
for output_field, outputs in postprocessed_dict.items():
for output_name, values in outputs.items():
save_csv(csv_filename.format(output_field, make_safe_filename(output_name)), values)
def save_evaluation_stats(test_stats, output_directory):
test_stats_fn = os.path.join(output_directory, TEST_STATISTICS_FILE_NAME)
save_json(test_stats_fn, test_stats)
def print_evaluation_stats(test_stats):
for output_field, result in test_stats.items():
if output_field != COMBINED or (output_field == COMBINED and len(test_stats) > 2):
logger.info(f"\n===== {output_field} =====")
for metric in sorted(list(result)):
if metric not in SKIP_EVAL_METRICS:
value = result[metric]
if isinstance(value, OrderedDict):
value_repr = repr_ordered_dict(value)
else:
value_repr = pformat(result[metric], indent=2)
logger.info(f"{metric}: {value_repr}")
def get_output_columns(output_features, include_logits: bool = False):
output_columns = []
for of_name, feature in output_features.items():
for pred in feature.get_prediction_set():
if pred not in EXCLUDE_PRED_SET or (pred == LOGITS and include_logits):
output_columns.append(f"{of_name}_{pred}")
return output_columns
================================================
FILE: ludwig/models/registry.py
================================================
import logging
from ludwig.constants import MODEL_ECD, MODEL_LLM
from ludwig.models.ecd import ECD
from ludwig.models.llm import LLM
logger = logging.getLogger(__name__)
model_type_registry = {
MODEL_ECD: ECD,
MODEL_LLM: LLM,
}
================================================
FILE: ludwig/models/retrieval.py
================================================
import hashlib
import json
import os
from abc import ABC, abstractmethod
from collections.abc import Callable
from typing import Any, TYPE_CHECKING
import numpy as np
import pandas as pd
from tqdm import tqdm
from ludwig.vector_index import FAISS, get_vector_index_cls
from ludwig.vector_index.base import VectorIndex
if TYPE_CHECKING:
from sentence_transformers import SentenceTransformer
from ludwig.backend.base import Backend
from ludwig.utils.batch_size_tuner import BatchSizeEvaluator
from ludwig.utils.torch_utils import get_torch_device
def df_checksum(df: pd.DataFrame) -> str:
return hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()
def df_to_row_strs(df: pd.DataFrame) -> list[str]:
rows = df.to_dict(orient="records")
row_strs = [json.dumps(r) for r in rows]
return row_strs
class RetrievalModel(ABC):
@abstractmethod
def create_dataset_index(self, df: pd.DataFrame, backend: "Backend", columns_to_index: list[str] | None = None):
"""Creates an index for the dataset.
If `columns_to_index` is None, all columns are indexed. Otherwise, only the columns in `columns_to_index` are
used for indexing, but all columns in `df` are returned in the search results.
"""
@abstractmethod
def search(
self, df, backend: "Backend", k: int = 10, return_data: bool = False
) -> list[int] | list[dict[str, Any]]:
"""Retrieve the top k results for the given query.
If `return_data` is True, returns the data associated with the indices. Otherwise, returns the indices.
"""
@abstractmethod
def save_index(self, name: str, cache_directory: str):
"""Saves the index to the cache directory."""
@abstractmethod
def load_index(self, name: str, cache_directory: str):
"""Loads the index from the cache directory."""
class RandomRetrieval(RetrievalModel):
"""Random retrieval model.
Gets k random indices from the dataset regardless of the query.
"""
def __init__(self, **kwargs):
self.index = None
self.index_data = None
def create_dataset_index(self, df: pd.DataFrame, backend: "Backend", columns_to_index: list[str] | None = None):
self.index = np.array(range(len(df)))
self.index_data = df
def search(
self, df, backend: "Backend", k: int = 10, return_data: bool = False
) -> list[int] | list[dict[str, Any]]:
results = []
for _ in tqdm(range(len(df))):
indices = np.random.choice(self.index, k, replace=False)
if return_data:
result = self.index_data.iloc[indices].to_dict(orient="records")
else:
result = indices
results.append(result)
return results
def save_index(self, name: str, cache_directory: str):
index_file_path = os.path.join(cache_directory, name + ".index")
# open file to prevent using the .npy extension
# https://numpy.org/doc/stable/reference/generated/numpy.save.html
with open(index_file_path, "wb") as f:
np.save(f, self.index)
index_data_file_path = os.path.join(cache_directory, name + "_data.csv")
self.index_data.to_csv(index_data_file_path, index=False)
def load_index(self, name: str, cache_directory: str):
index_file_path = os.path.join(cache_directory, name + ".index")
self.index = np.load(index_file_path)
index_data_file_path = os.path.join(cache_directory, name + "_data.csv")
self.index_data = pd.read_csv(index_data_file_path)
class SemanticRetrieval(RetrievalModel):
"""Semantic retrieval model.
Uses a sentence transformer model to encode the dataset and retrieve the top k most similar results to the query.
"""
def __init__(self, model_name, **kwargs):
self.model_name = model_name
self.model = get_semantic_retrieval_model(self.model_name)
self.index: VectorIndex = None
self.index_data: pd.DataFrame = None
# best batch size computed during the encoding step
self.best_batch_size = None
def create_dataset_index(self, df: pd.DataFrame, backend: "Backend", columns_to_index: list[str] | None = None):
if columns_to_index is None:
columns_to_index = df.columns
df_to_index = df[columns_to_index]
row_strs = df_to_row_strs(df_to_index)
embeddings = self._encode(row_strs, backend)
self.index = get_vector_index_cls(FAISS).from_embeddings(embeddings)
# Save the entire df so we can return the full row when searching
self.index_data = df
def _encode(self, row_strs: list[str], backend: "Backend") -> np.ndarray:
# only do this step once
if self.best_batch_size is None:
self.best_batch_size = backend.tune_batch_size(
create_semantic_retrieval_model_evaluator(self.model, row_strs), len(row_strs)
)
transform_fn = create_semantic_retrieval_model_fn(self.model, self.best_batch_size)
df = backend.df_engine.from_pandas(pd.DataFrame({"data": row_strs}))
df = backend.batch_transform(df, self.best_batch_size, transform_fn)
df = backend.df_engine.compute(df)
embeddings = np.stack(df["data"].values).astype(np.float32)
return embeddings
def search(
self, df: pd.DataFrame, backend: "Backend", k: int = 10, return_data: bool = False
) -> list[int] | list[dict[str, Any]]:
row_strs = df_to_row_strs(df)
query_vectors = self._encode(row_strs, backend)
results = []
# TODO(geoffrey): figure out why self.index.search segfaults with larger batch sizes
for query_vector in tqdm(query_vectors, total=query_vectors.shape[0]):
indices = self.index.search(query_vector.reshape(1, -1), k)
if return_data:
result = self.index_data.iloc[indices].to_dict(orient="records")
else:
result = indices
results.append(result)
return results
def save_index(self, name: str, cache_directory: str):
index_file_path = os.path.join(cache_directory, name + ".index")
self.index.save(index_file_path)
index_data_file_path = os.path.join(cache_directory, name + "_data.csv")
self.index_data.to_csv(index_data_file_path, index=False)
def load_index(self, name: str, cache_directory: str):
index_file_path = os.path.join(cache_directory, name + ".index")
self.index = get_vector_index_cls(FAISS).from_path(index_file_path)
index_data_file_path = os.path.join(cache_directory, name + "_data.csv")
self.index_data = pd.read_csv(index_data_file_path)
def create_semantic_retrieval_model_evaluator(
model: "SentenceTransformer", samples: list[str]
) -> type[BatchSizeEvaluator]:
class _RetrievalModelEvaluator(BatchSizeEvaluator):
def __init__(self):
self.model = model.to(get_torch_device())
self.samples = samples
def step(self, batch_size: int, global_max_sequence_length: int | None = None):
self.model.encode(self.samples[:batch_size], batch_size=batch_size, show_progress_bar=False)
return _RetrievalModelEvaluator
def create_semantic_retrieval_model_fn(
model: "SentenceTransformer", batch_size: int
) -> type[Callable[[pd.DataFrame], pd.DataFrame]]:
class _RetrievalModelFn:
def __init__(self):
self.model = model.to(get_torch_device())
self.batch_size = batch_size
def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
row_strs = df["data"].tolist()
result = self.model.encode(row_strs, batch_size=self.batch_size, show_progress_bar=False)
df["data"] = result.tolist()
return df
return _RetrievalModelFn
def get_semantic_retrieval_model(model_name: str) -> "SentenceTransformer":
from sentence_transformers import SentenceTransformer
return SentenceTransformer(model_name, device=get_torch_device())
def get_retrieval_model(type: str, **kwargs) -> RetrievalModel:
if type == "random":
return RandomRetrieval(**kwargs)
elif type == "semantic":
return SemanticRetrieval(**kwargs)
else:
raise ValueError(f"Unsupported retrieval model type: {type}")
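# Illustrative sketch only, not part of Ludwig: the per-query loop in `search`
# above can be pictured with a brute-force top-k lookup. The helper names
# (`cosine`, `top_k`) and the choice of cosine similarity are assumptions made
# for this example; the real index is obtained via `get_vector_index_cls(FAISS)`
# and may rank results differently.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index_vectors, k):
    # Rank stored rows by similarity to the query, highest first.
    ranked = sorted(
        range(len(index_vectors)),
        key=lambda i: cosine(query, index_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


index_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[1.0, 0.1], [0.1, 1.0]]

# One lookup per query vector, mirroring the one-at-a-time loop in `search`.
results = [top_k(query, index_vectors, k=2) for query in queries]
```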
================================================
FILE: ludwig/modules/__init__.py
================================================
================================================
FILE: ludwig/modules/attention_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from torch import nn
from torch.nn import functional as F
from ludwig.utils.torch_utils import get_activation, LudwigModule
logger = logging.getLogger(__name__)
class FeedForwardAttentionReducer(LudwigModule):
def __init__(self, input_size, hidden_size=256, activation="tanh"):
super().__init__()
self.fc_layer1 = nn.Linear(input_size, hidden_size)
self.fc_layer1_activation = get_activation(activation)
self.fc_layer2 = nn.Linear(hidden_size, 1, bias=False)
self.input_shape_var = None
self.output_shape_var = None
def forward(self, inputs, mask=None):
# current_inputs shape [b, s, h]
self.input_shape_var = inputs.size()[1:]
hidden = self.fc_layer1(inputs) # [b, s, h']
hidden = self.fc_layer1_activation(hidden)
hidden = self.fc_layer2(hidden) # [b, s, 1]
attention = F.softmax(hidden, dim=1)
gated_inputs = torch.sum(attention * inputs, dim=1)
self.output_shape_var = gated_inputs.size()[1:]
return gated_inputs # [b, h]
@property
def input_shape(self) -> torch.Size:
return self.input_shape_var
@property
def output_shape(self) -> torch.Size:
return self.output_shape_var
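# Illustrative sketch only, not part of Ludwig: the reduction performed by
# FeedForwardAttentionReducer.forward is a softmax over the sequence axis
# followed by a weighted sum of the inputs. The pure-Python helpers below
# (`softmax`, `attention_reduce`) are hypothetical names, and the per-position
# scores are passed in directly instead of being produced by the two linear
# layers.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of scalars.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_reduce(sequence, scores):
    # sequence: s vectors of size h; scores: one scalar per position.
    weights = softmax(scores)
    hidden_size = len(sequence[0])
    return [
        sum(w * vec[j] for w, vec in zip(weights, sequence))
        for j in range(hidden_size)
    ]


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = attention_reduce(sequence, [0.0, 0.0, 0.0])   # equal weights: plain mean
peaked = attention_reduce(sequence, [100.0, 0.0, 0.0])  # nearly all weight on position 0
```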
class MultiHeadSelfAttention(LudwigModule):
def __init__(self, input_size, hidden_size, num_heads=8):
super().__init__()
self.embedding_size = hidden_size
self.num_heads = num_heads
if hidden_size % num_heads != 0:
raise ValueError(
f"When using multi-head attention, `hidden_size` ({hidden_size}), should be divisible by "
f"`num_heads` ({num_heads}). Please update the `transformer` section of the model config."
)
self.projection_dim = hidden_size // num_heads
self.query_dense = nn.Linear(input_size, hidden_size)
self.key_dense = nn.Linear(input_size, hidden_size)
self.value_dense = nn.Linear(input_size, hidden_size)
self.combine_heads = nn.Linear(hidden_size, hidden_size)
def separate_heads(self, inputs, batch_size):
inputs = torch.reshape(inputs, (batch_size, -1, self.num_heads, self.projection_dim))
return torch.permute(inputs, (0, 2, 1, 3))
def forward(self, inputs: torch.Tensor, mask=None):
# inputs.shape = [batch_size, seq_len, embedding_dim]
batch_size = inputs.shape[0]
query = self.query_dense(inputs) # (batch_size, seq_len, h)
key = self.key_dense(inputs) # (batch_size, seq_len, h)
value = self.value_dense(inputs) # (batch_size, seq_len, h)
query = self.separate_heads(query, batch_size) # (batch_size, num_heads, seq_len, projection_dim)
key = self.separate_heads(key, batch_size) # (batch_size, num_heads, seq_len, projection_dim)
value = self.separate_heads(value, batch_size) # (batch_size, num_heads, seq_len, projection_dim)
attn_mask = mask
outputs = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
outputs = torch.permute(outputs, (0, 2, 1, 3)) # (batch_size, seq_len, num_heads, projection_dim)
concat_outputs = torch.reshape(outputs, (batch_size, -1, self.embedding_size)) # (batch_size, seq_len, h)
projected_outputs = self.combine_heads(concat_outputs) # (batch_size, seq_len, h)
return projected_outputs
@property
def output_shape(self):
return torch.Size([self.embedding_size])
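# Illustrative sketch only, not part of Ludwig: the shape bookkeeping behind
# `separate_heads` and the final reshape in MultiHeadSelfAttention.forward.
# `split_heads_shape` and `merge_heads_shape` are hypothetical helpers that
# work on plain tuples instead of tensors.

```python
def split_heads_shape(batch_size, seq_len, hidden_size, num_heads):
    # hidden_size must divide evenly so every head gets the same width.
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) should be divisible by num_heads ({num_heads})"
        )
    projection_dim = hidden_size // num_heads
    # reshape to (b, s, heads, d), then permute to (b, heads, s, d)
    return (batch_size, num_heads, seq_len, projection_dim)


def merge_heads_shape(shape):
    # permute back to (b, s, heads, d), then flatten the last two axes
    batch_size, num_heads, seq_len, projection_dim = shape
    return (batch_size, seq_len, num_heads * projection_dim)


per_head = split_heads_shape(2, 10, 256, 8)
merged = merge_heads_shape(per_head)
```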
class TransformerBlock(LudwigModule):
def __init__(
self,
input_size: int,
max_sequence_length: int,
hidden_size: int,
num_heads: int,
output_size: int,
dropout: float = 0.1,
):
super().__init__()
self.input_size = input_size
self.max_sequence_length = max_sequence_length
self.hidden_size = hidden_size
self.self_attention = MultiHeadSelfAttention(input_size, hidden_size, num_heads=num_heads)
self.dropout1 = nn.Dropout(dropout)
self.layernorm1 = nn.LayerNorm(hidden_size, eps=1e-6)
self.fully_connected = nn.Sequential(
nn.Linear(input_size, output_size), get_activation("relu"), nn.Linear(output_size, hidden_size)
)
self.dropout2 = nn.Dropout(dropout)
self.layernorm2 = nn.LayerNorm(hidden_size, eps=1e-6)
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length, self.input_size])
def forward(self, inputs, mask=None):
# inputs [b, s, h]
attn_output = self.self_attention(inputs) # [b, s, h]
attn_output = self.dropout1(attn_output) # [b, s, h]
ln1_output = self.layernorm1(inputs + attn_output) # [b, s, h]
fc_output = self.fully_connected(ln1_output) # [b, s, h]
fc_output = self.dropout2(fc_output) # [b, s, h]
return self.layernorm2(ln1_output + fc_output) # [b, s, h]
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length, self.hidden_size])
class TransformerStack(LudwigModule):
def __init__(
self,
input_size: int,
max_sequence_length: int,
hidden_size: int = 256,
num_heads: int = 8,
output_size: int = 256,
num_layers: int = 1,
dropout: float = 0.1,
**kwargs,
):
super().__init__()
self.supports_masking = True
self.max_sequence_length = max_sequence_length
self.input_size = input_size
self.hidden_size = hidden_size
self.layers = nn.ModuleList()
prior_input_size = input_size
for i in range(num_layers):
layer = TransformerBlock(
input_size=prior_input_size,
max_sequence_length=max_sequence_length,
hidden_size=hidden_size,
num_heads=num_heads,
output_size=output_size,
dropout=dropout,
)
self.layers.append(layer)
prior_input_size = self.layers[i].output_shape[-1]
for layer in self.layers:
logger.debug(f" {layer._get_name()}")
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length, self.input_size])
def forward(self, inputs, mask=None):
hidden = inputs
for layer in self.layers:
hidden = layer(hidden, mask=mask)
return hidden
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length, self.hidden_size])
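# Illustrative sketch only, not part of Ludwig: each TransformerBlock above
# applies a sublayer, adds the result to its input, and layer-normalizes the
# sum, and TransformerStack chains such blocks. The pure-Python version below
# works on a single vector, omits dropout and LayerNorm's learnable scale and
# shift, and uses hypothetical stand-in sublayers.

```python
import math


def layer_norm(vec, eps=1e-6):
    # Normalize to zero mean and unit (biased) variance.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_block(vec, sublayer):
    # Residual connection followed by normalization.
    return layer_norm([a + b for a, b in zip(vec, sublayer(vec))])


def run_stack(vec, sublayers):
    hidden = vec
    for sublayer in sublayers:
        hidden = residual_block(hidden, sublayer)
    return hidden


def halve(vec):
    return [0.5 * v for v in vec]


def negate_reversed(vec):
    return [-v for v in reversed(vec)]


out = run_stack([1.0, 2.0, 3.0, 4.0], [halve, negate_reversed])
```

# The residual sum only works when the sublayer output has the same width as
# its input, which is why the sketch keeps the vector length fixed throughout.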
================================================
FILE: ludwig/modules/convolutional_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from functools import partial
from typing import Any
import torch
import torch.nn as nn
from ludwig.utils.image_utils import get_img_output_shape
from ludwig.utils.torch_utils import get_activation, LudwigModule
logger = logging.getLogger(__name__)
class Conv1DLayer(LudwigModule):
def __init__(
self,
in_channels=1,
out_channels=256,
max_sequence_length=None,
kernel_size=3,
strides=1,
padding="same",
dilation=1,
groups=1,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
norm=None,
norm_params=None,
activation="relu",
dropout=0,
pool_function="max",
pool_size=2,
pool_strides=None,
pool_padding="valid",
):
super().__init__()
self.in_channels = in_channels
self.out_channels = out_channels
self.max_sequence_length = max_sequence_length
self.kernel_size = kernel_size
self.stride = strides
self.padding = padding
self.dilation = dilation
self.groups = groups
self.pool_size = pool_size
if pool_strides is None:
self.pool_strides = pool_size
else:
self.pool_strides = pool_strides
if pool_padding == "same" and pool_size is not None:
self.pool_padding = (self.pool_size - 1) // 2
else:
self.pool_padding = 0
self.layers = nn.ModuleList()
self.layers.append(
nn.Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=(kernel_size,),
stride=(strides,),
padding=padding,
dilation=(dilation,),
)
)
if norm and norm_params is None:
norm_params = {}
if norm == "batch":
self.layers.append(nn.BatchNorm1d(num_features=out_channels, **norm_params))
elif norm == "layer":
self.layers.append(nn.LayerNorm(normalized_shape=[out_channels, self.max_sequence_length], **norm_params))
self.layers.append(get_activation(activation))
if dropout > 0:
self.layers.append(nn.Dropout(dropout))
if pool_size is not None:
pool = nn.MaxPool1d
if pool_function in {"average", "avg", "mean"}:
pool = nn.AvgPool1d
self.layers.append(pool(kernel_size=self.pool_size, stride=self.pool_strides, padding=self.pool_padding))
for layer in self.layers:
logger.debug(f" {layer._get_name()}")
@property
def input_shape(self):
"""Returns the size of the input tensor without the batch dimension."""
return torch.Size([self.max_sequence_length, self.in_channels])
def forward(self, inputs, training=None, mask=None):
# inputs: [batch_size, seq_size, in_channels]
# in Torch nomenclature (N, L, C)
hidden = inputs
# put in torch compatible form [batch_size, in_channels, seq_size]
hidden = hidden.transpose(1, 2)
for layer in self.layers:
hidden = layer(hidden)
# revert back to normal form [batch_size, seq_size, out_channels]
hidden = hidden.transpose(1, 2)
return hidden # (batch_size, seq_size, out_channels)
class Conv1DStack(LudwigModule):
def __init__(
self,
in_channels=1,
max_sequence_length=None,
layers=None,
num_layers=None,
default_num_filters=256,
default_filter_size=3,
default_strides=1,
default_padding="same",
default_dilation_rate=1,
default_use_bias=True,
default_weights_initializer="xavier_uniform",
default_bias_initializer="zeros",
default_norm=None,
default_norm_params=None,
default_activation="relu",
default_dropout=0,
default_pool_function="max",
default_pool_size=2,
default_pool_strides=None,
default_pool_padding="same",
**kwargs,
):
super().__init__()
self.max_sequence_length = max_sequence_length
self.in_channels = in_channels
if layers is None:
if num_layers is None:
self.layers = [
{"filter_size": 7, "pool_size": 3},
{"filter_size": 7, "pool_size": 3},
{"filter_size": 3, "pool_size": None},
{"filter_size": 3, "pool_size": None},
{"filter_size": 3, "pool_size": None},
{"filter_size": 3, "pool_size": 3},
]
else:
self.layers = []
for i in range(num_layers):
self.layers.append(
{
"filter_size": default_filter_size,
"num_filters": default_num_filters,
"pool_size": default_pool_size,
"pool_strides": default_pool_strides,
}
)
else:
self.layers = layers
for layer in self.layers:
if "num_filters" not in layer:
layer["num_filters"] = default_num_filters
if "filter_size" not in layer:
layer["filter_size"] = default_filter_size
if "strides" not in layer:
layer["strides"] = default_strides
if "padding" not in layer:
layer["padding"] = default_padding
if "dilation_rate" not in layer:
layer["dilation_rate"] = default_dilation_rate
if "use_bias" not in layer:
layer["use_bias"] = default_use_bias
if "weights_initializer" not in layer:
layer["weights_initializer"] = default_weights_initializer
if "bias_initializer" not in layer:
layer["bias_initializer"] = default_bias_initializer
if "norm" not in layer:
layer["norm"] = default_norm
if "norm_params" not in layer:
layer["norm_params"] = default_norm_params
if "activation" not in layer:
layer["activation"] = default_activation
if "dropout" not in layer:
layer["dropout"] = default_dropout
if "pool_function" not in layer:
layer["pool_function"] = default_pool_function
if "pool_size" not in layer:
layer["pool_size"] = default_pool_size
if "pool_strides" not in layer:
layer["pool_strides"] = default_pool_strides
if "pool_padding" not in layer:
layer["pool_padding"] = default_pool_padding
self.stack = nn.ModuleList()
prior_layer_channels = in_channels
l_in = self.max_sequence_length # torch L_in
for i, layer in enumerate(self.layers):
logger.debug(f" stack layer {i}")
self.stack.append(
Conv1DLayer(
in_channels=prior_layer_channels,
out_channels=layer["num_filters"],
max_sequence_length=l_in,
kernel_size=layer["filter_size"],
strides=layer["strides"],
padding=layer["padding"],
dilation=layer["dilation_rate"],
use_bias=layer["use_bias"],
weights_initializer=layer["weights_initializer"],
bias_initializer=layer["bias_initializer"],
norm=layer["norm"],
norm_params=layer["norm_params"],
activation=layer["activation"],
dropout=layer["dropout"],
pool_function=layer["pool_function"],
pool_size=layer["pool_size"],
pool_strides=layer["pool_strides"],
pool_padding=layer["pool_padding"],
)
)
# retrieve number of channels from prior layer
input_shape = self.stack[i].input_shape
output_shape = self.stack[i].output_shape
logger.debug(f"{self.__class__.__name__}: input_shape {input_shape}, output shape {output_shape}")
# pass along shape for the input to the next layer
l_in, prior_layer_channels = output_shape
@property
def input_shape(self):
"""Returns the size of the input tensor without the batch dimension."""
return torch.Size([self.max_sequence_length, self.in_channels])
def forward(self, inputs, mask=None):
hidden = inputs
# todo: enumerate for debugging, remove after testing
for i, layer in enumerate(self.stack):
hidden = layer(hidden)
if hidden.shape[1] == 0:
raise ValueError(
"The output of the conv stack has the second dimension "
"(length of the sequence) equal to 0. "
"This means that the combination of filter_size, padding, "
"stride, pool_size, pool_padding and pool_stride reduces "
"the sequence length more than is possible. "
'Try using "same" padding and reducing or eliminating stride '
"and pool."
)
return hidden
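# Illustrative sketch only, not part of Ludwig: how pooling shrinks the
# sequence length through a stack like Conv1DStack, and how it can reach the
# zero-length case guarded against in `forward`. The arithmetic follows the
# standard PyTorch Conv1d/MaxPool1d output-length formula. The helper names
# are hypothetical, and the sketch assumes the convolutions use "same"
# padding with stride 1 (so only pooling changes the length) and that each
# pool stride equals its pool size.

```python
def out_len(l_in, kernel_size, stride=1, padding=0, dilation=1):
    # floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def pooled_lengths(l_in, pool_sizes, pool_padding="same"):
    # Track the sequence length after each layer; None means "no pooling".
    lengths = [l_in]
    for pool_size in pool_sizes:
        if pool_size is not None:
            pad = (pool_size - 1) // 2 if pool_padding == "same" else 0
            l_in = max(0, out_len(l_in, pool_size, stride=pool_size, padding=pad))
        lengths.append(l_in)
    return lengths


# Pool sizes of the six default layers used when neither `layers` nor
# `num_layers` is given.
default_lengths = pooled_lengths(100, [3, 3, None, None, None, 3])

# A short sequence with unpadded pooling collapses to length 0.
collapsed = pooled_lengths(5, [3, 3], pool_padding="valid")
```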
class ParallelConv1D(LudwigModule):
def __init__(
self,
in_channels=1,
max_sequence_length=None,
layers=None,
default_num_filters=256,
default_filter_size=3,
default_strides=1,
default_padding="same",
default_dilation_rate=1,
default_use_bias=True,
default_weights_initializer="xavier_uniform",
default_bias_initializer="zeros",
default_norm=None,
default_norm_params=None,
default_activation="relu",
default_dropout=0,
default_pool_function="max",
default_pool_size=None,
default_pool_strides=None,
default_pool_padding="valid",
**kwargs,
):
super().__init__()
self.in_channels = in_channels
self.max_sequence_length = max_sequence_length
if layers is None:
self.layers = [{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}]
else:
self.layers = layers
for layer in self.layers:
if "num_filters" not in layer:
layer["num_filters"] = default_num_filters
if "filter_size" not in layer:
layer["filter_size"] = default_filter_size
if "strides" not in layer:
layer["strides"] = default_strides
if "padding" not in layer:
layer["padding"] = default_padding
if "dilation_rate" not in layer:
layer["dilation_rate"] = default_dilation_rate
if "use_bias" not in layer:
layer["use_bias"] = default_use_bias
if "weights_initializer" not in layer:
layer["weights_initializer"] = default_weights_initializer
if "bias_initializer" not in layer:
layer["bias_initializer"] = default_bias_initializer
if "norm" not in layer:
layer["norm"] = default_norm
if "norm_params" not in layer:
layer["norm_params"] = default_norm_params
if "activation" not in layer:
layer["activation"] = default_activation
if "dropout" not in layer:
layer["dropout"] = default_dropout
if "pool_function" not in layer:
layer["pool_function"] = default_pool_function
if "pool_size" not in layer:
layer["pool_size"] = default_pool_size
if "pool_strides" not in layer:
layer["pool_strides"] = default_pool_strides
if "pool_padding" not in layer:
layer["pool_padding"] = default_pool_padding
self.parallel_layers = nn.ModuleList()
for i, layer in enumerate(self.layers):
logger.debug(f" parallel layer {i}")
self.parallel_layers.append(
Conv1DLayer(
in_channels=self.in_channels,
out_channels=layer["num_filters"],
max_sequence_length=self.max_sequence_length,
kernel_size=layer["filter_size"],
strides=layer["strides"],
padding=layer["padding"],
dilation=layer["dilation_rate"],
use_bias=layer["use_bias"],
weights_initializer=layer["weights_initializer"],
bias_initializer=layer["bias_initializer"],
norm=layer["norm"],
norm_params=layer["norm_params"],
activation=layer["activation"],
dropout=layer["dropout"],
pool_function=layer["pool_function"],
pool_size=layer["pool_size"],
pool_strides=layer["pool_strides"],
pool_padding=layer["pool_padding"],
)
)
logger.debug(
f"{self.__class__.__name__} layer {i}, input shape "
f"{self.parallel_layers[i].input_shape}, output shape "
f"{self.parallel_layers[i].output_shape}"
)
@property
def input_shape(self) -> torch.Size:
"""Returns the size of the input tensor without the batch dimension."""
return torch.Size([self.max_sequence_length, self.in_channels])
def forward(self, inputs, mask=None):
# inputs: [batch_size, seq_size, in_channels]
hidden = inputs
hiddens = []
for layer in self.parallel_layers:
hiddens.append(layer(hidden))
hidden = torch.cat(hiddens, 2)
if hidden.shape[1] == 0:
raise ValueError(
"The output of the conv stack has the second dimension "
"(length of the sequence) equal to 0. "
"This means that the combination of filter_size, padding, "
"stride, pool_size, pool_padding and pool_stride reduces "
"the sequence length more than is possible. "
'Try using "same" padding and reducing or eliminating stride '
"and pool."
)
# (batch_size, seq_size, len(parallel_layers) * out_channels)
return hidden
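# Illustrative sketch only, not part of Ludwig: ParallelConv1D runs every
# branch on the same input and concatenates along the channel axis, so the
# output width is the sum of the branches' filter counts.
# `parallel_output_shape` is a hypothetical helper that assumes every branch
# preserves the sequence length ("same" padding, stride 1, no pooling, which
# matches the class defaults).

```python
def parallel_output_shape(seq_len, layers, default_num_filters=256):
    # Channels from each branch are concatenated on the last axis.
    channels = sum(layer.get("num_filters", default_num_filters) for layer in layers)
    return (seq_len, channels)


default_branches = [{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}]
shape = parallel_output_shape(20, default_branches)

mixed = parallel_output_shape(20, [{"num_filters": 32}, {"num_filters": 64}, {}])
```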
class ParallelConv1DStack(LudwigModule):
def __init__(
self,
in_channels=None,
stacked_layers=None,
max_sequence_length=None,
default_num_filters=64,
default_filter_size=3,
default_strides=1,
default_padding="same",
default_dilation_rate=1,
default_use_bias=True,
default_weights_initializer="xavier_uniform",
default_bias_initializer="zeros",
default_norm=None,
default_norm_params=None,
default_activation="relu",
default_dropout=0,
default_pool_function="max",
default_pool_size=None,
default_pool_strides=None,
default_pool_padding="valid",
**kwargs,
):
super().__init__()
self.max_sequence_length = max_sequence_length
self.in_channels = in_channels
if stacked_layers is None:
self.stacked_parallel_layers = [
[{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}],
[{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}],
[{"filter_size": 2}, {"filter_size": 3}, {"filter_size": 4}, {"filter_size": 5}],
]
else:
self.stacked_parallel_layers = stacked_layers
for i, parallel_layers in enumerate(self.stacked_parallel_layers):
for j in range(len(parallel_layers)):
layer = parallel_layers[j]
if "num_filters" not in layer:
layer["num_filters"] = default_num_filters
if "filter_size" not in layer:
layer["filter_size"] = default_filter_size
if "strides" not in layer:
layer["strides"] = default_strides
if "padding" not in layer:
layer["padding"] = default_padding
if "dilation_rate" not in layer:
layer["dilation_rate"] = default_dilation_rate
if "use_bias" not in layer:
layer["use_bias"] = default_use_bias
if "weights_initializer" not in layer:
layer["weights_initializer"] = default_weights_initializer
if "bias_initializer" not in layer:
layer["bias_initializer"] = default_bias_initializer
if "norm" not in layer:
layer["norm"] = default_norm
if "norm_params" not in layer:
layer["norm_params"] = default_norm_params
if "activation" not in layer:
layer["activation"] = default_activation
if "dropout" not in layer:
layer["dropout"] = default_dropout
if "pool_function" not in layer:
layer["pool_function"] = default_pool_function
if "pool_size" not in layer:
if i == len(self.stacked_parallel_layers) - 1:
layer["pool_size"] = default_pool_size
else:
layer["pool_size"] = None
if "pool_strides" not in layer:
layer["pool_strides"] = default_pool_strides
if "pool_padding" not in layer:
layer["pool_padding"] = default_pool_padding
self.stack = nn.ModuleList()
num_channels = self.in_channels
sequence_length = self.max_sequence_length
for i, parallel_layers in enumerate(self.stacked_parallel_layers):
logger.debug(f" stack layer {i}")
self.stack.append(ParallelConv1D(num_channels, sequence_length, layers=parallel_layers))
logger.debug(
f"{self.__class__.__name__} layer {i}, input shape "
f"{self.stack[i].input_shape}, output shape "
f"{self.stack[i].output_shape}"
)
# set input specification for the layer
num_channels = self.stack[i].output_shape[1]
sequence_length = self.stack[i].output_shape[0]
@property
def input_shape(self):
"""Returns the size of the input tensor without the batch dimension."""
return torch.Size([self.max_sequence_length, self.in_channels])
def forward(self, inputs, mask=None):
hidden = inputs
for layer in self.stack:
hidden = layer(hidden)
if hidden.shape[2] == 0:
raise ValueError(
"The output of the conv stack has the second dimension "
"(length of the sequence) equal to 0. "
"This means that the combination of filter_size, padding, "
"stride, pool_size, pool_padding and pool_stride reduces "
"the sequence length more than is possible. "
'Try using "same" padding and reducing or eliminating stride '
"and pool."
)
return hidden
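# Illustrative sketch only, not part of Ludwig: the long chains of
# `if "key" not in layer` checks in the constructors above amount to merging
# a dict of defaults under each layer spec. `apply_layer_defaults` is a
# hypothetical helper. Unlike the original code it returns new dicts instead
# of mutating the caller's specs, and like the original it keeps an explicit
# `None` supplied by the caller.

```python
def apply_layer_defaults(layers, defaults):
    # Keys present in the layer spec win over the defaults.
    return [{**defaults, **layer} for layer in layers]


defaults = {"num_filters": 64, "pool_size": 2, "activation": "relu"}
layers = [{"filter_size": 3}, {"num_filters": 128, "pool_size": None}]
merged = apply_layer_defaults(layers, defaults)
```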
class Conv2DLayer(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
in_channels: int,
out_channels: int = 256,
kernel_size: int | tuple[int] = 3,
stride: int | tuple[int] = 1,
padding: int | tuple[int] | str = "valid",
dilation: int | tuple[int] = 1,
groups: int = 1,
use_bias: bool = True,
padding_mode: str = "zeros",
norm: str | None = None,
norm_params: dict[str, Any] | None = None,
activation: str = "relu",
dropout: float = 0,
pool_function: str = "max",
pool_kernel_size: int | tuple[int] | None = None,
pool_stride: int | None = None,
pool_padding: int | tuple[int] = 0,
pool_dilation: int | tuple[int] = 1,
):
super().__init__()
self.layers = torch.nn.ModuleList()
self._input_shape = (in_channels, img_height, img_width)
pool_stride = pool_stride or pool_kernel_size
self.layers.append(
nn.Conv2d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
groups=groups,
bias=use_bias,
padding_mode=padding_mode,
)
)
out_height, out_width = get_img_output_shape(img_height, img_width, kernel_size, stride, padding, dilation)
if norm and norm_params is None:
norm_params = {}
if norm == "batch":
# Batch norm over channels
self.layers.append(nn.BatchNorm2d(num_features=out_channels, **norm_params))
elif norm == "layer":
# Layer norm over image height and width
self.layers.append(nn.LayerNorm(normalized_shape=(out_height, out_width), **norm_params))
self.layers.append(get_activation(activation))
if dropout > 0:
self.layers.append(nn.Dropout(dropout))
if pool_kernel_size is not None:
pool = partial(nn.MaxPool2d, dilation=pool_dilation)
if pool_function in {"average", "avg", "mean"}:
pool = nn.AvgPool2d
self.layers.append(pool(kernel_size=pool_kernel_size, stride=pool_stride, padding=pool_padding))
out_height, out_width = get_img_output_shape(
img_height=out_height,
img_width=out_width,
kernel_size=pool_kernel_size,
stride=pool_stride,
padding=pool_padding,
dilation=pool_dilation,
)
for layer in self.layers:
logger.debug(f" {layer._get_name()}")
self._output_shape = (out_channels, out_height, out_width)
def forward(self, inputs):
hidden = inputs
for layer in self.layers:
hidden = layer(hidden)
return hidden
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
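# Illustrative sketch only, not part of Ludwig: the output-shape bookkeeping
# in Conv2DLayer, where the convolution and the optional pooling step each
# shrink height and width. `out_size` is a hypothetical stand-in for
# `get_img_output_shape`, whose implementation is not shown in this file. It
# uses the standard PyTorch Conv2d/MaxPool2d formula and assumes square
# kernels, integer padding, and a pool stride equal to the pool kernel size.

```python
def out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    # floor((size + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv_then_pool_shape(height, width, out_channels, kernel_size=3, pool_kernel_size=2):
    # Convolution with "valid" padding (no padding) and stride 1.
    height, width = out_size(height, kernel_size), out_size(width, kernel_size)
    # Pooling with stride equal to the pool kernel size.
    height = out_size(height, pool_kernel_size, stride=pool_kernel_size)
    width = out_size(width, pool_kernel_size, stride=pool_kernel_size)
    return (out_channels, height, width)


shape = conv_then_pool_shape(28, 28, out_channels=32)
```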
class Conv2DStack(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
layers: list[dict] | None = None,
num_layers: int | None = None,
first_in_channels: int | None = None,
default_out_channels: int = 256,
default_kernel_size: int | tuple[int] = 3,
default_stride: int | tuple[int] = 1,
default_padding: int | tuple[int] | str = "valid",
default_dilation: int | tuple[int] = 1,
default_groups: int = 1,
default_use_bias: bool = True,
default_padding_mode: str = "zeros",
default_norm: str | None = None,
default_norm_params: dict[str, Any] | None = None,
default_activation: str = "relu",
default_dropout: float = 0,
default_pool_function: str = "max",
default_pool_kernel_size: int | tuple[int] = 2,
default_pool_stride: int | tuple[int] | None = None,
default_pool_padding: int | tuple[int] = 0,
default_pool_dilation: int | tuple[int] = 1,
):
super().__init__()
# Confirm that all inputs are consistent
first_in_channels = self._check_in_channels(first_in_channels, layers)
default_pool_stride = default_pool_stride or default_pool_kernel_size
if layers is not None and num_layers is not None:
raise Warning("Both layers and num_layers are not None. Default to using layers.")
if (
first_in_channels is not None
and layers is not None
and len(layers) > 0
and "in_channels" in layers[0]
and layers[0]["in_channels"] != first_in_channels
):
raise Warning(
"Input channels is set via layers[0]['in_channels'] and first_in_channels. "
"Default to using first_in_channels."
)
self._input_shape = (first_in_channels, img_height, img_width)
if layers is None:
if num_layers is None:
self.layers = [
{"out_channels": 32},
{"out_channels": 64},
]
else:
self.layers = []
for i in range(num_layers):
self.layers.append(
{
"kernel_size": default_kernel_size,
"out_channels": default_out_channels,
"pool_kernel_size": default_pool_kernel_size,
}
)
else:
self.layers = layers
for layer in self.layers:
if "out_channels" not in layer:
layer["out_channels"] = default_out_channels
if "kernel_size" not in layer:
layer["kernel_size"] = default_kernel_size
if "stride" not in layer:
layer["stride"] = default_stride
if "padding" not in layer:
layer["padding"] = default_padding
if "dilation" not in layer:
layer["dilation"] = default_dilation
if "groups" not in layer:
layer["groups"] = default_groups
if "use_bias" not in layer:
layer["use_bias"] = default_use_bias
if "padding_mode" not in layer:
layer["padding_mode"] = default_padding_mode
if "norm" not in layer:
layer["norm"] = default_norm
if "norm_params" not in layer:
layer["norm_params"] = default_norm_params
if "activation" not in layer:
layer["activation"] = default_activation
if "dropout" not in layer:
layer["dropout"] = default_dropout
if "pool_function" not in layer:
layer["pool_function"] = default_pool_function
if "pool_kernel_size" not in layer:
layer["pool_kernel_size"] = default_pool_kernel_size
if "pool_stride" not in layer:
layer["pool_stride"] = default_pool_stride
if "pool_padding" not in layer:
layer["pool_padding"] = default_pool_padding
if "pool_dilation" not in layer:
layer["pool_dilation"] = default_pool_dilation
self.stack = torch.nn.ModuleList()
in_channels = first_in_channels
for i, layer in enumerate(self.layers):
logger.debug(f" stack layer {i}")
self.stack.append(
Conv2DLayer(
img_height=img_height,
img_width=img_width,
in_channels=in_channels,
out_channels=layer["out_channels"],
kernel_size=layer["kernel_size"],
stride=layer["stride"],
padding=layer["padding"],
dilation=layer["dilation"],
groups=layer["groups"],
use_bias=layer["use_bias"],
padding_mode=layer["padding_mode"],
norm=layer["norm"],
norm_params=layer["norm_params"],
activation=layer["activation"],
dropout=layer["dropout"],
pool_function=layer["pool_function"],
pool_kernel_size=layer["pool_kernel_size"],
pool_stride=layer["pool_stride"],
pool_padding=layer["pool_padding"],
pool_dilation=layer["pool_dilation"],
)
)
in_channels, img_height, img_width = self.stack[-1].output_shape
self._output_shape = (in_channels, img_height, img_width)
def forward(self, inputs):
hidden = inputs
for layer in self.stack:
hidden = layer(hidden)
return hidden
def _check_in_channels(self, first_in_channels: int | None, layers: list[dict] | None) -> int:
"""Confirms that in_channels for first layer of the stack exists."""
if first_in_channels is not None:
return first_in_channels
elif layers is not None and len(layers) > 0 and "in_channels" in layers[0]:
return layers[0]["in_channels"]
raise ValueError(
"In_channels for first layer should be specified either via " "`first_in_channels` or `layers` arguments."
)
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
class Conv2DLayerFixedPadding(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
in_channels: int,
out_channels=256,
kernel_size=3,
stride=1,
dilation=1,
groups=1,
use_bias=False,
):
super().__init__()
self.layers = torch.nn.ModuleList()
self._input_shape = (in_channels, img_height, img_width)
padding = "same"
if stride > 1:
padding = (kernel_size - 1) // 2
self.layers.append(
nn.Conv2d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
groups=groups,
bias=use_bias,
)
)
img_height, img_width = get_img_output_shape(
img_height=img_height,
img_width=img_width,
kernel_size=kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
)
for layer in self.layers:
logger.debug(f" {layer._get_name()}")
self._output_shape = (out_channels, img_height, img_width)
def forward(self, inputs):
hidden = inputs
for layer in self.layers:
hidden = layer(hidden)
return hidden
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
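# Illustrative sketch only, not part of Ludwig: the padding rule in
# Conv2DLayerFixedPadding. PyTorch rejects padding="same" for strided
# convolutions, so the class switches to an explicit (kernel_size - 1) // 2
# when stride > 1. The helper names below are hypothetical, and the size
# arithmetic assumes dilation 1 and square kernels.

```python
def fixed_padding(kernel_size, stride):
    # "same" for unit stride, explicit symmetric padding otherwise.
    return (kernel_size - 1) // 2 if stride > 1 else "same"


def fixed_padding_out_size(size, kernel_size, stride):
    padding = fixed_padding(kernel_size, stride)
    if padding == "same":
        return size  # spatial size is preserved
    return (size + 2 * padding - (kernel_size - 1) - 1) // stride + 1


unit_stride = fixed_padding_out_size(224, kernel_size=3, stride=1)
strided_3x3 = fixed_padding_out_size(224, kernel_size=3, stride=2)
strided_7x7 = fixed_padding_out_size(224, kernel_size=7, stride=2)
```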
class ResNetBlock(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
first_in_channels: int,
out_channels: int,
stride: int = 1,
batch_norm_momentum: float = 0.1,
batch_norm_epsilon: float = 0.001,
projection_shortcut: LudwigModule | None = None,
):
"""Resnet blocks used for ResNet34 and smaller.
stride: A single int specifying the stride of the first convolution.
The last convolution will have stride of 1.
"""
super().__init__()
self._input_shape = (first_in_channels, img_height, img_width)
self.conv1 = Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=first_in_channels,
out_channels=out_channels,
kernel_size=3,
stride=stride,
)
in_channels, img_height, img_width = self.conv1.output_shape
self.norm1 = nn.BatchNorm2d(num_features=in_channels, eps=batch_norm_epsilon, momentum=batch_norm_momentum)
self.relu1 = get_activation("relu")
self.conv2 = Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=out_channels,
out_channels=out_channels,
kernel_size=3,
stride=1,
)
self.norm2 = nn.BatchNorm2d(num_features=out_channels, eps=batch_norm_epsilon, momentum=batch_norm_momentum)
self.relu2 = get_activation("relu")
for layer in [self.conv1, self.norm1, self.relu1, self.conv2, self.norm2, self.relu2]:
logger.debug(f" {layer._get_name()}")
self._output_shape = self.conv2.output_shape
self.projection_shortcut = projection_shortcut
if self.projection_shortcut is not None and self.projection_shortcut.output_shape != self._output_shape:
raise ValueError(
f"Output shapes of ResNetBlock and projection_shortcut should "
f"match but are {self._output_shape} and "
f"{self.projection_shortcut.output_shape} respectively."
)
if self.projection_shortcut is None and self._input_shape != self._output_shape:
self.projection_shortcut = Conv2DLayer(
img_height=self._input_shape[1],
img_width=self._input_shape[2],
in_channels=first_in_channels,
out_channels=out_channels,
kernel_size=1,
stride=stride,
)
def forward(self, inputs):
shortcut = inputs
if self.projection_shortcut is not None:
shortcut = self.projection_shortcut(shortcut)
hidden = self.conv1(inputs)
hidden = self.norm1(hidden)
hidden = self.relu1(hidden)
hidden = self.conv2(hidden)
hidden = self.norm2(hidden)
return self.relu2(hidden + shortcut)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
# TODO(shreya): Combine with ResNetBlock by adding a flag.
class ResNetBottleneckBlock(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
first_in_channels: int,
out_channels: int,
stride: int = 1,
batch_norm_momentum: float = 0.1,
batch_norm_epsilon: float = 0.001,
projection_shortcut: LudwigModule | None = None,
):
"""Resnet bottleneck blocks used for ResNet50 and larger.
stride: A single int specifying the stride of the middle convolution.
The first and last convolution will have stride of 1.
"""
super().__init__()
self._input_shape = (first_in_channels, img_height, img_width)
self.conv1 = Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=first_in_channels,
out_channels=out_channels,
kernel_size=1,
stride=1,
)
in_channels, img_height, img_width = self.conv1.output_shape
self.norm1 = nn.BatchNorm2d(num_features=in_channels, eps=batch_norm_epsilon, momentum=batch_norm_momentum)
self.relu1 = get_activation("relu")
self.conv2 = Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=in_channels,
out_channels=out_channels,
kernel_size=3,
stride=stride,
)
in_channels, img_height, img_width = self.conv2.output_shape
self.norm2 = nn.BatchNorm2d(num_features=in_channels, eps=batch_norm_epsilon, momentum=batch_norm_momentum)
self.relu2 = get_activation("relu")
self.conv3 = Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=in_channels,
out_channels=4 * out_channels,
kernel_size=1,
stride=1,
)
self.norm3 = nn.BatchNorm2d(num_features=4 * out_channels, eps=batch_norm_epsilon, momentum=batch_norm_momentum)
self.relu3 = get_activation("relu")
for layer in [
self.conv1,
self.norm1,
self.relu1,
self.conv2,
self.norm2,
self.relu2,
self.conv3,
self.norm3,
self.relu3,
]:
logger.debug(f" {layer._get_name()}")
self._output_shape = self.conv3.output_shape
self.projection_shortcut = projection_shortcut
if self.projection_shortcut is not None and self.projection_shortcut.output_shape != self._output_shape:
raise ValueError(
f"Output shapes of ResNetBottleneckBlock and projection_shortcut should "
f"match but are {self._output_shape} and "
f"{self.projection_shortcut.output_shape} respectively."
)
if self.projection_shortcut is None and self._input_shape != self._output_shape:
self.projection_shortcut = Conv2DLayer(
img_height=self._input_shape[1],
img_width=self._input_shape[2],
in_channels=first_in_channels,
out_channels=4 * out_channels,
kernel_size=1,
stride=stride,
)
def forward(self, inputs):
shortcut = inputs
if self.projection_shortcut is not None:
shortcut = self.projection_shortcut(shortcut)
hidden = self.conv1(inputs)
hidden = self.norm1(hidden)
hidden = self.relu1(hidden)
hidden = self.conv2(hidden)
hidden = self.norm2(hidden)
hidden = self.relu2(hidden)
hidden = self.conv3(hidden)
hidden = self.norm3(hidden)
return self.relu3(hidden + shortcut)
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
class ResNetBlockLayer(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
first_in_channels: int,
out_channels: int,
is_bottleneck: bool,
block_fn: type[ResNetBlock] | type[ResNetBottleneckBlock],
num_blocks: int,
stride: int | tuple[int] = 1,
batch_norm_momentum: float = 0.1,
batch_norm_epsilon: float = 0.001,
):
super().__init__()
self._input_shape = (first_in_channels, img_height, img_width)
# Bottleneck blocks end with 4x the number of channels as they start with
projection_out_channels = out_channels * 4 if is_bottleneck else out_channels
projection_shortcut = Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=first_in_channels,
out_channels=projection_out_channels,
kernel_size=1,
stride=stride,
)
self.layers = torch.nn.ModuleList(
[
block_fn(
img_height,
img_width,
first_in_channels,
out_channels,
stride,
batch_norm_momentum,
batch_norm_epsilon,
projection_shortcut,
)
]
)
in_channels, img_height, img_width = self.layers[-1].output_shape
for _ in range(1, num_blocks):
self.layers.append(
block_fn(
img_height=img_height,
img_width=img_width,
first_in_channels=in_channels,
out_channels=out_channels,
stride=1,
batch_norm_momentum=batch_norm_momentum,
batch_norm_epsilon=batch_norm_epsilon,
)
)
in_channels, img_height, img_width = self.layers[-1].output_shape
for layer in self.layers:
logger.debug(f" {layer._get_name()}")
self._output_shape = (in_channels, img_height, img_width)
def forward(self, inputs):
hidden = inputs
for layer in self.layers:
hidden = layer(hidden)
return hidden
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
class ResNet(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
first_in_channels: int,
out_channels: int,
resnet_size: int = 34,
kernel_size: int | tuple[int] = 7,
conv_stride: int | tuple[int] = 2,
first_pool_kernel_size: int | tuple[int] = 3,
first_pool_stride: int | tuple[int] = 2,
block_sizes: list[int] | None = None,
block_strides: list[int | tuple[int]] | None = None,
batch_norm_momentum: float = 0.1,
batch_norm_epsilon: float = 0.001,
):
"""Creates a model obtaining an image representation.
Implements ResNet v2:
Identity Mappings in Deep Residual Networks
https://arxiv.org/pdf/1603.05027.pdf
by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Jul 2016.
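Args:
img_height: Height of the input image.
img_width: Width of the input image.
first_in_channels: Number of channels of the input image.
out_channels: The number of filters to use for the first block layer
of the model. This number is then doubled for each subsequent block
layer.
resnet_size: A single integer for the size of the ResNet model. Used to
look up the default block_sizes. Bottleneck blocks are used instead of
regular blocks when resnet_size >= 50 or when the given block_sizes sum
to 16 or more.
kernel_size: The kernel size to use for the initial convolution.
conv_stride: Stride size for the initial convolutional layer.
first_pool_kernel_size: Pool size to be used for the first pooling layer.
If None, the first pooling layer is skipped.
first_pool_stride: Stride size for the first pooling layer. Not used
if first_pool_kernel_size is None.
block_sizes: A list containing n values, where n is the number of sets of
block layers desired. Each value should be the number of blocks in the
i-th set. If None, the sizes are looked up from resnet_size.
block_strides: List of integers representing the desired stride size for
each of the sets of block layers. Should be the same length as block_sizes.
If None, the first set uses stride 1 and every later set uses stride 2.
batch_norm_momentum: Momentum passed to every BatchNorm2d layer.
batch_norm_epsilon: Epsilon passed to every BatchNorm2d layer.
Raises:
ValueError: if block_sizes is None and resnet_size is not a supported size.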
"""
super().__init__()
self._input_shape = (first_in_channels, img_height, img_width)
is_bottleneck = self.get_is_bottleneck(resnet_size, block_sizes)
block_class = self.get_block_fn(is_bottleneck)
block_sizes, block_strides = self.get_blocks(resnet_size, block_sizes, block_strides)
self.layers = torch.nn.ModuleList()
self.layers.append(
Conv2DLayerFixedPadding(
img_height=img_height,
img_width=img_width,
in_channels=first_in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=conv_stride,
)
)
in_channels, img_height, img_width = self.layers[-1].output_shape
self.layers.append(
nn.BatchNorm2d(num_features=out_channels, eps=batch_norm_epsilon, momentum=batch_norm_momentum)
)
self.layers.append(get_activation("relu"))
if first_pool_kernel_size:
self.layers.append(nn.MaxPool2d(kernel_size=first_pool_kernel_size, stride=first_pool_stride, padding=1))
img_height, img_width = get_img_output_shape(
img_height=img_height,
img_width=img_width,
kernel_size=first_pool_kernel_size,
stride=first_pool_stride,
padding=1,
dilation=1,
)
for i, num_blocks in enumerate(block_sizes):
self.layers.append(
ResNetBlockLayer(
img_height=img_height,
img_width=img_width,
first_in_channels=in_channels,
out_channels=out_channels,
is_bottleneck=is_bottleneck,
block_fn=block_class,
num_blocks=num_blocks,
stride=block_strides[i],
batch_norm_momentum=batch_norm_momentum,
batch_norm_epsilon=batch_norm_epsilon,
)
)
out_channels *= 2
in_channels, img_height, img_width = self.layers[-1].output_shape
for layer in self.layers:
logger.debug(f" {layer._get_name()}")
self._output_shape = (in_channels, img_height, img_width)
def get_is_bottleneck(self, resnet_size: int, block_sizes: list[int]) -> bool:
if (resnet_size is not None and resnet_size >= 50) or (block_sizes is not None and sum(block_sizes) >= 16):
return True
return False
def get_block_fn(self, is_bottleneck: bool) -> type[ResNetBlock] | type[ResNetBottleneckBlock]:
if is_bottleneck:
return ResNetBottleneckBlock
return ResNetBlock
def get_blocks(
self, resnet_size: int, block_sizes: list[int] | None, block_strides: list[int] | None
) -> tuple[list[int], list[int]]:
if block_sizes is None:
block_sizes = get_resnet_block_sizes(resnet_size)
if block_strides is None:
block_strides = [1] + [2 for _ in range(len(block_sizes) - 1)]
return block_sizes, block_strides
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
hidden = inputs
for layer in self.layers:
hidden = layer(hidden)
return hidden
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
################################################################################
# The following code for ResNet is adapted from the TensorFlow implementation
# https://github.com/tensorflow/models/blob/master/official/resnet/resnet_model.py
################################################################################
################################################################################
# Convenience functions for building the ResNet model.
################################################################################
resnet_choices = {
8: [1, 2, 2],
14: [1, 2, 2],
18: [2, 2, 2, 2],
34: [3, 4, 6, 3],
50: [3, 4, 6, 3],
101: [3, 4, 23, 3],
152: [3, 8, 36, 3],
200: [3, 24, 36, 3],
}
def get_resnet_block_sizes(resnet_size):
"""Retrieve the size of each block_layer in the ResNet model.
The number of block layers used for the Resnet model varies according
to the size of the model. This helper grabs the layer set we want, throwing
an error if a non-standard size has been selected.
Args:
resnet_size: The number of convolutional layers needed in the model.
Returns:
A list of block sizes to use in building the model.
Raises:
ValueError: if invalid resnet_size is received.
"""
try:
return resnet_choices[resnet_size]
except KeyError:
raise ValueError(
f"Could not find layers for selected Resnet size.\n"
f"Size received: {resnet_size}; sizes allowed: {list(resnet_choices.keys())}."
)
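# The defaults that ResNet derives from `resnet_size` can be traced without building any layers.
# The sketch below is illustrative only and not part of Ludwig: `plan_resnet_stages` and
# `RESNET_CHOICES` are hypothetical names, the table is a copy of `resnet_choices` above, and the
# sketch assumes `block_sizes` and `block_strides` are left as None with a starting `out_channels` of 64.

```python
# Copy of the `resnet_choices` table, so the sketch runs standalone.
RESNET_CHOICES = {
    8: [1, 2, 2],
    14: [1, 2, 2],
    18: [2, 2, 2, 2],
    34: [3, 4, 6, 3],
    50: [3, 4, 6, 3],
    101: [3, 4, 23, 3],
    152: [3, 8, 36, 3],
    200: [3, 24, 36, 3],
}


def plan_resnet_stages(resnet_size: int, out_channels: int = 64) -> list[tuple[int, int, int]]:
    """Hypothetical helper: (num_blocks, stride, stage_output_channels) per block layer."""
    block_sizes = RESNET_CHOICES[resnet_size]
    # Default strides: the first stage keeps the resolution, later stages halve it.
    block_strides = [1] + [2] * (len(block_sizes) - 1)
    # With block_sizes left as None, the block type depends on resnet_size alone.
    is_bottleneck = resnet_size >= 50
    stages = []
    for num_blocks, stride in zip(block_sizes, block_strides):
        # Bottleneck blocks end with 4x the channels of their inner convolutions.
        stage_channels = 4 * out_channels if is_bottleneck else out_channels
        stages.append((num_blocks, stride, stage_channels))
        out_channels *= 2  # doubled for each subsequent block layer
    return stages


print(plan_resnet_stages(18))
print(plan_resnet_stages(50))
```

# Under these assumptions ResNet-34 and ResNet-50 share the same block counts and differ only in
# block type, which is why the stage widths of the latter are four times larger.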
class UNetDoubleConvLayer(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
in_channels: int,
out_channels: int,
norm: str | None = None,
):
"""Two Conv2d layers, each followed by a ReLU, used for U-Net.
Args:
img_height: the input image height
img_width: the input image width
in_channels: the number of input channels
out_channels: the number of output channels
norm: the normalization applied after each convolution; "batch" adds a
BatchNorm2d layer, any other value applies no normalization
"""
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
if norm == "batch":
self.layers.append(nn.BatchNorm2d(out_channels))
self.layers.append(nn.ReLU())
self.layers.append(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))
if norm == "batch":
self.layers.append(nn.BatchNorm2d(out_channels))
self.layers.append(nn.ReLU())
self._input_shape = (in_channels, img_height, img_width)
self._output_shape = (out_channels, img_height, img_width)
def forward(self, inputs):
hidden = inputs
for layer in self.layers:
hidden = layer(hidden)
return hidden
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
class UNetDownStack(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
in_channels: int,
norm: str = None,
stack_depth: int = 4,
):
"""Creates the contracting downsampling path of a U-Net stack.
Implements
U-Net: Convolutional Networks for Biomedical Image Segmentation
https://arxiv.org/abs/1505.04597
by Olaf Ronneberger, Philipp Fischer, Thomas Brox, May 2015.
Args:
img_height: the input image height
img_width: the input image width
in_channels: the number of input channels
norm: the normalization to be applied
stack_depth: the depth of the unet stack
"""
super().__init__()
self.conv_layers = nn.ModuleList()
self.down_layers = nn.ModuleList()
height = img_height
width = img_width
in_c = in_channels
out_c = 64
self._input_shape = (in_c, height, width)
for i in range(stack_depth):
self.conv_layers.append(UNetDoubleConvLayer(height, width, in_c, out_c, norm))
in_c = out_c
out_c = out_c * 2
self.down_layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
height = height // 2
width = width // 2
self.bottleneck = UNetDoubleConvLayer(height, width, in_c, out_c, norm)
self._output_shape = (out_c, height, width)
def forward(self, inputs):
skips = [] # skip connections
hidden = inputs
for conv_layer, down_layer in zip(self.conv_layers, self.down_layers):
hidden = conv_layer(hidden)
skips.append(hidden)
hidden = down_layer(hidden)
hidden = self.bottleneck(hidden)
return hidden, skips
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
class UNetUpStack(LudwigModule):
def __init__(
self,
img_height: int,
img_width: int,
out_channels: int,
norm: str = None,
stack_depth: int = 4,
):
"""Creates the expansive upsampling path of a U-Net stack.
Implements
U-Net: Convolutional Networks for Biomedical Image Segmentation
https://arxiv.org/abs/1505.04597
by Olaf Ronneberger, Philipp Fischer, Thomas Brox, May 2015.
Args:
img_height: the output image height
img_width: the output image width
out_channels: the number of output classes
norm: the normalization to be applied
stack_depth: the depth of the unet stack
"""
super().__init__()
self.conv_layers = nn.ModuleList()
self.up_layers = nn.ModuleList()
height = img_height >> stack_depth
width = img_width >> stack_depth
in_c = 64 << stack_depth
out_c = in_c // 2
self._input_shape = (in_c, height, width)
for i in range(stack_depth):
self.up_layers.append(nn.ConvTranspose2d(in_c, out_c, kernel_size=2, stride=2))
height = height * 2
width = width * 2
self.conv_layers.append(UNetDoubleConvLayer(height, width, out_c * 2, out_c, norm))
in_c = out_c
out_c = out_c // 2
self.last_conv = nn.Conv2d(in_c, out_channels, kernel_size=1, padding=0)
self._output_shape = (out_channels, img_height, img_width)
def forward(self, inputs, skips):
hidden = inputs
for conv_layer, up_layer in zip(self.conv_layers, self.up_layers):
hidden = up_layer(hidden)
skip = skips.pop()
hidden = torch.cat([hidden, skip], dim=1)
hidden = conv_layer(hidden)
hidden = self.last_conv(hidden)
return hidden
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
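# The shape bookkeeping shared by UNetDownStack and UNetUpStack can be checked with plain integers.
# This is a minimal sketch, not Ludwig code: `unet_shapes` and `skips_align` are hypothetical
# helpers that mirror the arithmetic above (64 base channels, 2x2 pooling, 2x transposed
# convolutions) and build no layers.

```python
def unet_shapes(img_height, img_width, stack_depth=4, base_channels=64):
    """Return (skip_shapes, bottleneck_shape) as (channels, height, width) tuples."""
    height, width, out_c = img_height, img_width, base_channels
    skips = []
    for _ in range(stack_depth):
        skips.append((out_c, height, width))  # double-conv output, kept as a skip
        out_c *= 2
        height //= 2  # 2x2 max pooling with stride 2
        width //= 2
    return skips, (out_c, height, width)


def skips_align(img_height, img_width, stack_depth=4):
    """Check that every 2x upsampling step lands on the matching skip's spatial size."""
    skips, (_, height, width) = unet_shapes(img_height, img_width, stack_depth)
    for _, skip_h, skip_w in reversed(skips):
        height, width = height * 2, width * 2  # ConvTranspose2d(kernel_size=2, stride=2)
        if (height, width) != (skip_h, skip_w):
            return False
    return True


skips, bottleneck = unet_shapes(256, 256)
print(skips, bottleneck)
print(skips_align(256, 256), skips_align(100, 100))
```

# The second check suggests that the concatenation in UNetUpStack.forward only lines up when the
# image height and width are divisible by 2 ** stack_depth.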
================================================
FILE: ludwig/modules/embedding_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from torch import nn
from ludwig.constants import TYPE
from ludwig.modules.initializer_modules import get_initializer
from ludwig.utils.data_utils import load_pretrained_embeddings
from ludwig.utils.torch_utils import get_torch_device, LudwigModule
logger = logging.getLogger(__name__)
DEVICE = get_torch_device()
def embedding_matrix(
vocab: list[str],
embedding_size: int,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embedding_initializer: str | dict | None = None,
) -> tuple[nn.Module, int]:
"""Returns initialized torch.nn.Embedding module and embedding size."""
vocab_size = len(vocab)
if representation == "dense":
if pretrained_embeddings:
embeddings_matrix = load_pretrained_embeddings(pretrained_embeddings, vocab)
if embeddings_matrix.shape[-1] != embedding_size:
if not force_embedding_size:
embedding_size = embeddings_matrix.shape[-1]
logger.info(f"Setting embedding size to be equal to {embeddings_matrix.shape[-1]}.")
else:
raise ValueError(
f"The size of the pretrained embeddings is "
f"{embeddings_matrix.shape[-1]}, but the specified "
f"embedding_size is {embedding_size}. Please change "
f"the embedding_size accordingly."
)
embedding_initializer_obj = torch.tensor(embeddings_matrix, dtype=torch.float32)
else:
if vocab_size < embedding_size and not force_embedding_size:
logger.info(
f" embedding_size ({embedding_size}) is greater than "
f"vocab_size ({vocab_size}). Setting embedding size to be "
f"equal to vocab_size."
)
embedding_size = vocab_size
if embedding_initializer is not None:
embedding_initializer_obj_ref = get_initializer(embedding_initializer)
else:
embedding_initializer_obj_ref = get_initializer({TYPE: "uniform", "a": -1.0, "b": 1.0})
embedding_initializer_obj = embedding_initializer_obj_ref([vocab_size, embedding_size])
embeddings = embedding_initializer_obj
elif representation == "sparse":
embedding_size = vocab_size
embeddings = get_initializer("identity")([vocab_size, embedding_size])
embeddings.requires_grad = False
else:
raise ValueError(f"Embedding representation {representation} not supported.")
embeddings = nn.Embedding.from_pretrained(embeddings, freeze=not embeddings_trainable)
return embeddings, embedding_size
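# How `embedding_matrix` settles on the final embedding size can be restated as a small pure
# function. This is a sketch under assumptions, not Ludwig code: `resolve_embedding_size` is a
# hypothetical name, and `pretrained_size` stands in for the last dimension of the loaded
# pretrained matrix.

```python
def resolve_embedding_size(
    vocab_size,
    embedding_size,
    representation="dense",
    pretrained_size=None,
    force_embedding_size=False,
):
    """Hypothetical helper mirroring the size rules of embedding_matrix."""
    if representation == "sparse":
        # Sparse embeddings are an identity matrix: one dimension per token.
        return vocab_size
    if representation != "dense":
        raise ValueError(f"Embedding representation {representation} not supported.")
    if pretrained_size is not None:
        if pretrained_size != embedding_size:
            if force_embedding_size:
                # A forced size that contradicts the pretrained vectors is an error.
                raise ValueError("pretrained size and embedding_size differ")
            return pretrained_size  # otherwise the pretrained size wins
        return embedding_size
    if vocab_size < embedding_size and not force_embedding_size:
        return vocab_size  # shrink to the vocabulary size unless forced
    return embedding_size


print(resolve_embedding_size(vocab_size=10, embedding_size=50))
print(resolve_embedding_size(vocab_size=10, embedding_size=50, force_embedding_size=True))
print(resolve_embedding_size(vocab_size=1000, embedding_size=50, pretrained_size=300))
```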
def embedding_matrix_on_device(
vocab: list[str],
embedding_size: int,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embeddings_on_cpu: bool = False,
embedding_initializer: str | dict | None = None,
) -> tuple[nn.Module, int]:
embeddings, embedding_size = embedding_matrix(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embedding_initializer=embedding_initializer,
)
if embeddings_on_cpu:
embeddings.to("cpu")
elif torch.cuda.is_available():
embeddings.to(device="cuda")
return embeddings, embedding_size
class Embed(LudwigModule):
"""Module to embed Category, Date, and H3 data types."""
def __init__(
self,
vocab: list[str],
embedding_size: int,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embeddings_on_cpu: bool = False,
dropout: float = 0.0,
embedding_initializer: str | dict | None = None,
):
super().__init__()
self.supports_masking = True
self.vocab_size = len(vocab)
self.embeddings, self.embedding_size = embedding_matrix_on_device(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
embedding_initializer=embedding_initializer,
)
if dropout > 0:
self.dropout = torch.nn.Dropout(p=dropout)
else:
self.dropout = None
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
if inputs.ndim != 2 or inputs.shape[1] != 1:
raise RuntimeError(
f"Embed only takes inputs of shape [batch x 1]. Received inputs with size: {inputs.size()}"
)
embedded = self.embeddings(inputs.long())
embedded = torch.squeeze(embedded, dim=1)
if self.dropout:
embedded = self.dropout(embedded)
return embedded
@property
def input_shape(self) -> torch.Size:
return torch.Size([1])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.embedding_size])
class EmbedSet(LudwigModule):
"""Module to embed Set data types, works on multi-hot encoded input."""
def __init__(
self,
vocab: list[str],
embedding_size: int,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embeddings_on_cpu: bool = False,
dropout: float = 0.0,
embedding_initializer: str | dict | None = None,
aggregation_function: str = "sum",
):
super().__init__()
self.supports_masking = True
self.vocab_size = len(vocab)
self.embeddings, self.embedding_size = embedding_matrix_on_device(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
embedding_initializer=embedding_initializer,
)
if dropout > 0:
self.dropout = torch.nn.Dropout(p=dropout)
else:
self.dropout = None
if aggregation_function == "sum":
self.aggregation_function = torch.sum
elif aggregation_function == "avg":
self.aggregation_function = torch.mean
else:
raise ValueError(f"Unsupported aggregation function {aggregation_function}")
self.register_buffer("vocab_indices", torch.arange(self.vocab_size))
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
"""
Params:
inputs: Boolean multi-hot tensor of size [batch x vocab_size], where
inputs[b, i] indicates that token i is present in sample b.
"""
# Convert multi-hot input to input of indices
inputs = inputs.int() * self.vocab_indices
embedded = self.embeddings(inputs.long())
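# Absent tokens were mapped to index 0 above; multiplying by the index tensor zeroes their embeddings.
# NOTE: `inputs` now holds indices rather than 0/1 flags, so the embedding of token i is also scaled by i.
mask = torch.unsqueeze(inputs, -1)
embedded = embedded * mask
# Aggregate (sum or mean) over the vocabulary dimension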
embedded = self.aggregation_function(embedded, dim=1)
if self.dropout:
embedded = self.dropout(embedded)
return embedded
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.vocab_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.embedding_size])
@property
def input_dtype(self):
return torch.bool
class EmbedWeighted(LudwigModule):
"""Module to embed Bag data type, works on input of token frequencies."""
def __init__(
self,
vocab: list[str],
embedding_size: int,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embeddings_on_cpu: bool = False,
dropout: float = 0.0,
embedding_initializer: str | dict | None = None,
):
super().__init__()
self.embeddings, self.embedding_size = embedding_matrix_on_device(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
embedding_initializer=embedding_initializer,
)
self.vocab_size = len(vocab)
if dropout > 0:
self.dropout = nn.Dropout(dropout)
else:
self.dropout = None
self.register_buffer("vocab_indices", torch.arange(self.vocab_size, dtype=torch.int32))
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
"""
Params:
inputs: Tensor of frequencies, where inputs[b, i] represents
frequency of token i in sample b of batch.
"""
# Convert to multi-hot input
signed_input = (inputs != 0).type(torch.int32)
multiple_hot_indexes = signed_input * self.vocab_indices
embedded = self.embeddings(multiple_hot_indexes)
# Weight each embedding by its token frequency; absent tokens (frequency 0) are zeroed out
mask = torch.unsqueeze(inputs, -1)
weighted_embedded = embedded * mask
# Sum over all the positive indices
embedded_reduced = torch.sum(weighted_embedded, dim=1)
if self.dropout:
embedded_reduced = self.dropout(embedded_reduced)
return embedded_reduced
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.vocab_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.embedding_size])
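# The frequency-weighted reduction in EmbedWeighted.forward can be followed on a single sample
# with plain lists. This is an illustrative sketch only: `embed_bag` is a hypothetical helper,
# and the tiny `table` stands in for the learned embedding matrix.

```python
def embed_bag(frequencies, table):
    """Frequency-weighted sum of embedding rows for one sample."""
    size = len(table[0])
    reduced = [0.0] * size
    for token, freq in enumerate(frequencies):
        # Absent tokens look up row 0, like the multi-hot index trick above ...
        index = token if freq != 0 else 0
        row = table[index]
        for d in range(size):
            # ... but multiplying by a frequency of 0 cancels that lookup.
            reduced[d] += row[d] * freq
    return reduced


table = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]  # 3 tokens, embedding size 2
print(embed_bag([0, 2, 1], table))
```

# Token 0 is absent, token 1 occurs twice and token 2 once, so only rows 1 and 2 contribute.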
class EmbedSequence(LudwigModule):
def __init__(
self,
vocab: list[str],
embedding_size: int,
max_sequence_length: int,
representation: str = "dense",
embeddings_trainable: bool = True,
pretrained_embeddings: str | None = None,
force_embedding_size: bool = False,
embeddings_on_cpu: bool = False,
dropout: float = 0.0,
embedding_initializer: str | dict | None = None,
):
super().__init__()
self.supports_masking = True
self.vocab_size = len(vocab)
self.max_sequence_length = max_sequence_length
self.embeddings, self.embedding_size = embedding_matrix_on_device(
vocab,
embedding_size,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
embedding_initializer=embedding_initializer,
)
if dropout > 0:
self.dropout = nn.Dropout(dropout)
else:
self.dropout = None
def forward(self, inputs: torch.Tensor, mask: torch.Tensor | None = None):
if inputs.dtype not in [torch.int, torch.long]:
raise RuntimeError(
f"Expected tensor of type torch.int or torch.long as input. Received {inputs.dtype} instead."
)
embedded = self.embeddings(inputs)
if self.dropout:
embedded = self.dropout(embedded)
return embedded
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length, self.embedding_size])
class TokenAndPositionEmbedding(LudwigModule):
def __init__(
self,
max_sequence_length,
vocab,
embedding_size,
representation="dense",
embeddings_trainable=True,
pretrained_embeddings=None,
force_embedding_size=False,
embeddings_on_cpu=False,
dropout=0.0,
embedding_initializer=None,
):
super().__init__()
self.max_sequence_length = max_sequence_length
self.embedding_size = embedding_size
self.token_embed = EmbedSequence(
vocab=vocab,
embedding_size=embedding_size,
max_sequence_length=max_sequence_length,
representation=representation,
embeddings_trainable=embeddings_trainable,
pretrained_embeddings=pretrained_embeddings,
force_embedding_size=force_embedding_size,
embeddings_on_cpu=embeddings_on_cpu,
dropout=dropout,
embedding_initializer=embedding_initializer,
)
self.position_embed = nn.Embedding(
num_embeddings=max_sequence_length, embedding_dim=self.token_embed.embedding_size
)
self.register_buffer("positions", torch.arange(0, max_sequence_length))
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.max_sequence_length])
@property
def output_shape(self) -> torch.Size:
return self.token_embed.output_shape
def forward(self, inputs, mask: torch.Tensor | None = None):
positions_hidden = self.position_embed(self.positions)
token_hidden = self.token_embed(inputs)
return token_hidden + positions_hidden
================================================
FILE: ludwig/modules/fully_connected_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from copy import deepcopy
import torch
from torch.nn import Dropout, Linear, ModuleList
from ludwig.modules.normalization_modules import create_norm_layer
from ludwig.utils.torch_utils import activations, initializer_registry, LudwigModule
logger = logging.getLogger(__name__)
class FCLayer(LudwigModule):
"""A torch.nn.Linear wrapper that declares input and output shapes, and enables the customization of:
1. how weights and biases are initialized
2. normalization (layer and batch)
3. activations
4. dropout
"""
def __init__(
self,
input_size: int,
input_rank: int = 2,
output_size: int = 256,
use_bias: bool = True,
weights_initializer: str = "xavier_uniform",
bias_initializer: str = "zeros",
norm: str | None = None,
norm_params: dict | None = None,
activation: str = "relu",
dropout: float = 0,
):
super().__init__()
self.layers = ModuleList()
self.input_size = input_size
self.output_size = output_size
fc = Linear(in_features=input_size, out_features=output_size, bias=use_bias)
self.layers.append(fc)
weights_initializer = initializer_registry[weights_initializer]
weights_initializer(fc.weight)
if use_bias:
bias_initializer = initializer_registry[bias_initializer]
bias_initializer(fc.bias)
if norm is not None:
norm_params = norm_params or {}
self.layers.append(create_norm_layer(norm, input_rank, output_size, **norm_params))
# Look up the activation class by name in the `activations` registry and instantiate it.
self.layers.append(activations[activation]())
if dropout > 0:
self.layers.append(Dropout(dropout))
def forward(self, inputs, mask=None):
hidden = inputs
for layer in self.layers:
hidden = layer(hidden)
return hidden
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.output_size])
class FCStack(LudwigModule):
"""A stack of FCLayers.
Each FCLayer is specified by one dictionary in the `layers` list, whose keys correspond to an
FCLayer's constructor arguments, e.g.
[
{"input_size": 2, "output_size": 4},
{"output_size": 4, "use_bias": False},
]
`default_*` parameters dictate default values to use for each FCLayer, if not specified by `layers`. If `layers` is
`None`, then a stack of size `num_layers` of `FCLayer`s configured with all of the `default_*` parameters is used.
If `layers` is None and `num_layers` is 0, then there are no fully connected layers and this module serves as a
trivial passthrough.
"""
def __init__(
self,
first_layer_input_size: int,
layers: list[dict] | None = None,
num_layers: int = 1,
default_input_rank: int = 2,
default_output_size: int = 256,
default_use_bias: bool = True,
default_weights_initializer: str = "xavier_uniform",
default_bias_initializer: str = "zeros",
default_norm: str | None = None,
default_norm_params: dict | None = None,
default_activation: str = "relu",
default_dropout: float = 0,
residual: bool = False,
**kwargs,
):
super().__init__()
self.input_size = first_layer_input_size
self.norm_layer = None
if default_norm is not None:
norm_params = default_norm_params or {}
self.norm_layer = create_norm_layer(default_norm, default_input_rank, self.input_size, **norm_params)
self.dropout = None
if default_dropout > 0:
self.dropout = torch.nn.Dropout(default_dropout)
if layers is None:
self.layers = []
for i in range(num_layers):
self.layers.append({})
else:
# deep copy the layer definitions so that we don't modify the original
self.layers = deepcopy(layers)
if len(self.layers) > 0 and "input_size" not in self.layers[0]:
self.layers[0]["input_size"] = first_layer_input_size
for i, layer in enumerate(self.layers):
if i != 0:
layer["input_size"] = self.layers[i - 1]["output_size"]
if "input_rank" not in layer:
layer["input_rank"] = default_input_rank
if "output_size" not in layer:
layer["output_size"] = default_output_size
if "use_bias" not in layer:
layer["use_bias"] = default_use_bias
if "weights_initializer" not in layer:
layer["weights_initializer"] = default_weights_initializer
if "bias_initializer" not in layer:
layer["bias_initializer"] = default_bias_initializer
if "norm" not in layer:
layer["norm"] = default_norm
if "norm_params" not in layer:
layer["norm_params"] = default_norm_params
if "activation" not in layer:
layer["activation"] = default_activation
if "dropout" not in layer:
layer["dropout"] = default_dropout
self.stack = ModuleList()
for i, layer in enumerate(self.layers):
self.stack.append(
FCLayer(
input_size=layer["input_size"],
input_rank=layer["input_rank"],
output_size=layer["output_size"],
use_bias=layer["use_bias"],
weights_initializer=layer["weights_initializer"],
bias_initializer=layer["bias_initializer"],
norm=layer["norm"],
norm_params=layer["norm_params"],
activation=layer["activation"],
dropout=layer["dropout"],
)
)
self.residual = residual
def forward(self, inputs, mask=None):
hidden = inputs
if self.norm_layer is not None:
hidden = self.norm_layer(hidden)
if self.dropout is not None:
hidden = self.dropout(hidden)
prev_fc_layer_size = self.input_size
for layer in self.stack:
out = layer(hidden)
if self.residual and layer.output_size == prev_fc_layer_size:
hidden = hidden + out
else:
hidden = out
prev_fc_layer_size = layer.layers[0].out_features
return hidden
@property
def num_layers(self) -> int:
return len(self.layers)
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
if len(self.stack) > 0:
return self.stack[-1].output_shape
return torch.Size([self.input_size])
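# Illustrative sketch (not part of Ludwig): how the `layers` list and the `default_*` values described in the
# FCStack docstring could be merged into one fully specified dict per layer. `resolve_layer_specs` and its
# keyword defaults are hypothetical names; the sketch uses plain dicts only and builds no torch modules.

```python
from copy import deepcopy


def resolve_layer_specs(first_layer_input_size, layers=None, num_layers=1, **defaults):
    """Hypothetical helper mirroring the default-filling loop in FCStack.__init__."""
    # With no explicit layer list, start from `num_layers` empty specs.
    specs = [{} for _ in range(num_layers)] if layers is None else deepcopy(layers)
    if specs and "input_size" not in specs[0]:
        specs[0]["input_size"] = first_layer_input_size
    for i, spec in enumerate(specs):
        if i != 0:
            # Every later layer consumes the previous layer's output.
            spec["input_size"] = specs[i - 1]["output_size"]
        for key, value in defaults.items():
            # Only fill in keys the caller did not set explicitly.
            spec.setdefault(key, value)
    return specs


specs = resolve_layer_specs(
    2,
    layers=[{"output_size": 4}, {"use_bias": False}],
    output_size=256,
    use_bias=True,
)
print(specs)
```

# Usage note: the second layer keeps its explicit `use_bias=False`, inherits `output_size=256` from the defaults,
# and takes its `input_size` from the first layer's `output_size`.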
================================================
FILE: ludwig/modules/initializer_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import torch
from ludwig.constants import TYPE
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import initializer_registry
def _create_and_init(init_fn, init_kwargs, *args, **kwargs):
t = torch.empty(*args, **kwargs)
init_fn(t, **init_kwargs)
return t
def get_initializer(parameters):
if parameters is None:
return lambda *args, **kwargs: _create_and_init(initializer_registry[parameters], {}, *args, **kwargs)
elif isinstance(parameters, str):
initializer_fun = get_from_registry(parameters, initializer_registry)
return lambda *args, **kwargs: _create_and_init(initializer_fun, {}, *args, **kwargs)
elif isinstance(parameters, dict):
initializer_fun = get_from_registry(parameters[TYPE], initializer_registry)
init_kwargs = parameters.copy()
del init_kwargs[TYPE]
return lambda *args, **kwargs: _create_and_init(initializer_fun, init_kwargs, *args, **kwargs)
else:
raise ValueError(
"Initializer parameters should be either strings or dictionaries, "
f"but the provided parameters are a {type(parameters)}. "
f"Parameters values: {parameters}"
)
================================================
FILE: ludwig/modules/loss_implementations/__init__.py
================================================
================================================
FILE: ludwig/modules/loss_implementations/corn.py
================================================
# Source: https://github.com/Raschka-research-group/coral-pytorch/blob/main/coral_pytorch/losses.py
# Sebastian Raschka 2020-2021
# coral_pytorch
# Author: Sebastian Raschka
#
# License: MIT
import torch
import torch.nn.functional as F
def corn_loss(logits, y_train, num_classes):
"""Computes the CORN loss described in our forthcoming 'Deep Neural Networks for Rank Consistent Ordinal
Regression based on Conditional Probabilities' manuscript.
Parameters
----------
logits : torch.tensor, shape=(num_examples, num_classes-1)
Outputs of the CORN layer.
y_train : torch.tensor, shape=(num_examples)
Torch tensor containing the class labels.
num_classes : int
Number of unique class labels (class labels should start at 0).
Returns
----------
loss : torch.tensor
A torch.tensor containing a single loss value.
Examples
----------
>>> # Consider 8 training examples
>>> _ = torch.manual_seed(123)
>>> X_train = torch.rand(8, 99)
>>> y_train = torch.tensor([0, 1, 2, 2, 2, 3, 4, 4])
>>> NUM_CLASSES = 5
>>> #
>>> #
>>> # def __init__(self):
>>> corn_net = torch.nn.Linear(99, NUM_CLASSES-1)
>>> #
>>> #
>>> # def forward(self, X_train):
>>> logits = corn_net(X_train)
>>> logits.shape
torch.Size([8, 4])
>>> corn_loss(logits, y_train, NUM_CLASSES)
tensor(0.7127, grad_fn=<DivBackward0>)
"""
sets = []
for i in range(num_classes - 1):
label_mask = y_train > i - 1
label_tensor = (y_train[label_mask] > i).to(torch.int64)
sets.append((label_mask, label_tensor))
num_examples = 0
losses = 0.0
for task_index, s in enumerate(sets):
train_examples = s[0]
train_labels = s[1]
if len(train_labels) < 1:
continue
num_examples += len(train_labels)
pred = logits[train_examples, task_index]
loss = -torch.sum(F.logsigmoid(pred) * train_labels + (F.logsigmoid(pred) - pred) * (1 - train_labels))
losses += loss
return losses / num_examples
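# Illustrative sketch (not part of the original file): the conditional subsets that the first loop of `corn_loss`
# builds, written with plain Python lists instead of tensors. `corn_task_subsets` is a hypothetical name. Task i
# keeps only the examples whose label is at least i and asks whether the label exceeds i.

```python
def corn_task_subsets(y_train, num_classes):
    """Plain-Python illustration of the per-task subsets used by the CORN loss."""
    subsets = []
    for i in range(num_classes - 1):
        # Same condition as `y_train > i - 1` in corn_loss: labels >= i.
        kept = [y for y in y_train if y > i - 1]
        # Binary target for task i: is the label strictly greater than i?
        labels = [int(y > i) for y in kept]
        subsets.append((kept, labels))
    return subsets


subsets = corn_task_subsets([0, 1, 2, 2, 2, 3, 4, 4], num_classes=5)
for task_index, (kept, labels) in enumerate(subsets):
    print(task_index, kept, labels)
```

# The total number of (example, task) pairs across all subsets is the denominator `num_examples` by which
# `corn_loss` divides the summed loss.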
================================================
FILE: ludwig/modules/loss_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import torch
from torch import nn, Tensor
from torch.nn import HuberLoss as _HuberLoss
from torch.nn import L1Loss
from torch.nn import MSELoss as _MSELoss
from torchmetrics.functional import mean_absolute_percentage_error
import ludwig.utils.loss_utils as utils
from ludwig.constants import LOGITS
from ludwig.modules.loss_implementations.corn import corn_loss
from ludwig.schema.features.loss.loss import (
BaseLossConfig,
BWCEWLossConfig,
CORNLossConfig,
HuberLossConfig,
MAELossConfig,
MAPELossConfig,
MSELossConfig,
NextTokenSoftmaxCrossEntropyLossConfig,
RMSELossConfig,
RMSPELossConfig,
SequenceSoftmaxCrossEntropyLossConfig,
SigmoidCrossEntropyLossConfig,
SoftmaxCrossEntropyLossConfig,
)
from ludwig.utils import strings_utils
from ludwig.utils.registry import Registry
# used for Laplace smoothing for candidate samplers
EPSILON = 1.0e-10
loss_impl_registry = Registry[type[nn.Module]]()
def register_loss(config_cls: type[BaseLossConfig]):
def wrap(cls: type[nn.Module]):
loss_impl_registry[config_cls] = cls
return cls
return wrap
def create_loss(config: BaseLossConfig) -> nn.Module:
return loss_impl_registry[type(config)](config)
class LogitsInputsMixin:
@classmethod
def get_loss_inputs(cls):
"""Maps loss to the desired predicted input type."""
return LOGITS
@register_loss(MSELossConfig)
class MSELoss(_MSELoss, LogitsInputsMixin):
"""Mean squared error."""
def __init__(self, config: MSELossConfig):
super().__init__()
@register_loss(MAELossConfig)
class MAELoss(L1Loss, LogitsInputsMixin):
"""Mean absolute error."""
def __init__(self, config: MAELossConfig):
super().__init__()
@register_loss(MAPELossConfig)
class MAPELoss(nn.Module, LogitsInputsMixin):
"""Mean absolute percentage error."""
def __init__(self, config: MAPELossConfig):
super().__init__()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
return mean_absolute_percentage_error(preds, target)
@register_loss(RMSELossConfig)
class RMSELoss(nn.Module, LogitsInputsMixin):
"""Root mean square error."""
def __init__(self, config: RMSELossConfig):
super().__init__()
self.mse = nn.MSELoss()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
return torch.sqrt(self.mse(preds, target))
@register_loss(RMSPELossConfig)
class RMSPELoss(nn.Module, LogitsInputsMixin):
"""Root mean square percentage error."""
def __init__(self, config: RMSPELossConfig):
super().__init__()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
loss = utils.rmspe_loss(target, preds)
return loss
@register_loss(BWCEWLossConfig)
class BWCEWLoss(nn.Module, LogitsInputsMixin):
"""Binary weighted cross entropy loss."""
def __init__(self, config: BWCEWLossConfig):
super().__init__()
if config.positive_class_weight:
self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([config.positive_class_weight]))
else:
self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=config.positive_class_weight)
self.robust_lambda = config.robust_lambda
self.confidence_penalty = config.confidence_penalty
def forward(self, preds: torch.Tensor, target: torch.Tensor):
train_loss = self.loss_fn(preds, target.float())
# robust lambda
if self.robust_lambda > 0:
train_loss = (1 - self.robust_lambda) * train_loss + self.robust_lambda / 2
train_mean_loss = torch.mean(train_loss)
# confidence penalty
if self.confidence_penalty > 0:
probabilities = torch.sigmoid(preds)
mean_penalty = utils.mean_confidence_penalty(probabilities, 2)
train_mean_loss += self.confidence_penalty * mean_penalty
return train_mean_loss
@register_loss(SoftmaxCrossEntropyLossConfig)
class SoftmaxCrossEntropyLoss(nn.Module, LogitsInputsMixin):
def __init__(self, config: SoftmaxCrossEntropyLossConfig):
"""
Params:
class_weights: List or 1D tensor of length equal to number of classes.
"""
super().__init__()
if config.class_weights:
self.loss_fn = nn.CrossEntropyLoss(weight=torch.Tensor(config.class_weights))
else:
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
"""
Params:
preds: Tensor of shape [batch x num_classes]
or shape [batch x num_classes x H x W]
target: Tensor of shape [batch], where each element is integral
between 0 and num_classes.
or shape [batch x H x W], where each element is integral
between 0 and num_classes.
"""
if len(target.shape) == 1 or len(target.shape) == 3:
# Assumes we are providing the target as a single class, rather than a distribution
# The target shape can be a 3D tensor [batch x H x W], for image segmentation
target = target.long()
return self.loss_fn(preds, target)
@register_loss(SequenceSoftmaxCrossEntropyLossConfig)
class SequenceSoftmaxCrossEntropyLoss(nn.Module, LogitsInputsMixin):
def __init__(self, config: SequenceSoftmaxCrossEntropyLossConfig):
"""
Params:
class_weights: List or 1D tensor of length equal to number of classes.
"""
super().__init__()
if config.class_weights:
self.loss_fn = nn.CrossEntropyLoss(
weight=torch.Tensor(config.class_weights), ignore_index=strings_utils.SpecialSymbol.PADDING.value
)
else:
self.loss_fn = nn.CrossEntropyLoss(ignore_index=strings_utils.SpecialSymbol.PADDING.value)
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
"""
Params:
preds: Tensor of shape [batch x sequence_length x vocab_size]
target: Tensor of shape [batch x sequence_length], where each element is integral between 0 and vocab_size.
"""
target = target.long()
return self.loss_fn(preds[1:].view(-1, preds.size(-1)), target[1:].view(-1))
@register_loss(NextTokenSoftmaxCrossEntropyLossConfig)
class NextTokenSoftmaxCrossEntropyLoss(nn.Module, LogitsInputsMixin):
def __init__(self, config: NextTokenSoftmaxCrossEntropyLossConfig):
super().__init__()
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
"""
Params:
preds: Tensor of shape [batch x sequence_length x vocab_size]
target: Tensor of shape [batch x sequence_length], where each element is integral between 0 and vocab_size.
Reference implementation:
https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/models/bert/modeling_bert.py#LL1253C1-L1260C1 # noqa
"""
target = target.long()
_, _, vocab_size = preds.shape
# Use the logits at every position except the last: the logits at position i score the next token i+1 (the token
# we would pick by taking an argmax over the logits tensor at position i).
shifted_predictions = preds[:, :-1, :]
# Shift by 1 since the logits at position 0 in predictions represent the log likelihood of target token 1
shifted_targets = target[:, 1:]
return self.loss_fn(shifted_predictions.reshape(-1, vocab_size), shifted_targets.reshape(-1))
@register_loss(SigmoidCrossEntropyLossConfig)
class SigmoidCrossEntropyLoss(nn.Module, LogitsInputsMixin):
def __init__(self, config: SigmoidCrossEntropyLossConfig):
"""
Params:
class_weights: List or 1D tensor of length equal to number of classes.
"""
super().__init__()
if config.class_weights:
self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(config.class_weights))
else:
self.loss_fn = nn.BCEWithLogitsLoss()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
if preds.ndim != 2:
raise RuntimeError("SigmoidCrossEntropyLoss currently only supported for 2D tensors.")
return self.loss_fn(preds.type(torch.float32), target.type(torch.float32))
@register_loss(HuberLossConfig)
class HuberLoss(_HuberLoss, LogitsInputsMixin):
"""Huber loss."""
def __init__(self, config: HuberLossConfig):
super().__init__(delta=config.delta)
@register_loss(CORNLossConfig)
class CORNLoss(nn.Module, LogitsInputsMixin):
"""CORN loss."""
def __init__(self, config: CORNLossConfig):
super().__init__()
def forward(self, preds: Tensor, target: Tensor) -> Tensor:
num_classes = preds.shape[1]
return corn_loss(preds, target, num_classes=num_classes)
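# Illustrative sketch (not part of Ludwig): the config-class-to-implementation registry pattern used by
# `register_loss` and `create_loss`, reduced to plain Python. All names below (`registry`, `register`,
# `ScaledLossConfig`, `ScaledAbsLoss`, `create`) are hypothetical and stand in for the real classes.

```python
registry = {}


def register(config_cls):
    """Decorator factory: map a config class to the class that implements it."""

    def wrap(cls):
        registry[config_cls] = cls
        return cls

    return wrap


class ScaledLossConfig:
    def __init__(self, scale=1.0):
        self.scale = scale


@register(ScaledLossConfig)
class ScaledAbsLoss:
    def __init__(self, config):
        self.scale = config.scale

    def __call__(self, pred, target):
        return self.scale * abs(pred - target)


def create(config):
    # Look up the implementation by the config's type, then build it from the config.
    return registry[type(config)](config)


loss = create(ScaledLossConfig(scale=2.0))
print(loss(3.0, 1.0))
```

# Design note: keying the registry on the config type means callers only ever handle config objects; the
# implementing class is resolved at construction time.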
================================================
FILE: ludwig/modules/lr_scheduler.py
================================================
import logging
import math
from collections.abc import Callable
from typing import Any
from torch.optim import Optimizer
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts, LambdaLR, ReduceLROnPlateau, SequentialLR
from ludwig.constants import MINIMIZE, TRAINING, VALIDATION
from ludwig.modules.metric_registry import get_metric_objective
from ludwig.schema.lr_scheduler import LRSchedulerConfig
from ludwig.utils.metric_utils import TrainerMetric
from ludwig.utils.trainer_utils import ProgressTracker
logger = logging.getLogger(__name__)
class ReduceLROnPLateauCappedDecreases(ReduceLROnPlateau):
def __init__(self, optimizer: Optimizer, mode: str, reduce_limit: int, factor: float, patience: int):
super().__init__(optimizer, mode=mode, factor=factor, patience=patience)
self.reduce_limit = reduce_limit
self._num_reduce_lr = 0
def step(self, metrics):
if self._num_reduce_lr >= self.reduce_limit:
# Already reduced the LR as many times as we will allow
return
return super().step(metrics)
@property
def num_reduce_lr(self) -> int:
return self._num_reduce_lr
def _reduce_lr(self, epoch=None):
"""Overrides the base ReduceLROnPlateau implementation."""
self._num_reduce_lr += 1
self.apply_lr()
def apply_lr(self):
if self._num_reduce_lr == 0:
return
for i, param_group in enumerate(self.optimizer.param_groups):
old_lr = float(param_group["lr"])
new_lr = max(old_lr * math.pow(self.factor, self._num_reduce_lr), self.min_lrs[i])
if old_lr - new_lr > self.eps:
param_group["lr"] = new_lr
logger.info(f"From ReduceLROnPLateauCappedDecreases, reducing learning rate to {new_lr}")
class LRScheduler:
def __init__(
self,
config: LRSchedulerConfig,
optimizer: Optimizer,
steps_per_checkpoint: int,
total_steps: int,
):
self.config = config
self.optimizer = optimizer
# Scheduler updated each training step
self.step_info = StepInfo(steps_per_checkpoint, total_steps, self.config)
self._train_scheduler = get_schedule_with_warmup_and_decay(self.config, self.optimizer, self.step_info)
# Scheduler updated each eval step
self._eval_scheduler = None
if self.config.reduce_on_plateau > 0:
mode = "min" if get_metric_objective(self.config.reduce_eval_metric) == MINIMIZE else "max"
self._eval_scheduler = ReduceLROnPLateauCappedDecreases(
optimizer=self.optimizer,
mode=mode,
reduce_limit=self.config.reduce_on_plateau,
factor=self.config.reduce_on_plateau_rate,
patience=self.config.reduce_on_plateau_patience,
)
def step(self):
"""Called every step of training."""
self._train_scheduler.step()
if self._eval_scheduler is not None:
# We apply this scheduler every eval step, not train step, so we don't want to call step() here.
# However, we need to re-apply the LR reduction to the LR from the train scheduler, as the first scheduler
# resets the LR back to the base LR.
self._eval_scheduler.apply_lr()
def eval_step(self, progress_tracker: ProgressTracker, validation_field: str):
"""Called every checkpoint evaluation step."""
if self._eval_scheduler is None:
# No reduce on plateau
return
if self.config.reduce_eval_split == TRAINING:
split_metrics = progress_tracker.train_metrics
elif self.config.reduce_eval_split == VALIDATION:
split_metrics = progress_tracker.validation_metrics
else: # if self.config.reduce_eval_split == TEST:
split_metrics = progress_tracker.test_metrics
validation_metric = self.config.reduce_eval_metric
last_metric: TrainerMetric = split_metrics[validation_field][validation_metric][-1]
last_metric_value = last_metric[-1]
prev_num_reductions = self._eval_scheduler.num_reduce_lr
self._eval_scheduler.step(last_metric_value)
num_reductions = self._eval_scheduler.num_reduce_lr
if num_reductions > prev_num_reductions:
# LR reduction -> update progress tracker
progress_tracker.last_learning_rate_reduction_steps = progress_tracker.steps
progress_tracker.last_learning_rate_reduction = 0
progress_tracker.num_reductions_learning_rate += 1
else:
progress_tracker.last_learning_rate_reduction = (
progress_tracker.steps - progress_tracker.last_learning_rate_reduction_steps
)
def state_dict(self) -> dict[str, Any]:
return {
"train_scheduler_state": self._train_scheduler.state_dict(),
"eval_scheduler_state": self._eval_scheduler.state_dict() if self._eval_scheduler is not None else {},
}
def load_state_dict(self, d: dict[str, Any]):
self._train_scheduler.load_state_dict(d["train_scheduler_state"])
if self._eval_scheduler is not None:
self._eval_scheduler.load_state_dict(d["eval_scheduler_state"])
class StepInfo:
"""Stores the steps_per_checkpoint and total_steps used during the current training run.
This class is needed by LambdaLR to allow us to update the steps on training init without resetting the entire
LRScheduler from scratch (which would result in resetting the optimizer learning rate).
"""
def __init__(self, steps_per_checkpoint: int, total_steps: int, config: LRSchedulerConfig):
self.config = config
self.steps_per_checkpoint = steps_per_checkpoint
self.num_training_steps = total_steps
if self.config.warmup_fraction > 0 and self.config.warmup_evaluations > 0:
logger.info(
"Both `learning_rate_scheduler.warmup_fraction` and `learning_rate_scheduler.warmup_evaluations` "
"provided. The larger of the two (as a function of the total training steps) will be used."
)
num_warmup_steps = 0
if self.config.warmup_fraction > 0:
num_warmup_steps = max(self.config.warmup_fraction * self.num_training_steps, num_warmup_steps)
if self.config.warmup_evaluations > 0:
num_warmup_steps = max(self.config.warmup_evaluations * self.steps_per_checkpoint, num_warmup_steps)
self.num_warmup_steps = num_warmup_steps
def get_schedule_with_warmup_and_decay(
config: LRSchedulerConfig,
optimizer: Optimizer,
step_info: StepInfo,
) -> LambdaLR:
"""Creates a learning rate scheduler that updates each training step."""
schedulers = []
# Warmup scheduler.
if step_info.num_warmup_steps > 0:
warmup_scheduler = LambdaLR(
optimizer,
lambda current_step: float(current_step) / float(max(1, step_info.num_warmup_steps)),
)
schedulers.append(warmup_scheduler)
# Decay scheduler.
decay = config.decay
decay_scheduler = decay_registry[decay](config, optimizer, step_info)
schedulers.append(decay_scheduler)
if len(schedulers) == 1:
# Only one scheduler, so no need to wrap in a SequentialLR.
return schedulers[0]
# Return a SequentialLR that applies the warmup and decay schedulers in order
# with the warmup scheduler only applied for the first num_warmup_steps steps.
return SequentialLR(optimizer, schedulers=schedulers, milestones=[step_info.num_warmup_steps])
def no_decay(current_step: int, num_training_steps: int, num_warmup_steps: int, config: LRSchedulerConfig):
return 1.0
def linear_decay(current_step: int, num_training_steps: int, num_warmup_steps: int, config: LRSchedulerConfig):
return max(
0.0,
float(num_training_steps - num_warmup_steps - current_step)
/ float(max(1, num_training_steps - num_warmup_steps)),
)
def exponential_decay(current_step: int, num_training_steps: int, num_warmup_steps: int, config: LRSchedulerConfig):
decay_rate = float(config.decay_rate)
decay_steps = float(config.decay_steps)
step = float(current_step)
exponent = 1 + step / decay_steps
if config.staircase:
exponent = math.ceil(exponent)
return math.pow(decay_rate, exponent)
def wrap_decay_fn(decay_fn: Callable) -> Callable:
def init_fn(config: LRSchedulerConfig, optimizer: Optimizer, step_info: StepInfo) -> LambdaLR:
return LambdaLR(
optimizer,
lambda current_step: decay_fn(
current_step, step_info.num_training_steps, step_info.num_warmup_steps, config
),
)
return init_fn
def init_cosine_decay(
config: LRSchedulerConfig,
optimizer: Optimizer,
step_info: StepInfo,
) -> CosineAnnealingWarmRestarts:
t_0 = config.t_0
if not t_0:
t_0 = step_info.steps_per_checkpoint
if not t_0:
# A scheduler may be initialized with dummy values like at the start of training.
# Ensure that t_0 != 0, as this causes an error to be raised.
t_0 = 1
return CosineAnnealingWarmRestarts(
optimizer,
T_0=t_0,
T_mult=config.t_mult or 1,
eta_min=config.eta_min or 0,
)
decay_registry = {
None: wrap_decay_fn(no_decay),
"linear": wrap_decay_fn(linear_decay),
"exponential": wrap_decay_fn(exponential_decay),
"cosine": init_cosine_decay,
}
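# Illustrative sketch (not part of Ludwig): the learning-rate multiplier produced by linear warmup followed by
# `linear_decay`, as a pure function of the global step. It assumes the decay schedule's step counter restarts at
# zero once warmup ends, which is my reading of how SequentialLR hands over at a milestone; treat that as an
# assumption. `num_warmup_steps` and `lr_multiplier` are hypothetical names.

```python
def num_warmup_steps(total_steps, steps_per_checkpoint, warmup_fraction=0.0, warmup_evaluations=0):
    """Larger of the two warmup specifications, as in StepInfo.__init__."""
    return max(warmup_fraction * total_steps, warmup_evaluations * steps_per_checkpoint, 0)


def lr_multiplier(step, total_steps, warmup_steps):
    """Factor applied to the base learning rate at a given global step."""
    if step < warmup_steps:
        # Linear warmup from 0 towards 1.
        return float(step) / float(max(1, warmup_steps))
    # Assumption: the decay counter starts at 0 when warmup finishes.
    decay_step = step - warmup_steps
    remaining = total_steps - warmup_steps
    return max(0.0, float(remaining - decay_step) / float(max(1, remaining)))


warmup = num_warmup_steps(total_steps=100, steps_per_checkpoint=10, warmup_fraction=0.1, warmup_evaluations=2)
for step in (0, 10, 20, 60, 100):
    print(step, lr_multiplier(step, total_steps=100, warmup_steps=warmup))
```

# Under these assumptions the multiplier climbs linearly to 1.0 at the end of warmup and then falls linearly to 0.0
# at the final training step.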
================================================
FILE: ludwig/modules/metric_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import sys
from abc import ABC, abstractmethod
from collections.abc import Callable, Generator
from contextlib import contextmanager
from typing import Any
import torch
from torch import Tensor, tensor
from torchmetrics import MeanAbsoluteError, MeanAbsolutePercentageError
from torchmetrics import MeanMetric as _MeanMetric
from torchmetrics import MeanSquaredError, Metric
from torchmetrics.classification import (
BinaryAccuracy,
BinaryAUROC,
BinaryPrecision,
BinaryRecall,
BinarySpecificity,
MulticlassAccuracy,
MulticlassAUROC,
)
from torchmetrics.functional.regression.r2 import _r2_score_compute, _r2_score_update
from torchmetrics.metric import jit_distributed_available
from torchmetrics.text import BLEUScore, CharErrorRate, WordErrorRate
from torchmetrics.text.perplexity import Perplexity
from torchmetrics.text.rouge import ROUGEScore
from ludwig.constants import ( # RESPONSE,
ACCURACY,
ACCURACY_MICRO,
BINARY,
BINARY_WEIGHTED_CROSS_ENTROPY,
CATEGORY,
CATEGORY_DISTRIBUTION,
CORN,
HITS_AT_K,
HUBER,
IGNORE_INDEX_TOKEN_ID,
IMAGE,
JACCARD,
LOGITS,
LOSS,
MAXIMIZE,
MEAN_ABSOLUTE_ERROR,
MEAN_ABSOLUTE_PERCENTAGE_ERROR,
MEAN_SQUARED_ERROR,
MINIMIZE,
NEXT_TOKEN_PERPLEXITY,
NUMBER,
PERPLEXITY,
PRECISION,
PREDICTIONS,
PROBABILITIES,
R2,
RECALL,
ROC_AUC,
ROOT_MEAN_SQUARED_ERROR,
ROOT_MEAN_SQUARED_PERCENTAGE_ERROR,
SEQUENCE,
SEQUENCE_ACCURACY,
SET,
SPECIFICITY,
TEXT,
TIMESERIES,
TOKEN_ACCURACY,
VECTOR,
)
from ludwig.distributed import get_current_dist_strategy
from ludwig.modules.loss_modules import (
BWCEWLoss,
CORNLoss,
HuberLoss,
NextTokenSoftmaxCrossEntropyLoss,
SequenceSoftmaxCrossEntropyLoss,
SigmoidCrossEntropyLoss,
SoftmaxCrossEntropyLoss,
)
from ludwig.modules.metric_registry import get_metric_objective, get_metric_registry, register_metric
from ludwig.schema.features.loss.loss import (
BWCEWLossConfig,
CORNLossConfig,
HuberLossConfig,
SequenceSoftmaxCrossEntropyLossConfig,
SigmoidCrossEntropyLossConfig,
SoftmaxCrossEntropyLossConfig,
)
from ludwig.utils.loss_utils import rmspe_loss
from ludwig.utils.metric_utils import masked_correct_predictions
from ludwig.utils.torch_utils import sequence_length_2D
logger = logging.getLogger(__name__)
class LudwigMetric(Metric, ABC):
@classmethod
def can_report(cls, feature: "OutputFeature") -> bool: # noqa: F821
return True
@contextmanager
def sync_context(
self,
dist_sync_fn: Callable | None = None,
process_group: Any | None = None,
should_sync: bool = True,
should_unsync: bool = True,
distributed_available: Callable | None = jit_distributed_available,
) -> Generator:
"""Override the behavior of this in the base class to support custom distributed strategies."""
dist_strategy = get_current_dist_strategy()
self.sync(
dist_sync_fn=dist_strategy.gather_all_tensors_fn(),
process_group=process_group,
should_sync=should_sync,
distributed_available=dist_strategy.is_available,
)
yield
self.unsync(should_unsync=self._is_synced and should_unsync)
@register_metric(ROOT_MEAN_SQUARED_ERROR, [NUMBER], MINIMIZE, PREDICTIONS)
class RMSEMetric(MeanSquaredError, LudwigMetric):
"""Root mean squared error metric."""
def __init__(self, **kwargs):
super().__init__(squared=False)
@register_metric(PRECISION, [BINARY], MAXIMIZE, PROBABILITIES)
class PrecisionMetric(BinaryPrecision, LudwigMetric):
"""Precision metric."""
def __init__(self, **kwargs):
super().__init__()
@register_metric(RECALL, [BINARY], MAXIMIZE, PROBABILITIES)
class RecallMetric(BinaryRecall, LudwigMetric):
"""Recall metric."""
def __init__(self, **kwargs):
super().__init__()
@register_metric(ROC_AUC, [BINARY], MAXIMIZE, PROBABILITIES)
class BinaryAUROCMetric(BinaryAUROC, LudwigMetric):
"""Area under the receiver operating characteristic curve."""
def __init__(self, **kwargs):
super().__init__()
def update(self, preds: Tensor, target: Tensor) -> None:
super().update(preds, target.type(torch.int8))
@register_metric(ROC_AUC, [CATEGORY, CATEGORY_DISTRIBUTION], MAXIMIZE, PROBABILITIES)
class CategoryAUROCMetric(MulticlassAUROC, LudwigMetric):
"""Area under the receiver operating characteristic curve."""
def __init__(self, num_classes: int, **kwargs):
super().__init__(num_classes=num_classes)
def update(self, preds: Tensor, target: Tensor) -> None:
if len(target.shape) > 1:
target = torch.argmax(target, dim=1)
super().update(preds, target)
@register_metric(SPECIFICITY, [BINARY], MAXIMIZE, PROBABILITIES)
class SpecificityMetric(BinarySpecificity, LudwigMetric):
"""Specificity metric."""
def __init__(self, **kwargs):
super().__init__()
class MeanMetric(LudwigMetric):
"""Abstract class for computing mean of metrics."""
def __init__(self, **kwargs):
super().__init__()
self.avg = _MeanMetric()
def update(self, preds: Tensor, target: Tensor) -> None:
self.avg.update(self.get_current_value(preds, target))
def compute(self) -> Tensor:
return self.avg.compute()
def reset(self):
super().reset()
self.avg.reset()
@abstractmethod
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
raise NotImplementedError()
@register_metric(ROOT_MEAN_SQUARED_PERCENTAGE_ERROR, [NUMBER], MINIMIZE, PREDICTIONS)
class RMSPEMetric(MeanMetric):
"""Root mean squared percentage error metric."""
def __init__(self, **kwargs):
super().__init__()
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
return rmspe_loss(target, preds)
@register_metric(R2, [NUMBER, VECTOR, TIMESERIES], MAXIMIZE, PREDICTIONS)
class R2Score(LudwigMetric):
"""Custom R-squared metric implementation that modifies the torchmetrics R-squared implementation to return NaN
when there is only one sample. This is because R-squared is only defined for two or more samples.
Custom implementation uses code from torchmetrics v0.9.2's implementation of R2:
https://github.com/Lightning-AI/metrics/blob/master/src/torchmetrics/regression/r2.py
"""
def __init__(
self, num_outputs: int = 1, adjusted: int = 0, multioutput: str = "uniform_average", **kwargs: Any
) -> None:
super().__init__(**kwargs)
self.num_outputs = num_outputs
if adjusted < 0 or not isinstance(adjusted, int):
raise ValueError("`adjusted` parameter should be an integer greater than or equal to 0.")
self.adjusted = adjusted
allowed_multioutput = ("raw_values", "uniform_average", "variance_weighted")
if multioutput not in allowed_multioutput:
raise ValueError(
f"Invalid input to argument `multioutput`. Choose one of the following: {allowed_multioutput}"
)
self.multioutput = multioutput
self.add_state("sum_squared_error", default=torch.zeros(self.num_outputs), dist_reduce_fx="sum")
self.add_state("sum_error", default=torch.zeros(self.num_outputs), dist_reduce_fx="sum")
self.add_state("residual", default=torch.zeros(self.num_outputs), dist_reduce_fx="sum")
self.add_state("total", default=tensor(0), dist_reduce_fx="sum")
def update(self, preds: Tensor, target: Tensor) -> None:
"""Update state with predictions and targets.
Args:
preds: Predictions from model
target: Ground truth values
"""
sum_squared_error, sum_error, residual, n_obs = _r2_score_update(preds, target)
self.sum_squared_error += sum_squared_error
self.sum_error += sum_error
self.residual += residual
self.total += n_obs
def compute(self) -> Tensor:
"""Computes r2 score over the metric states."""
# self.total maps to the number of observations in preds/target computed during update()
if self.total <= 1:
logger.warning(
"""R-squared (r2) is not defined for one sample. It needs at least two samples. Returning NaN."""
)
return torch.tensor(float("nan"))
return _r2_score_compute(
self.sum_squared_error, self.sum_error, self.residual, self.total, self.adjusted, self.multioutput
)
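# Illustrative sketch (not the torchmetrics code): how R-squared can be computed from running sums accumulated
# batch by batch, including the NaN guard for fewer than two samples. It uses the standard identity
# R2 = 1 - RSS / TSS with TSS = sum(t^2) - (sum(t))^2 / n; `StreamingR2` and its attribute names are hypothetical,
# and only the unadjusted single-output case is covered.

```python
import math


class StreamingR2:
    """Plain-Python streaming R-squared for a single output."""

    def __init__(self):
        self.sum_sq_target = 0.0  # running sum of squared targets
        self.sum_target = 0.0  # running sum of targets
        self.rss = 0.0  # running residual sum of squares
        self.total = 0  # number of observations seen

    def update(self, preds, target):
        self.sum_sq_target += sum(t * t for t in target)
        self.sum_target += sum(target)
        self.rss += sum((t - p) ** 2 for p, t in zip(preds, target))
        self.total += len(target)

    def compute(self):
        if self.total <= 1:
            # R-squared is undefined for fewer than two samples.
            return math.nan
        mean = self.sum_target / self.total
        tss = self.sum_sq_target - self.sum_target * mean
        return 1.0 - self.rss / tss


metric = StreamingR2()
metric.update([2.5, 0.0], [3.0, -0.5])
metric.update([2.0, 8.0], [2.0, 7.0])
r2 = metric.compute()
print(r2)
```

# Because only sums are stored, the state can be combined across batches (or across workers, by summing) before the
# final ratio is taken.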
@register_metric(LOSS, [], MINIMIZE, LOGITS)
class LossMetric(MeanMetric, ABC):
def __init__(self):
super().__init__()
@abstractmethod
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
raise NotImplementedError()
@classmethod
def can_report(cls, feature: "OutputFeature") -> bool: # noqa: F821
return False
@register_metric(BINARY_WEIGHTED_CROSS_ENTROPY, [BINARY], MINIMIZE, LOGITS)
class BWCEWLMetric(LossMetric):
"""Binary Weighted Cross Entropy Weighted Logits Score Metric."""
def __init__(self, config: BWCEWLossConfig, **kwargs):
super().__init__()
self.loss_function = BWCEWLoss(config)
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
return self.loss_function(preds, target)
@register_metric("softmax_cross_entropy", [CATEGORY, CATEGORY_DISTRIBUTION, IMAGE], MINIMIZE, LOGITS)
class SoftmaxCrossEntropyMetric(LossMetric):
def __init__(self, config: SoftmaxCrossEntropyLossConfig, **kwargs):
super().__init__()
self.softmax_cross_entropy_function = SoftmaxCrossEntropyLoss(config)
def get_current_value(self, preds: Tensor, target: Tensor):
return self.softmax_cross_entropy_function(preds, target)
@register_metric("sequence_softmax_cross_entropy", [SEQUENCE, TEXT], MINIMIZE, LOGITS)
class SequenceSoftmaxCrossEntropyMetric(LossMetric):
def __init__(self, config: SequenceSoftmaxCrossEntropyLossConfig, **kwargs):
super().__init__()
self.sequence_softmax_cross_entropy_function = SequenceSoftmaxCrossEntropyLoss(config)
def get_current_value(self, preds: Tensor, target: Tensor):
return self.sequence_softmax_cross_entropy_function(preds, target)
@register_metric("next_token_softmax_cross_entropy", [SEQUENCE, TEXT], MINIMIZE, LOGITS)
class NextTokenSoftmaxCrossEntropyMetric(LossMetric):
def __init__(self, config: SequenceSoftmaxCrossEntropyLossConfig, **kwargs):
super().__init__()
self.next_token_softmax_cross_entropy_function = NextTokenSoftmaxCrossEntropyLoss(config)
def get_current_value(self, preds: Tensor, target: Tensor):
return self.next_token_softmax_cross_entropy_function(preds, target)
@register_metric("sigmoid_cross_entropy", [SET], MINIMIZE, LOGITS)
class SigmoidCrossEntropyMetric(LossMetric):
def __init__(self, config: SigmoidCrossEntropyLossConfig, **kwargs):
super().__init__()
self.sigmoid_cross_entropy_function = SigmoidCrossEntropyLoss(config)
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
return self.sigmoid_cross_entropy_function(preds, target)
@register_metric(TOKEN_ACCURACY, [SEQUENCE, TEXT], MAXIMIZE, PREDICTIONS)
class TokenAccuracyMetric(MeanMetric):
def __init__(self, **kwargs):
super().__init__()
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
target = target.type(preds.dtype)
target_sequence_length = sequence_length_2D(target)
masked_correct_preds = masked_correct_predictions(target, preds, target_sequence_length)
return torch.mean(masked_correct_preds)
@register_metric(SEQUENCE_ACCURACY, [SEQUENCE, TEXT], MAXIMIZE, PREDICTIONS)
class SequenceAccuracyMetric(MeanMetric):
def __init__(self, **kwargs):
super().__init__()
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
return torch.sum(torch.all(preds == target, dim=1)) / target.size()[0]
@register_metric(PERPLEXITY, [SEQUENCE, TEXT], MINIMIZE, PROBABILITIES)
class PerplexityMetric(Perplexity, LudwigMetric):
def __init__(self, **kwargs):
super().__init__(ignore_index=IGNORE_INDEX_TOKEN_ID)
def update(self, preds: Tensor, target: Tensor) -> None:
super().update(preds, target.type(torch.int64))
@register_metric(NEXT_TOKEN_PERPLEXITY, [SEQUENCE, TEXT], MINIMIZE, PROBABILITIES)
class NextTokenPerplexityMetric(MeanMetric):
def __init__(self, **kwargs):
super().__init__()
self.next_token_softmax_cross_entropy_function = NextTokenSoftmaxCrossEntropyLoss({})
def get_current_value(self, preds: Tensor, target: Tensor):
# Perplexity can be represented as the exponential of the cross-entropy loss.
# https://towardsdatascience.com/perplexity-in-language-models-87a196019a94
# We can't use torchmetrics perplexity because it calculates normal cross-entropy
# loss as opposed to shifted cross entropy loss.
shifted_loss = self.next_token_softmax_cross_entropy_function(preds, target)
return torch.exp(shifted_loss)
# @register_metric("bleu", [TEXT], MAXIMIZE, RESPONSE)
# https://github.com/ludwig-ai/ludwig/issues/3953
class BLEUScoreMetric(BLEUScore, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
# @register_metric("rouge", [TEXT], MAXIMIZE, RESPONSE)
# https://github.com/ludwig-ai/ludwig/issues/3953
class ROUGEScoreMetric(ROUGEScore, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
# @register_metric("word_error_rate", [TEXT], MINIMIZE, RESPONSE)
# https://github.com/ludwig-ai/ludwig/issues/3953
class WordErrorRateMetric(WordErrorRate, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
# @register_metric("char_error_rate", [TEXT], MINIMIZE, RESPONSE)
# https://github.com/ludwig-ai/ludwig/issues/3953
class CharErrorRateMetric(CharErrorRate, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
@register_metric(ACCURACY, [BINARY], MAXIMIZE, PREDICTIONS)
class Accuracy(BinaryAccuracy, LudwigMetric):
"""Binary accuracy metric."""
def __init__(self, **kwargs):
super().__init__()
@register_metric(ACCURACY, [CATEGORY, CATEGORY_DISTRIBUTION], MAXIMIZE, PREDICTIONS)
class CategoryAccuracy(MulticlassAccuracy, LudwigMetric):
def __init__(self, num_classes: int, **kwargs):
super().__init__(num_classes=num_classes)
def update(self, preds: Tensor, target: Tensor) -> None:
if len(target.shape) > 1:
target = torch.argmax(target, dim=1)
super().update(preds, target.type(torch.long))
@register_metric(ACCURACY_MICRO, [CATEGORY, CATEGORY_DISTRIBUTION], MAXIMIZE, PREDICTIONS)
class CategoryAccuracyMicro(MulticlassAccuracy, LudwigMetric):
def __init__(self, num_classes: int, **kwargs):
super().__init__(num_classes=num_classes, average="micro")
def update(self, preds: Tensor, target: Tensor) -> None:
if len(target.shape) > 1:
target = torch.argmax(target, dim=1)
super().update(preds, target.type(torch.long))
@register_metric(HITS_AT_K, [CATEGORY, CATEGORY_DISTRIBUTION], MAXIMIZE, LOGITS)
class HitsAtKMetric(MulticlassAccuracy, LudwigMetric):
def __init__(self, num_classes: int, top_k: int, **kwargs):
super().__init__(num_classes=num_classes, top_k=top_k, **kwargs)
def update(self, preds: Tensor, target: Tensor) -> None:
if len(target.shape) > 1:
target = torch.argmax(target, dim=1)
super().update(preds, target.type(torch.long))
@classmethod
def can_report(cls, feature: "OutputFeature") -> bool: # noqa: F821
return feature.num_classes > feature.top_k
@register_metric(MEAN_ABSOLUTE_ERROR, [NUMBER, VECTOR, TIMESERIES], MINIMIZE, PREDICTIONS)
class MAEMetric(MeanAbsoluteError, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
def update(self, preds: Tensor, target: Tensor) -> None:
super().update(preds.detach(), target)
@register_metric(MEAN_SQUARED_ERROR, [NUMBER, VECTOR, TIMESERIES], MINIMIZE, PREDICTIONS)
class MSEMetric(MeanSquaredError, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
def update(self, preds: Tensor, target: Tensor) -> None:
super().update(preds, target)
@register_metric(MEAN_ABSOLUTE_PERCENTAGE_ERROR, [NUMBER, VECTOR, TIMESERIES], MINIMIZE, PREDICTIONS)
class MAPEMetric(MeanAbsolutePercentageError, LudwigMetric):
def __init__(self, **kwargs):
super().__init__()
def update(self, preds: Tensor, target: Tensor) -> None:
super().update(preds, target)
@register_metric(JACCARD, [SET], MAXIMIZE, PROBABILITIES)
class JaccardMetric(MeanMetric):
def __init__(self, threshold: float = 0.5, **kwargs):
super().__init__()
self.threshold = threshold
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
# notation: b is batch size and nc is number of unique elements in the set
# preds: shape [b, nc] probabilities for each class
# target: shape [b, nc] bit-mapped set representation
preds = torch.greater_equal(preds, self.threshold) # now bit-mapped set
target = target.type(torch.bool)
intersection = torch.sum(torch.logical_and(target, preds).type(torch.float32), dim=-1)
union = torch.sum(torch.logical_or(target, preds).type(torch.float32), dim=-1)
return intersection / union # shape [b]
@register_metric(HUBER, [NUMBER, VECTOR, TIMESERIES], MINIMIZE, PREDICTIONS)
class HuberMetric(LossMetric):
def __init__(
self,
config: HuberLossConfig,
**kwargs,
):
super().__init__()
self.loss_function = HuberLoss(config=config)
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
return self.loss_function(preds, target)
@register_metric(CORN, [CATEGORY], MINIMIZE, PREDICTIONS)
class CORNMetric(LossMetric):
def __init__(
self,
config: CORNLossConfig,
**kwargs,
):
super().__init__()
self.loss_function = CORNLoss(config=config)
def get_current_value(self, preds: Tensor, target: Tensor) -> Tensor:
return self.loss_function(preds, target)
def get_metric_cls(metric_name: str) -> type[LudwigMetric]:
return get_metric_registry()[metric_name]
def get_improved_fn(metric: str) -> Callable:
if get_metric_objective(metric) == MINIMIZE:
return lambda x, y: x < y
else:
return lambda x, y: x > y
def get_initial_validation_value(metric: str) -> float:
# Use finite floats instead of inf/-inf so that training_progress.json
# is valid JSON (RFC 8259). sys.float_info.max (~1.8e308) is larger than
# any real metric value, so comparison semantics are identical.
if get_metric_objective(metric) == MINIMIZE:
return sys.float_info.max
else:
return -sys.float_info.max
def get_best_function(metric: str) -> Callable:
if get_metric_objective(metric) == MINIMIZE:
return min
else:
return max
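# Illustrative sketch (not part of the Ludwig source): how the three helpers above
# can cooperate when tracking the best validation value. The OBJECTIVES dict and the
# names improved_fn, initial_value and track_best are hypothetical stand-ins for the
# metric objective registry and for get_improved_fn / get_initial_validation_value.

```python
import json
import sys

# Hypothetical stand-in for the metric objective registry (assumption).
MINIMIZE, MAXIMIZE = "minimize", "maximize"
OBJECTIVES = {"loss": MINIMIZE, "accuracy": MAXIMIZE}


def improved_fn(metric):
    """Return a comparison that is True when x is better than y for this metric."""
    if OBJECTIVES[metric] == MINIMIZE:
        return lambda x, y: x < y
    return lambda x, y: x > y


def initial_value(metric):
    """Worst possible finite starting value, so the first real value always wins."""
    if OBJECTIVES[metric] == MINIMIZE:
        return sys.float_info.max
    return -sys.float_info.max


def track_best(metric, history):
    """Scan a list of validation values and keep the best one seen."""
    best = initial_value(metric)
    is_better = improved_fn(metric)
    for value in history:
        if is_better(value, best):
            best = value
    return best


best_loss = track_best("loss", [0.9, 0.7, 0.8])
best_accuracy = track_best("accuracy", [0.5, 0.8, 0.6])

# Unlike float("inf"), the finite sentinel survives strict JSON serialization.
serialized = json.dumps(initial_value("loss"), allow_nan=False)
```

# The finite sentinel is the design point: json.dumps(float("inf"), allow_nan=False)
# would raise ValueError, whereas sys.float_info.max serializes as an ordinary number.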
================================================
FILE: ludwig/modules/metric_registry.py
================================================
from typing import Literal, TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import LOGITS, MAXIMIZE, MINIMIZE, PREDICTIONS, PROBABILITIES, RESPONSE
from ludwig.utils.registry import Registry
if TYPE_CHECKING:
from ludwig.modules.metric_modules import LudwigMetric
metric_feature_type_registry = Registry()
metric_registry = Registry()
metric_objective_registry = Registry()
metric_tensor_input_registry = Registry()
def register_metric(
name: str,
feature_types: str | list[str],
objective: Literal[MINIMIZE, MAXIMIZE],
output_feature_tensor_name: Literal[PREDICTIONS, PROBABILITIES, LOGITS],
):
"""Registers a metric class.
Args:
name: The name of the metric. Used in metric reporting and in the config.
feature_types: The feature types that this metric can be used with.
objective: The objective of the metric. Either MINIMIZE or MAXIMIZE.
output_feature_tensor_name: Name of the tensor from output_feature::predictions() that should be used as input.
For example: PREDICTIONS would be used for accuracy metrics while LOGITS would be used for loss metrics.
"""
if isinstance(feature_types, str):
feature_types = [feature_types]
def wrap(cls):
for feature_type in feature_types:
feature_registry = metric_feature_type_registry.get(feature_type, {})
feature_registry[name] = cls
metric_feature_type_registry[feature_type] = feature_registry
metric_registry[name] = cls
metric_objective_registry[name] = objective
metric_tensor_input_registry[name] = output_feature_tensor_name
return cls
return wrap
def get_metric_classes(feature_type: str) -> dict[str, "LudwigMetric"]:
return metric_feature_type_registry[feature_type]
def get_metric_cls(feature_type: str, name: str) -> "LudwigMetric":
return metric_feature_type_registry[feature_type][name]
@DeveloperAPI
def get_metric_feature_type_registry() -> Registry:
return metric_feature_type_registry
@DeveloperAPI
def get_metric_registry() -> Registry:
return metric_registry
@DeveloperAPI
def get_metric(metric_name: str) -> "LudwigMetric": # noqa
return get_metric_registry()[metric_name]
@DeveloperAPI
def get_metrics_for_type(feature_type: str) -> dict[str, "LudwigMetric"]: # noqa
return get_metric_feature_type_registry()[feature_type]
@DeveloperAPI
def get_metric_names_for_type(feature_type: str) -> list[str]:
return sorted(list(get_metric_feature_type_registry()[feature_type].keys()))
@DeveloperAPI
def get_metric_objective(metric_name: str) -> Literal[MINIMIZE, MAXIMIZE]:
return metric_objective_registry[metric_name]
@DeveloperAPI
def get_metric_tensor_input(metric_name: str) -> Literal[PREDICTIONS, PROBABILITIES, LOGITS, RESPONSE]:
return metric_tensor_input_registry[metric_name]
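# Illustrative sketch (not part of the Ludwig source): the decorator-based registration
# pattern used by register_metric above, reduced to plain dicts. Ludwig's Registry class
# is replaced by dict/defaultdict here, and ToyAccuracy / ToyLoss are hypothetical
# classes, so this only approximates the real behaviour.

```python
from collections import defaultdict

# Plain-dict stand-ins for the Registry instances (assumption).
metric_registry = {}
metric_objective_registry = {}
metric_feature_type_registry = defaultdict(dict)


def register_metric(name, feature_types, objective):
    """Register a class under `name` for each feature type, recording its objective."""
    if isinstance(feature_types, str):
        feature_types = [feature_types]

    def wrap(cls):
        for feature_type in feature_types:
            metric_feature_type_registry[feature_type][name] = cls
        metric_registry[name] = cls
        metric_objective_registry[name] = objective
        return cls  # the class itself is returned unchanged

    return wrap


@register_metric("toy_accuracy", ["binary", "category"], "maximize")
class ToyAccuracy:
    pass


@register_metric("toy_loss", "binary", "minimize")
class ToyLoss:
    pass


# Mirrors get_metric_names_for_type: sorted names registered for one feature type.
binary_metric_names = sorted(metric_feature_type_registry["binary"])
```

# Because wrap returns cls unchanged, a decorated class can still be imported and
# instantiated directly; registration is purely a side effect of import time.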
================================================
FILE: ludwig/modules/mlp_mixer_modules.py
================================================
# Copyright (c) 2021 Linux Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import torch
import torch.nn as nn
from ludwig.utils.torch_utils import LudwigModule
class MLP(LudwigModule):
def __init__(
self,
in_features: int | tuple[int],
hidden_size: int,
out_features: int | tuple[int] = None,
dropout: float = 0.0,
):
super().__init__()
out_features = out_features or in_features
self._input_shape = in_features
self._output_shape = out_features
self.linear1 = nn.Linear(in_features=in_features, out_features=hidden_size)
self.linear2 = nn.Linear(in_features=hidden_size, out_features=out_features)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
def forward(self, inputs, **kwargs):
hidden = self.dropout1(nn.functional.gelu(self.linear1(inputs)))
return self.dropout2(self.linear2(hidden))
@property
def input_shape(self) -> torch.Size:
return torch.Size([self._input_shape])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self._output_shape])
class MixerBlock(LudwigModule):
def __init__(self, embed_size: int, n_patches: int, token_dim: int, channel_dim: int, dropout: float = 0.0):
super().__init__()
self._input_shape = (n_patches, embed_size)
self._output_shape = (n_patches, embed_size)
self.mlp1 = MLP(in_features=n_patches, hidden_size=token_dim, dropout=dropout)
self.mlp2 = MLP(in_features=embed_size, hidden_size=channel_dim, dropout=dropout)
self.layernorm1 = nn.LayerNorm(normalized_shape=embed_size)
self.layernorm2 = nn.LayerNorm(normalized_shape=embed_size)
def forward(self, inputs: torch.Tensor, **kwargs):
assert inputs.shape[1:] == self.input_shape
hidden = inputs
hidden = self.layernorm1(hidden).transpose(1, 2)
hidden = self.mlp1(hidden).transpose(1, 2)
mid = hidden + inputs
hidden = self.layernorm2(mid)
hidden = self.mlp2(hidden)
output = hidden + mid
assert output.shape[1:] == self.output_shape
return output
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
@property
def output_shape(self) -> torch.Size:
return torch.Size(self._output_shape)
class MLPMixer(LudwigModule):
"""MLPMixer.
Implements
MLP-Mixer: An all-MLP Architecture for Vision
https://arxiv.org/abs/2105.01601
"""
def __init__(
self,
img_height: int,
img_width: int,
in_channels: int,
patch_size: int = 16,
embed_size: int = 512,
token_size: int = 2048,
channel_dim: int = 256,
num_layers: int = 8,
dropout: float = 0.0,
avg_pool: bool = True,
):
super().__init__()
assert (img_height % patch_size == 0) and (img_width % patch_size == 0)
self._input_shape = (in_channels, img_height, img_width)
n_patches = int(img_height * img_width / (patch_size**2))
self.patch_conv = nn.Conv2d(
in_channels=in_channels, out_channels=embed_size, kernel_size=patch_size, stride=patch_size
)
self.mixer_blocks = nn.ModuleList(
[
MixerBlock(
embed_size=embed_size,
n_patches=n_patches,
token_dim=token_size,
channel_dim=channel_dim,
dropout=dropout,
)
for _ in range(num_layers)
]
)
self.layer_norm = nn.LayerNorm(normalized_shape=embed_size)
self.avg_pool = avg_pool
if self.avg_pool:
self._output_shape = torch.Size((embed_size,))
else:
self._output_shape = torch.Size((n_patches, embed_size))
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
assert inputs.shape[1:] == self.input_shape
hidden = self.patch_conv(inputs)
hidden = hidden.flatten(2).transpose(1, 2)
for mixer_block in self.mixer_blocks:
hidden = mixer_block(hidden)
hidden = self.layer_norm(hidden)
if self.avg_pool:
hidden = torch.mean(hidden, dim=1)
assert hidden.shape[1:] == self.output_shape
return hidden
@property
def input_shape(self) -> torch.Size:
return torch.Size(self._input_shape)
@property
def output_shape(self) -> torch.Size:
return self._output_shape
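# Illustrative sketch (not part of the Ludwig source): the shape bookkeeping performed
# by MLPMixer above, written as plain integer arithmetic so it runs without torch.
# The function name mixer_shapes and the dict keys are hypothetical; batch dimensions
# are omitted, matching the input_shape / output_shape properties.

```python
def mixer_shapes(img_height, img_width, in_channels, patch_size=16, embed_size=512, avg_pool=True):
    """Return the per-sample tensor shapes at each stage of the mixer forward pass."""
    if img_height % patch_size or img_width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h = img_height // patch_size
    grid_w = img_width // patch_size
    n_patches = grid_h * grid_w
    return {
        # what forward() expects after the batch dimension
        "input": (in_channels, img_height, img_width),
        # Conv2d with kernel_size == stride == patch_size yields one vector per patch
        "after_patch_conv": (embed_size, grid_h, grid_w),
        # flatten(2).transpose(1, 2) turns the grid into a sequence of patch tokens
        "tokens": (n_patches, embed_size),
        # mean over the patch dimension when avg_pool is set
        "output": (embed_size,) if avg_pool else (n_patches, embed_size),
    }


shapes = mixer_shapes(224, 224, 3)
unpooled = mixer_shapes(32, 64, 1, patch_size=16, embed_size=8, avg_pool=False)
```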
================================================
FILE: ludwig/modules/normalization_modules.py
================================================
import logging
import numpy as np
import torch
from torch.nn import BatchNorm1d, BatchNorm2d, LayerNorm, Module
from ludwig.utils.torch_utils import LudwigModule
logger = logging.getLogger(__name__)
# implementation adapted from https://github.com/dreamquark-ai/tabnet
class GhostBatchNormalization(LudwigModule):
def __init__(
self, num_features: int, momentum: float = 0.05, epsilon: float = 1e-3, virtual_batch_size: int | None = 128
):
super().__init__()
self.num_features = num_features
self.virtual_batch_size = virtual_batch_size
self.bn = torch.nn.BatchNorm1d(num_features, momentum=momentum, eps=epsilon)
def forward(self, inputs):
batch_size = inputs.shape[0]
if self.training and self.virtual_batch_size:
splits = inputs.chunk(int(np.ceil(batch_size / self.virtual_batch_size)), 0)
if batch_size % self.virtual_batch_size == 1:
# Skip batch normalization for the last chunk if it is size 1.
logger.warning(
f"Virtual batch size `{self.virtual_batch_size}` is not a factor of the batch size `{batch_size}`, "
"resulting in a chunk of size 1. Skipping batch normalization for the last chunk of size 1."
)
if batch_size == 1:
logger.warning(
"Batch size is 1, but batch normalization requires batch size >= 2. Skipping batch normalization. "
"Make sure to set `batch_size` to a value greater than 1."
)
# We temporarily set the batch_norm module to eval mode as we can't compute the running statistics
# when the batch size is 1.
self.bn.eval()
splits_with_bn = [self.bn(x) if x.shape[0] >= 1 else x for x in splits]
self.bn.train()
else:
splits_with_bn = [self.bn(x) if x.shape[0] > 1 else x for x in splits]
return torch.cat(splits_with_bn, 0)
if batch_size != 1 or not self.training:
return self.bn(inputs)
return inputs
@property
def moving_mean(self) -> torch.Tensor:
return self.bn.running_mean
@property
def moving_variance(self) -> torch.Tensor:
return self.bn.running_var
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.num_features])
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.num_features])
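# Illustrative sketch (not part of the Ludwig source): the chunk sizes that the
# inputs.chunk(...) call in GhostBatchNormalization.forward would produce, assuming
# torch.chunk's documented behaviour of using a split size of ceil(n / chunks) with a
# smaller final chunk. chunk_sizes and ghost_chunk_sizes are hypothetical helper names.

```python
import math


def chunk_sizes(batch_size, chunks):
    """Sizes of the pieces when batch_size rows are split into at most `chunks` chunks."""
    split_size = math.ceil(batch_size / chunks)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(split_size, remaining))
        remaining -= split_size
    return sizes


def ghost_chunk_sizes(batch_size, virtual_batch_size):
    """Number of chunks is ceil(batch_size / virtual_batch_size), as in forward()."""
    return chunk_sizes(batch_size, math.ceil(batch_size / virtual_batch_size))


even = ghost_chunk_sizes(256, 128)
uneven = ghost_chunk_sizes(257, 128)
small = ghost_chunk_sizes(3, 2)
```

# Under that assumption the virtual batches are balanced rather than capped exactly at
# virtual_batch_size, so a trailing chunk of size 1 shows up for small batches such as
# 3 rows with a virtual batch size of 2, but not for 257 rows with 128.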
class BatchNorm1dOrIdentity(BatchNorm1d):
"""BatchNorm1d or Identity layer if the batch_size is 1.
Workaround for: https://github.com/pytorch/pytorch/issues/4534
"""
def forward(self, input: torch.Tensor) -> torch.Tensor:
if input.shape[0] == 1:
logger.warning(
"Batch size is 1, but batch normalization requires batch size >= 2. Skipping batch normalization. "
"Make sure to set `batch_size` to a value greater than 1."
)
return input
return super().forward(input)
class BatchNorm2dOrIdentity(BatchNorm2d):
"""BatchNorm2d or Identity layer if the batch_size is 1.
Workaround for: https://github.com/pytorch/pytorch/issues/4534
"""
def forward(self, input: torch.Tensor) -> torch.Tensor:
if input.shape[0] == 1:
logger.warning(
"Batch size is 1, but batch normalization requires batch size >= 2. Skipping batch normalization. "
"Make sure to set `batch_size` to a value greater than 1."
)
return input
return super().forward(input)
norm_registry = {
"batch_1d": BatchNorm1dOrIdentity,
"batch_2d": BatchNorm2dOrIdentity,
"layer": LayerNorm,
"ghost": GhostBatchNormalization,
}
def create_norm_layer(norm: str, input_rank: int, num_features: int, **norm_params) -> Module:
if norm == "batch":
# We use a different batch norm depending on the input_rank.
# TODO(travis): consider moving this behind a general BatchNorm interface to avoid this kludge.
if input_rank not in {2, 3}:
raise ValueError(f"`input_rank` parameter expected to be either 2 or 3, but found {input_rank}.")
norm = f"{norm}_{input_rank - 1}d"
norm_cls = norm_registry.get(norm)
if norm_cls is None:
raise ValueError(
f"Unsupported value for `norm` param: {norm}. Supported values are: {list(norm_registry.keys())}"
)
return norm_cls(num_features, **norm_params)
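# Illustrative sketch (not part of the Ludwig source): the name resolution that
# create_norm_layer performs before instantiating a layer, with the registry reduced
# to a tuple of names so it runs without torch. resolve_norm_name and SUPPORTED_NORMS
# are hypothetical; the sketch assumes an unsupported input_rank is meant to be an error.

```python
SUPPORTED_NORMS = ("batch_1d", "batch_2d", "layer", "ghost")


def resolve_norm_name(norm, input_rank):
    """Map the generic "batch" alias to a rank-specific key, then validate the key."""
    if norm == "batch":
        if input_rank not in {2, 3}:
            raise ValueError(f"`input_rank` expected to be either 2 or 3, but found {input_rank}.")
        # rank 2 ([batch, features]) -> batch_1d; rank 3 -> batch_2d
        norm = f"batch_{input_rank - 1}d"
    if norm not in SUPPORTED_NORMS:
        raise ValueError(f"Unsupported value for `norm` param: {norm}. Supported values are: {list(SUPPORTED_NORMS)}")
    return norm


rank2 = resolve_norm_name("batch", 2)
rank3 = resolve_norm_name("batch", 3)
layer = resolve_norm_name("layer", 2)
```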
================================================
FILE: ludwig/modules/optimization_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import dataclasses
from typing import Optional, TYPE_CHECKING
import torch
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import LudwigModule
if TYPE_CHECKING:
from ludwig.schema.optimizers import BaseOptimizerConfig, GradientClippingConfig
def create_clipper(gradient_clipping_config: Optional["GradientClippingConfig"]):
"""Utility function that will convert a None-type gradient clipping config to the correct form."""
from ludwig.schema.optimizers import GradientClippingConfig
if isinstance(gradient_clipping_config, GradientClippingConfig):
return gradient_clipping_config
# Return default config if provided value is None:
return GradientClippingConfig()
def get_optimizer_class_and_kwargs(
optimizer_config: "BaseOptimizerConfig", learning_rate: float
) -> tuple[type[torch.optim.Optimizer], dict]:
"""Returns the optimizer class and kwargs for the optimizer.
:return: Tuple of optimizer class and kwargs for the optimizer.
"""
from ludwig.schema.optimizers import optimizer_registry
# Get the corresponding torch optimizer class for the given config:
optimizer_cls = get_from_registry(optimizer_config.type.lower(), optimizer_registry)[0]
# Create a dict of parameters to be passed to torch (i.e. everything except `type`):
if dataclasses.is_dataclass(optimizer_config):
config_dict = dataclasses.asdict(optimizer_config)
elif hasattr(optimizer_config, "to_dict"):
config_dict = optimizer_config.to_dict()
else:
config_dict = vars(optimizer_config)
cls_kwargs = {field: value for field, value in config_dict.items() if field != "type"}
cls_kwargs["lr"] = learning_rate
return optimizer_cls, cls_kwargs
def create_optimizer(
model: LudwigModule,
optimizer_config: "BaseOptimizerConfig",
learning_rate: float,
) -> torch.optim.Optimizer:
"""Returns a ready-to-use torch optimizer instance based on the given optimizer config.
:param model: Underlying Ludwig model
:param learning_rate: Initial learning rate for the optimizer
:param optimizer_config: Instance of `ludwig.schema.optimizers.BaseOptimizerConfig`.
:return: Initialized instance of a torch optimizer.
"""
# Make sure the optimizer is compatible with the available resources:
if (optimizer_config.is_paged or optimizer_config.is_8bit) and (
not torch.cuda.is_available() or torch.cuda.device_count() == 0
):
raise ValueError(
"Cannot use a paged or 8-bit optimizer on a non-GPU machine. "
"Please use a different optimizer or run on a machine with a GPU."
)
optimizer_cls, optimizer_kwargs = get_optimizer_class_and_kwargs(optimizer_config, learning_rate)
return optimizer_cls(model.parameters(), **optimizer_kwargs)
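# Illustrative sketch (not part of the Ludwig source): how get_optimizer_class_and_kwargs
# turns a config object into constructor kwargs. ToyAdamConfig is a hypothetical
# dataclass standing in for a real optimizer config, and config_to_kwargs is a
# hypothetical name; no torch optimizer is created here.

```python
import dataclasses


@dataclasses.dataclass
class ToyAdamConfig:  # hypothetical config, for illustration only
    type: str = "adam"
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.0


def config_to_kwargs(config, learning_rate):
    """Drop the `type` discriminator and inject the learning rate as `lr`."""
    if dataclasses.is_dataclass(config):
        config_dict = dataclasses.asdict(config)
    elif hasattr(config, "to_dict"):
        config_dict = config.to_dict()
    else:
        config_dict = vars(config)
    kwargs = {field: value for field, value in config_dict.items() if field != "type"}
    kwargs["lr"] = learning_rate
    return kwargs


kwargs = config_to_kwargs(ToyAdamConfig(), learning_rate=1e-3)
```

# The explicit learning_rate argument always wins: any `lr` field already present in
# the config dict is overwritten by the final assignment.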
================================================
FILE: ludwig/modules/recurrent_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from torch.nn import GRU, LSTM, RNN
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import LudwigModule
logger = logging.getLogger(__name__)
rnn_layers_registry = {
"rnn": RNN,
"gru": GRU,
"lstm": LSTM,
}
class RecurrentStack(LudwigModule):
def __init__(
self,
input_size: int = None,
hidden_size: int = 256,
cell_type: str = "rnn",
max_sequence_length: int | None = None,
num_layers: int = 1,
bidirectional: bool = False,
use_bias: bool = True,
dropout: float = 0.0,
**kwargs,
):
super().__init__()
self.supports_masking = True
self.input_size = input_size # api doc: H_in
self.hidden_size = hidden_size # api doc: H_out
self.max_sequence_length = max_sequence_length # api doc: L (sequence length)
rnn_layer_class = get_from_registry(cell_type, rnn_layers_registry)
rnn_params = {"num_layers": num_layers, "bias": use_bias, "dropout": dropout, "bidirectional": bidirectional}
# Delegate recurrent params to PyTorch's RNN/GRU/LSTM implementations.
self.layers = rnn_layer_class(input_size, hidden_size, batch_first=True, **rnn_params)
@property
def input_shape(self) -> torch.Size:
if self.max_sequence_length:
return torch.Size([self.max_sequence_length, self.input_size])
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
hidden_size = self.hidden_size * (2 if self.layers.bidirectional else 1)
if self.max_sequence_length:
return torch.Size([self.max_sequence_length, hidden_size])
return torch.Size([hidden_size])
def forward(self, inputs: torch.Tensor, mask=None):
hidden, final_state = self.layers(inputs)
if isinstance(final_state, tuple):
# lstm cell type
final_state = final_state[0][-1], final_state[1][-1]
else:
# rnn or gru cell type
final_state = final_state[-1]
return hidden, final_state
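# Illustrative sketch (not part of the Ludwig source): which slice of the stacked final
# state the [-1] indexing in RecurrentStack.forward selects. It assumes the layout
# described in the PyTorch RNN/GRU/LSTM documentation, where h_n has
# num_layers * num_directions entries ordered layer by layer, forward before backward.
# final_state_labels is a hypothetical helper name.

```python
def final_state_labels(num_layers, bidirectional):
    """Label each entry along the first axis of h_n as (layer index, direction)."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, direction) for layer in range(num_layers) for direction in directions]


def recurrent_output_width(hidden_size, bidirectional):
    """Last dimension of the per-step output, as in RecurrentStack.output_shape."""
    return hidden_size * (2 if bidirectional else 1)


unidirectional_pick = final_state_labels(num_layers=2, bidirectional=False)[-1]
bidirectional_pick = final_state_labels(num_layers=2, bidirectional=True)[-1]
width = recurrent_output_width(256, bidirectional=True)
```

# Under that layout, a bidirectional stack returns the last layer's backward state as
# final_state, with width hidden_size, while the per-step output is 2 * hidden_size wide.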
================================================
FILE: ludwig/modules/reduction_modules.py
================================================
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import torch
from ludwig.modules.attention_modules import FeedForwardAttentionReducer
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import LudwigModule, sequence_length_3D
logger = logging.getLogger(__name__)
class SequenceReducer(LudwigModule):
"""Reduces the sequence dimension of an input tensor according to the specified reduce_mode. Any additional
kwargs are passed on to the reduce mode's constructor. If using reduce_mode=="attention", the input_size kwarg
must also be specified.
A sequence is a tensor of 2 or more dimensions, where the shape is [batch size x sequence length x ...].
:param reduce_mode: The reduction mode, one of {"last", "sum", "mean", "max", "concat", "attention", "none"}
:param max_sequence_length: The maximum sequence length. Only used for computation of shapes - inputs passed
at runtime may have a smaller sequence length.
:param encoding_size: The size of each sequence element/embedding vector, or None if input is a sequence of scalars.
"""
def __init__(self, reduce_mode: str = None, max_sequence_length: int = 256, encoding_size: int = None, **kwargs):
super().__init__()
# save as private variable for debugging
self._reduce_mode = reduce_mode
self._max_sequence_length = max_sequence_length
self._encoding_size = encoding_size
# If embedding size specified and mode is attention, use embedding size as attention module input size
# unless the input_size kwarg is provided.
if reduce_mode == "attention" and encoding_size and "input_size" not in kwargs:
kwargs["input_size"] = encoding_size
# use registry to find required reduction function
self._reduce_obj = get_from_registry(reduce_mode, reduce_mode_registry)(**kwargs)
def forward(self, inputs, mask=None):
"""Forward pass of reducer.
:param inputs: A tensor of 2 or more dimensions, where the shape is [batch size x sequence length x ...].
:param mask: A mask tensor of 2 dimensions [batch size x sequence length]. Not yet implemented.
:return: The input after applying the reduction operation to sequence dimension.
"""
return self._reduce_obj(inputs, mask=mask)
@property
def input_shape(self) -> torch.Size:
"""Returns size of the input tensor without the batch dimension."""
if self._encoding_size is None:
return torch.Size([self._max_sequence_length])
else:
return torch.Size([self._max_sequence_length, self._encoding_size])
@property
def output_shape(self) -> torch.Size:
"""Returns size of the output tensor without the batch dimension."""
input_shape = self.input_shape
if self._reduce_mode in {None, "none", "None"}:
return input_shape
elif self._reduce_mode == "concat":
if len(input_shape) > 1:
return input_shape[:-2] + (input_shape[-1] * input_shape[-2],)
return input_shape
else:
return input_shape[1:] # Reduce sequence dimension.
class ReduceLast(torch.nn.Module):
def forward(self, inputs, mask=None):
# inputs: [batch_size, seq_size, hidden_size]
batch_size = inputs.shape[0]
# gather the correct outputs from the RNN outputs (the outputs after sequence_length are all 0s)
# todo: review for generality
sequence_length = sequence_length_3D(inputs) - 1
sequence_length[sequence_length < 0] = 0
gathered = inputs[torch.arange(batch_size), sequence_length.type(torch.int64)]
return gathered
class ReduceSum(torch.nn.Module):
def forward(self, inputs, mask=None):
return torch.sum(inputs, dim=1)
class ReduceMean(torch.nn.Module):
def forward(self, inputs, mask=None):
return torch.mean(inputs, dim=1)
class ReduceMax(torch.nn.Module):
def forward(self, inputs, mask=None):
return torch.amax(inputs, dim=1)
class ReduceConcat(torch.nn.Module):
def forward(self, inputs, mask=None):
if inputs.dim() > 2:
return inputs.reshape(-1, inputs.shape[-1] * inputs.shape[-2])
return inputs
class ReduceNone(torch.nn.Module):
def forward(self, inputs, mask=None):
return inputs
reduce_mode_registry = {
"last": ReduceLast,
"sum": ReduceSum,
"mean": ReduceMean,
"avg": ReduceMean,
"max": ReduceMax,
"concat": ReduceConcat,
"attention": FeedForwardAttentionReducer,
# TODO: Simplify this.
"none": ReduceNone,
"None": ReduceNone,
None: ReduceNone,
}
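# Illustrative sketch (not part of the Ludwig source): the shape rules implemented by
# SequenceReducer.input_shape / output_shape, using plain tuples instead of torch.Size
# so it runs without torch. reduced_shape is a hypothetical helper name.

```python
def reduced_shape(reduce_mode, max_sequence_length, encoding_size=None):
    """Per-sample output shape after reducing the sequence dimension."""
    if encoding_size is None:
        input_shape = (max_sequence_length,)
    else:
        input_shape = (max_sequence_length, encoding_size)

    if reduce_mode in {None, "none", "None"}:
        return input_shape  # nothing is reduced
    if reduce_mode == "concat":
        if len(input_shape) > 1:
            # sequence and encoding dimensions are flattened into one
            return input_shape[:-2] + (input_shape[-1] * input_shape[-2],)
        return input_shape
    return input_shape[1:]  # last / sum / mean / max / attention drop the sequence dim


summed = reduced_shape("sum", 256, 32)
concatenated = reduced_shape("concat", 256, 32)
untouched = reduced_shape(None, 256, 32)
scalar_sequence = reduced_shape("mean", 256)
```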
================================================
FILE: ludwig/modules/tabnet_modules.py
================================================
import torch
import torch.nn as nn
from ludwig.modules.normalization_modules import GhostBatchNormalization
from ludwig.utils.entmax import Entmax15, EntmaxBisect, Sparsemax
from ludwig.utils.torch_utils import LudwigModule
class TabNet(LudwigModule):
def __init__(
self,
input_size: int,
size: int,
output_size: int,
num_steps: int = 1,
num_total_blocks: int = 4,
num_shared_blocks: int = 2,
relaxation_factor: float = 1.5,
bn_momentum: float = 0.3,
bn_epsilon: float = 1e-3,
bn_virtual_bs: int | None = None,
sparsity: float = 1e-5,
entmax_mode: str = "sparsemax",
entmax_alpha: float = 1.5,
):
"""TabNet outputs a vector of size output_size.
Args:
input_size: concatenated size of input feature encoder outputs
size: Embedding feature dimension
output_size: Output dimension for TabNet
num_steps: Total number of steps.
num_total_blocks: Total number of feature transformer blocks.
num_shared_blocks: Number of shared feature transformer blocks.
relaxation_factor: >1 will allow features to be used more than once.
bn_momentum: Batch normalization, momentum.
bn_epsilon: Batch normalization, epsilon.
bn_virtual_bs: Virtual batch size for ghost batch norm.
sparsity: Multiplier for the sparsity (entropy) loss; 0 disables it.
entmax_mode: Entmax is a sparse family of probability mapping which generalizes softmax and sparsemax.
entmax_mode controls the sparsity. One of {"sparsemax", "entmax15", "constant", "adaptive"}.
entmax_alpha: Must be a number between 1.0 and 2.0. If entmax_mode is "adaptive", entmax_alpha is used
as the initial value for the learnable parameter.
"""
super().__init__()
self.input_size = input_size
self.size = size
self.output_size = output_size
self.num_steps = num_steps
self.bn_virtual_bs = bn_virtual_bs
self.relaxation_factor = relaxation_factor
self.sparsity = torch.tensor(sparsity)
self.batch_norm = nn.BatchNorm1d(input_size, momentum=bn_momentum, eps=bn_epsilon)
kargs = {
"num_total_blocks": num_total_blocks,
"num_shared_blocks": num_shared_blocks,
"bn_momentum": bn_momentum,
"bn_epsilon": bn_epsilon,
"bn_virtual_bs": bn_virtual_bs,
}
# first feature transformer block is built first
# to get the shared blocks
self.feature_transforms = nn.ModuleList([FeatureTransformer(input_size, size + output_size, **kargs)])
self.attentive_transforms = nn.ModuleList([None])
for i in range(num_steps):
self.feature_transforms.append(
FeatureTransformer(
input_size,
size + output_size,
**kargs,
shared_fc_layers=self.feature_transforms[0].shared_fc_layers,
)
)
# attentive transformers take the `size`-dimensional slice of the
# feature transformer output and produce one mask value per input
# feature, so they map size -> input_size
self.attentive_transforms.append(
AttentiveTransformer(
size, input_size, bn_momentum, bn_epsilon, bn_virtual_bs, entmax_mode, entmax_alpha
)
)
self.final_projection = nn.Linear(output_size, output_size)
# Register tensors to be used in forward pass. This is needed in order to move these tensors
# to the correct device (GPU/CPU) during the forward pass.
self.register_buffer("out_accumulator", torch.zeros(output_size))
self.register_buffer("aggregated_mask", torch.zeros(input_size))
self.register_buffer("prior_scales", torch.ones(input_size))
def forward(self, features: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, list[torch.Tensor]]:
if features.dim() != 2:
raise ValueError(f"Expecting incoming tensor to be dim 2, instead dim={features.dim()}")
# shape notation
# i_s: input_size
# s: size
# o_s: output_size
# b_s: batch_size
batch_size = features.shape[0] # b_s
# Tile out_accumulator, aggregated_mask, and prior_scales to add batch dimension.
out_accumulator = torch.tile(self.out_accumulator, (batch_size, 1))
aggregated_mask = torch.tile(self.aggregated_mask, (batch_size, 1))
prior_scales = torch.tile(self.prior_scales, (batch_size, 1))
masks = []
total_entropy = 0.0
if batch_size != 1 or not self.training:
# Regular path: batch norm uses batch statistics in training and running statistics in eval.
features = self.batch_norm(features) # [b_s, i_s]
elif batch_size == 1:
# Training with a batch size of 1: batch statistics cannot be computed, so we temporarily
# set the batch_norm module to eval mode and normalize with the running statistics.
self.batch_norm.eval()
features = self.batch_norm(features) # [b_s, i_s]
self.batch_norm.train()
masked_features = features
x = self.feature_transforms[0](masked_features) # [b_s, s + o_s]
for step_i in range(1, self.num_steps + 1):
#########################
# Attentive Transformer #
#########################
# x in following is shape [b_s, s]
mask_values = self.attentive_transforms[step_i](x[:, self.output_size :], prior_scales) # [b_s, i_s]
# relaxation factor 1 forces the feature to be only used once
prior_scales = prior_scales * (self.relaxation_factor - mask_values) # [b_s, i_s]
# the entropy of the masks is penalized to encourage sparse
# feature selection
if self.sparsity.item() != 0.0:
total_entropy += (
torch.mean(torch.sum(-mask_values * torch.log(mask_values + 0.00001), dim=1)) / self.num_steps
)
masks.append(torch.unsqueeze(torch.unsqueeze(mask_values, 0), 3)) # [1, b_s, i_s, 1]
#######################
# Feature Transformer #
#######################
masked_features = torch.multiply(mask_values, features)
x = self.feature_transforms[step_i](masked_features) # [b_s, s + o_s]
# x in following is shape [b_s, o_s]
out = nn.functional.relu(x[:, : self.output_size]) # [b_s, o_s]
out_accumulator += out
# Aggregated masks are used for visualization of the
# feature importance attributes.
scale = torch.sum(out, dim=1, keepdim=True) / self.num_steps
aggregated_mask += mask_values * scale # [b_s, i_s]
final_output = self.final_projection(out_accumulator) # [b_s, o_s]
sparsity_loss = torch.multiply(self.sparsity, total_entropy)
self.update_loss("sparsity_loss", sparsity_loss)
return final_output, aggregated_mask, masks
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.output_size])
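# Illustrative sketch (not part of Ludwig): the two per-step updates in
# TabNet.forward, the entropy term and the prior-scale update, written with
# plain Python lists so the arithmetic is easy to follow. The function names
# are hypothetical; EPS matches the 0.00001 constant used in forward.

```python
import math

EPS = 1e-5  # same constant that forward adds inside the log


def mask_entropy(mask_rows):
    # Batch mean of -sum(m * log(m + EPS)) computed per row.
    per_row = [-sum(m * math.log(m + EPS) for m in row) for row in mask_rows]
    return sum(per_row) / len(per_row)


def update_prior_scales(prior_rows, mask_rows, relaxation_factor):
    # prior <- prior * (relaxation_factor - mask), element by element.
    return [
        [p * (relaxation_factor - m) for p, m in zip(prior_row, mask_row)]
        for prior_row, mask_row in zip(prior_rows, mask_rows)
    ]


one_hot_mask = [[1.0, 0.0, 0.0]]        # fully sparse selection
uniform_mask = [[1 / 3, 1 / 3, 1 / 3]]  # no sparsity at all
ones = [[1.0, 1.0, 1.0]]

sparse_entropy = mask_entropy(one_hot_mask)
dense_entropy = mask_entropy(uniform_mask)
strict_prior = update_prior_scales(ones, one_hot_mask, 1.0)
relaxed_prior = update_prior_scales(ones, one_hot_mask, 1.5)
```

# A one-hot mask has entropy close to 0 while a uniform mask over n features
# has entropy close to log(n), so adding the entropy to the loss favors sparse
# masks. With relaxation_factor 1.0 a fully selected feature gets prior scale
# 0 and cannot be selected again; larger values leave it partially available.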
class FeatureBlock(LudwigModule):
def __init__(
self,
input_size: int,
size: int,
apply_glu: bool = True,
bn_momentum: float = 0.1,
bn_epsilon: float = 1e-3,
bn_virtual_bs: int = None,
shared_fc_layer: LudwigModule = None,
):
super().__init__()
self.input_size = input_size
self.apply_glu = apply_glu
self.size = size
units = size * 2 if apply_glu else size
# Initialize fc_layer before assigning to shared layer for torchscript compatibility
self.fc_layer = nn.Linear(input_size, units, bias=False)
if shared_fc_layer is not None:
assert shared_fc_layer.weight.shape == self.fc_layer.weight.shape
self.fc_layer = shared_fc_layer
self.batch_norm = GhostBatchNormalization(
units, virtual_batch_size=bn_virtual_bs, momentum=bn_momentum, epsilon=bn_epsilon
)
def forward(self, inputs):
# shape notation
# i_s: input_size
# s: size
# u: units
# b_s: batch_size
# inputs shape [b_s, i_s]
hidden = self.fc_layer(inputs) # [b_s, u]
hidden = self.batch_norm(hidden) # [b_s, u]
if self.apply_glu:
hidden = nn.functional.glu(hidden, dim=-1) # [b_s, s]
return hidden # [b_s, s] (u == 2*s is halved by GLU; u == s otherwise)
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
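# Illustrative sketch (not part of Ludwig): a gated linear unit splits its
# input in half along the chosen dimension and returns first_half *
# sigmoid(second_half), which is why FeatureBlock doubles the number of units
# when apply_glu is set. The `glu` helper below is a hypothetical plain-Python
# version for a single vector.

```python
import math


def glu(values):
    # Split into the signal half and the gate half.
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of values")
    half = len(values) // 2
    signal, gate = values[:half], values[half:]
    # Each signal value is scaled by the sigmoid of its gate value.
    return [s / (1.0 + math.exp(-g)) for s, g in zip(signal, gate)]


gated = glu([1.0, 2.0, 0.0, 0.0])  # gates of 0 give sigmoid(0) == 0.5
```

# The output has half as many values as the input, so a linear layer with
# 2 * size units followed by GLU yields `size` values.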
class AttentiveTransformer(LudwigModule):
def __init__(
self,
input_size: int,
size: int,
bn_momentum: float = 0.1,
bn_epsilon: float = 1e-3,
bn_virtual_bs: int = None,
entmax_mode: str = "sparsemax",
entmax_alpha: float = 1.5,
):
super().__init__()
self.input_size = input_size
self.size = size
self.entmax_mode = entmax_mode
if entmax_mode == "adaptive":
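# A learnable alpha has to be an nn.Parameter: buffers are not returned by
# parameters(), so an optimizer would never update them.
self.trainable_alpha = nn.Parameter(torch.tensor(entmax_alpha))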
else:
self.trainable_alpha = entmax_alpha
if self.entmax_mode == "sparsemax":
self.entmax_module = Sparsemax()
elif self.entmax_mode == "entmax15":
self.entmax_module = Entmax15()
else:
self.entmax_module = EntmaxBisect(alpha=self.trainable_alpha)
self.feature_block = FeatureBlock(
input_size,
size,
bn_momentum=bn_momentum,
bn_epsilon=bn_epsilon,
bn_virtual_bs=bn_virtual_bs,
apply_glu=False,
)
def forward(self, inputs, prior_scales):
# shape notation
# i_s: input_size
# s: size
# b_s: batch_size
# inputs shape [b_s, i_s], prior_scales shape [b_s, s]
hidden = self.feature_block(inputs) # [b_s, s]
hidden = hidden * prior_scales # [b_s, s]
# removing the mean to try to avoid numerical instability
# https://github.com/tensorflow/addons/issues/2314
# https://github.com/tensorflow/tensorflow/pull/21183/files
# In (Arik and Pfister, 2019), they call the logits z.
# The mean(logits) can be subtracted from logits to make the algorithm
# more numerically stable. The instability in this algorithm comes mostly
# from the z_cumsum. Subtracting the mean will cause z_cumsum to be close
# to zero.
# hidden = hidden - tf.math.reduce_mean(hidden, axis=1)[:, tf.newaxis]
return self.entmax_module(hidden)
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.size])
# adapted and modified from:
# https://github.com/ostamand/tensorflow-tabnet/blob/master/tabnet/models/transformers.py
class FeatureTransformer(LudwigModule):
def __init__(
self,
input_size: int,
size: int,
shared_fc_layers: list | None = None,
num_total_blocks: int = 4,
num_shared_blocks: int = 2,
bn_momentum: float = 0.1,
bn_epsilon: float = 1e-3,
bn_virtual_bs: int = None,
):
super().__init__()
if shared_fc_layers is None:
shared_fc_layers = []
self.input_size = input_size
self.num_total_blocks = num_total_blocks
self.num_shared_blocks = num_shared_blocks
self.size = size
kwargs = {
"bn_momentum": bn_momentum,
"bn_epsilon": bn_epsilon,
"bn_virtual_bs": bn_virtual_bs,
}
# build blocks
self.blocks = nn.ModuleList()
for n in range(num_total_blocks):
# Ensure the sizes fed into FeatureBlock are correct regardless of presence of shared_fc_layer
if n == 0:
in_features = input_size
else:
in_features = size
if shared_fc_layers and n < len(shared_fc_layers):
self.blocks.append(FeatureBlock(in_features, size, **kwargs, shared_fc_layer=shared_fc_layers[n]))
else:
self.blocks.append(FeatureBlock(in_features, size, **kwargs))
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
# shape notation
# i_s: input_size
# s: size
# b_s: batch_size
# inputs shape [b_s, i_s]
hidden = self.blocks[0](inputs) # [b_s, s]
for n in range(1, self.num_total_blocks):
hidden = (self.blocks[n](hidden) + hidden) * (0.5**0.5) # [b_s, s]
return hidden # [b_s, s]
@property
def shared_fc_layers(self):
return [self.blocks[i].fc_layer for i in range(self.num_shared_blocks)]
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.input_size])
@property
def output_shape(self) -> torch.Size:
return torch.Size([self.size])
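# Illustrative sketch (not part of Ludwig): FeatureTransformer.forward adds
# each block's output to its input and scales the sum by sqrt(0.5). Adding two
# independent signals of equal variance doubles the variance, and the scaling
# brings it back to the original level. `residual_stack` is a hypothetical
# plain-Python version that takes arbitrary callables as blocks.

```python
SCALE = 0.5**0.5  # same constant as in FeatureTransformer.forward


def residual_stack(first_block, later_blocks, inputs):
    # The first block has no residual connection; the remaining ones do.
    hidden = first_block(inputs)
    for block in later_blocks:
        out = block(hidden)
        hidden = [(o + h) * SCALE for o, h in zip(out, hidden)]
    return hidden


def identity(values):
    return list(values)


# With identity blocks each residual step multiplies by 2 * sqrt(0.5).
result = residual_stack(identity, [identity, identity], [1.0, -3.0])
```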
================================================
FILE: ludwig/modules/training_hooks.py
================================================
import logging
from abc import ABC, abstractmethod
import torch
logger = logging.getLogger(__name__)
class TrainingHook(ABC):
"""A base class for training hooks in PyTorch.
This class provides a template for implementing custom training hooks
that can be activated, deactivated, and maintain a handle to the hook.
Attributes:
_hook_handle (Optional[torch.utils.hooks.RemovableHandle]): A handle to the
registered forward hook, initially set to None.
"""
def __init__(self, *args, **kwargs) -> None:
self._hook_handle = None
@abstractmethod
def hook_fn(self, module: torch.nn.Module, inputs: torch.Tensor, outputs: torch.Tensor) -> torch.Tensor:
"""Abstract method to be implemented by subclasses. This is the method that defines the custom behavior of
the training hook during a forward pass for the specified module.
Args:
module (nn.Module): The PyTorch module for which the hook is activated.
inputs (torch.Tensor): The input to the module during the forward pass.
outputs (torch.Tensor): The output from the module during the forward pass.
Returns:
torch.Tensor: The output tensor from the module.
Raises:
NotImplementedError: If the method is not implemented in a subclass.
"""
def activate_hook(self, module: torch.nn.Module) -> torch.nn.Module:
"""Activates the training hook for a given module.
Args:
module (nn.Module): The PyTorch module for which the hook is activated.
Returns:
nn.Module: The input module with the training hook activated.
"""
self._hook_handle = module.register_forward_hook(self.hook_fn)
return module
def deactivate_hook(self):
"""Deactivates and removes the training hook."""
if self._hook_handle is not None:
self._hook_handle.remove()
self._hook_handle = None
class NEFTuneHook(TrainingHook):
def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.neftune_noise_alpha = kwargs.get("neftune_noise_alpha")
def hook_fn(self, module: torch.nn.Module, input: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
"""Implements the NEFTune forward pass for the model using forward hooks. Note this works only for
torch.nn.Embedding layers. This method is slightly adapted from the original source code that can be found
here: https://github.com/neelsjain/NEFTune.
The input tensor is ignored since the noise is added to the output of the embedding layer.
Returns:
torch.Tensor: The output tensor from the module.
"""
if module.training:
dims = torch.tensor(output.size(1) * output.size(2))
mag_norm = module.neftune_noise_alpha / torch.sqrt(dims)
output = output + torch.zeros_like(output).uniform_(-mag_norm, mag_norm)
return output
def activate_hook(self, module: torch.nn.Module) -> torch.nn.Module:
"""Activates NEFTune as presented in this code and paper:
Code: https://github.com/neelsjain/NEFTune
Paper: https://arxiv.org/abs/2310.05914
Args:
module (nn.Module): The PyTorch module for which the hook is activated.
Returns:
nn.Module: The input module with the training hook activated.
"""
from peft import PeftModel
if isinstance(module, PeftModel):
embeddings = module.base_model.model.get_input_embeddings()
else:
embeddings = module.get_input_embeddings()
embeddings.neftune_noise_alpha = self.neftune_noise_alpha
self._hook_handle = embeddings.register_forward_hook(self.hook_fn)
return module
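# Illustrative sketch (not part of Ludwig): the noise that NEFTuneHook.hook_fn
# adds is uniform in [-bound, bound] with bound = alpha / sqrt(seq_len *
# hidden_size). The plain-Python helpers below are hypothetical and operate on
# one [seq_len][hidden_size] embedding matrix instead of a batched tensor.

```python
import math
import random


def neftune_noise_bound(alpha, seq_len, hidden_size):
    # Magnitude of the uniform noise: alpha / sqrt(L * d).
    return alpha / math.sqrt(seq_len * hidden_size)


def add_neftune_noise(embeddings, alpha, rng):
    seq_len, hidden_size = len(embeddings), len(embeddings[0])
    bound = neftune_noise_bound(alpha, seq_len, hidden_size)
    # Every element receives its own independent uniform sample.
    return [[value + rng.uniform(-bound, bound) for value in row] for row in embeddings]


rng = random.Random(0)  # seeded only to make the sketch repeatable
zeros = [[0.0] * 4 for _ in range(4)]
bound = neftune_noise_bound(5.0, 4, 4)
noisy = add_neftune_noise(zeros, 5.0, rng)
```

# Longer sequences and wider embeddings shrink the per-element noise, so the
# overall perturbation stays comparable across input sizes.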
================================================
FILE: ludwig/predict.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import sys
from ast import literal_eval
import pandas as pd
from ludwig.api import LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.constants import FULL, TEST, TRAINING, VALIDATION
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
logger = logging.getLogger(__name__)
def predict_cli(
model_path: str,
dataset: str | dict | pd.DataFrame = None,
data_format: str = None,
split: str = FULL,
batch_size: int = 128,
generation_config: str | None = None,
skip_save_unprocessed_output: bool = False,
skip_save_predictions: bool = False,
output_directory: str = "results",
gpus: str | int | list[int] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
backend: Backend | str = None,
logging_level: int = logging.INFO,
**kwargs,
) -> None:
"""Loads pre-trained model to make predictions on the provided data set.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used in the prediction.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param split: (str, default: `full`) split on which
to perform predictions. Valid values are `'training'`, `'validation'`,
`'test'` and `'full'`.
:param batch_size: (int, default `128`) size of batches for processing.
:param generation_config: (str, default: `None`) a string representing
the parameters for generation required to perform predictions with
an LLM. The string must be a JSON formatted dictionary with keys from
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
These will be merged with the generation parameters from the original
model config.
:param skip_save_unprocessed_output: (bool, default: `False`) by default
predictions and their probabilities are saved in both raw
unprocessed numpy files containing tensors and as postprocessed
CSV files (one for each output feature). If this parameter is True,
only the CSV ones are saved and the numpy ones are skipped.
:param skip_save_predictions: (bool, default: `False`) skips saving test
predictions CSV files
:param output_directory: (str, default: `'results'`) the directory that
will contain the prediction results.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float: default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param logging_level: (int) Log level that will be sent to stderr.
# Returns
:return: (`None`)
"""
model = LudwigModel.load(
model_path,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
model.predict(
dataset=dataset,
data_format=data_format,
split=split,
batch_size=batch_size,
generation_config=literal_eval(generation_config) if generation_config else None,
skip_save_unprocessed_output=skip_save_unprocessed_output,
skip_save_predictions=skip_save_predictions,
output_directory=output_directory,
return_type="dict",
)
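# Illustrative sketch (not part of Ludwig): predict_cli parses
# generation_config with ast.literal_eval, which reads Python literals. A
# string written as strict JSON with `true`, `false` or `null` is therefore
# rejected. The hypothetical helper below tries JSON first and falls back to
# literal_eval, so both spellings are accepted.

```python
import json
from ast import literal_eval


def parse_generation_config(text):
    # Hypothetical helper: returns None for a missing or empty string.
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON (for example single-quoted keys): try a Python literal.
        return literal_eval(text)


from_json = parse_generation_config('{"max_new_tokens": 32, "do_sample": true}')
from_literal = parse_generation_config("{'temperature': 0.1}")

try:
    literal_eval('{"do_sample": true}')
    literal_eval_accepts_json_true = True
except ValueError:
    literal_eval_accepts_json_true = False
```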
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script loads a pretrained model and uses it to predict",
prog="ludwig predict",
usage="%(prog)s [options]",
)
# ---------------
# Data parameters
# ---------------
parser.add_argument("--dataset", help="input data file path", required=True)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"tables",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
parser.add_argument(
"-s", "--split", default=FULL, choices=[TRAINING, VALIDATION, TEST, FULL], help="the split to test the model on"
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument("-gc", "--generation_config", help="generation config (LLMs only)", default=None)
# -------------------------
# Output results parameters
# -------------------------
parser.add_argument(
"-od", "--output_directory", type=str, default="results", help="directory that contains the results"
)
parser.add_argument(
"-ssuo",
"--skip_save_unprocessed_output",
help="skips saving intermediate NPY output files",
action="store_true",
default=False,
)
parser.add_argument(
"-sstp",
"--skip_save_predictions",
help="skips saving predictions CSV files",
action="store_true",
default=False,
)
# ------------------
# Generic parameters
# ------------------
parser.add_argument("-bs", "--batch_size", type=int, default=128, help="size of batches")
# ------------------
# Runtime parameters
# ------------------
parser.add_argument("-g", "--gpus", type=int, default=0, help="list of gpu to use")
parser.add_argument(
"-gml",
"--gpu_memory_limit",
type=float,
default=None,
help="maximum memory fraction [0, 1] allowed to allocate per GPU device",
)
parser.add_argument(
"-dpt",
"--disable_parallel_threads",
action="store_false",
dest="allow_parallel_threads",
help="disable PyTorch from using multithreading for reproducibility",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("predict", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.predict")
args.backend = initialize_backend(args.backend)
if args.backend.is_coordinator():
print_ludwig("Predict", LUDWIG_VERSION)
logger.info(f"Dataset path: {args.dataset}")
logger.info(f"Model path: {args.model_path}")
logger.info("")
predict_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
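# Illustrative sketch (not part of Ludwig): how an argparse flag such as
# --disable_parallel_threads, declared with action="store_false" and a custom
# dest, turns into a positive keyword argument. The parser below is a
# stand-alone demo with a hypothetical prog name.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument(
    "-dpt",
    "--disable_parallel_threads",
    action="store_false",           # passing the flag stores False
    dest="allow_parallel_threads",  # under this attribute name
)

default_args = parser.parse_args([])
disabled_args = parser.parse_args(["--disable_parallel_threads"])
```

# With store_false the implicit default is True, so vars(args) can be passed
# straight to a function that expects allow_parallel_threads.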
================================================
FILE: ludwig/preprocess.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import sys
import pandas as pd
import yaml
from ludwig.api import LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.data_utils import load_yaml
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
logger = logging.getLogger(__name__)
def preprocess_cli(
preprocessing_config: str | dict = None,
dataset: str | dict | pd.DataFrame = None,
training_set: str | dict | pd.DataFrame = None,
validation_set: str | dict | pd.DataFrame = None,
test_set: str | dict | pd.DataFrame = None,
training_set_metadata: str | dict = None,
data_format: str = None,
random_seed: int = default_random_seed,
logging_level: int = logging.INFO,
callbacks: list[Callback] = None,
backend: Backend | str = None,
**kwargs
) -> None:
"""*preprocess* runs Ludwig's preprocessing on a dataset. Builds a Ludwig model from the preprocessing config and
preprocesses the provided data, saving the processed data and metadata to a cache.
:param preprocessing_config: (Union[str, dict]) in-memory representation of
config or string path to a YAML config file.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used for training.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param kwargs: additional keyword arguments (for example, unused CLI
arguments) are accepted and ignored.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param random_seed: (int: default: 42) random seed used for weights
initialization, splits and any other random function.
:param logging_level: (int) Log level that will be sent to stderr.
# Return
:return: (`None`)
"""
model = LudwigModel(
config=preprocessing_config,
logging_level=logging_level,
callbacks=callbacks,
backend=backend,
)
model.preprocess(
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
skip_save_processed_input=False,
random_seed=random_seed,
)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script preprocesses a dataset", prog="ludwig preprocess", usage="%(prog)s [options]"
)
# ---------------
# Data parameters
# ---------------
parser.add_argument(
"--dataset",
help="input data file path. "
"If it has a split column, it will be used for splitting "
"(0: train, 1: validation, 2: test), "
"otherwise the dataset will be randomly split",
)
parser.add_argument("--training_set", help="input train data file path")
parser.add_argument("--validation_set", help="input validation data file path")
parser.add_argument("--test_set", help="input test data file path")
parser.add_argument(
"--training_set_metadata",
help="input metadata JSON file path. An intermediate preprocessed file "
"containing the mappings of the input file created "
"the first time a file is used, in the same directory "
"with the same name and a .meta.json extension",
)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"tables",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
# ----------------
# Model parameters
# ----------------
preprocessing_def = parser.add_mutually_exclusive_group(required=True)
preprocessing_def.add_argument(
"-pc",
"--preprocessing_config",
dest="preprocessing_config",
type=load_yaml,
help="YAML file describing the preprocessing. "
"Mutually exclusive with --preprocessing_config_str. "
"Uses the same format of config, "
"but ignores encoder specific parameters, "
"decoder specific parameters, combiner and training parameters",
)
preprocessing_def.add_argument(
"-pcs",
"--preprocessing_config_str",
dest="preprocessing_config",
type=yaml.safe_load,
help="preprocessing config as a YAML string. "
"Uses the same format of config, "
"but ignores encoder specific parameters, "
"decoder specific parameters, combiner and training parameters",
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-rs",
"--random_seed",
type=int,
default=42,
help="a random seed that is going to be used anywhere there is a call "
"to a random number generator: data splitting, parameter "
"initialization and training set shuffling",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("preprocess", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.preprocess")
args.backend = initialize_backend(args.backend)
if args.backend.is_coordinator():
print_ludwig("Preprocess", LUDWIG_VERSION)
preprocess_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
================================================
FILE: ludwig/progress_bar.py
================================================
import uuid
import tqdm
try:
import ray.train as rt
except ImportError:
rt = None
class LudwigProgressBarActions:
CREATE = "create"
UPDATE = "update"
CLOSE = "close"
class LudwigProgressBar:
"""Class for progress bars that supports distributed progress bars in ray.
# Inputs
:param report_to_ray: (bool) use the ray.train.report method
to report progress to the ray driver. If false then this behaves as a normal tqdm
progress bar
:param config: (dict) the tqdm configs used for the progress bar. See https://github.com/tqdm/tqdm#parameters
for list of parameters
:param is_coordinator: (bool) whether the calling process is the coordinator process.
# Example usage:
```python
from ludwig.progress_bar import LudwigProgressBar
config = {"total": 20, "desc": "Sample progress bar"}
pbar = LudwigProgressBar(report_to_ray=False, config=config, is_coordinator=True)
for i in range(20):
pbar.update(1)
pbar.close()
```
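# Example of reported messages (sketch):
When `report_to_ray=True`, no local tqdm bar is created; each call instead sends a `progress_bar` message
through `ray.train.report`. The sketch below shows one way a driver-side consumer could fold those messages
into per-bar state. `apply_progress_message` and the `bars` dict are hypothetical names used for illustration
and are not part of Ludwig; only the message fields mirror what this class reports.

```python
# Hypothetical driver-side handler (an assumption, not Ludwig's implementation).
# The message fields mirror the `progress_bar` dict this class reports.
def apply_progress_message(bars: dict, message: dict) -> None:
    bar_id = message["id"]
    action = message["action"]
    if action == "create":
        # Remember the configured total and start counting from zero.
        bars[bar_id] = {"total": message["config"].get("total"), "steps": 0, "closed": False}
    elif action == "update":
        bars[bar_id]["steps"] += message["update_by"]
    elif action == "close":
        bars[bar_id]["closed"] = True
    else:
        raise ValueError(f"Unknown progress bar action: {action}")


bars = {}
messages = [
    {"id": "ab12cd34", "config": {"total": 20, "desc": "Sample"}, "action": "create", "is_coordinator": True},
    {"id": "ab12cd34", "update_by": 5, "action": "update", "is_coordinator": True},
    {"id": "ab12cd34", "update_by": 3, "action": "update", "is_coordinator": True},
    {"id": "ab12cd34", "action": "close", "is_coordinator": True},
]
for message in messages:
    apply_progress_message(bars, message)
```

Keying the state by `id` keeps concurrent bars from different workers separate.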
"""
def __init__(
self,
report_to_ray: bool,
config: dict,
is_coordinator: bool,
) -> None:
"""Constructor for the LudwigProgressBar class.
# Inputs
:param report_to_ray: (bool) use the ray.train.report method
to report progress to the ray driver. If false then this behaves as a normal tqdm
progress bar
:param config: (dict) the tqdm configs used for the progress bar. See https://github.com/tqdm/tqdm#parameters
for list of parameters
:param is_coordinator: (bool) whether the calling process is the coordinator process.
# Return
:return: (None) `None`
"""
if report_to_ray and rt is None:
raise ValueError("Set report_to_ray=True but ray is not installed. Run `pip install ray`")
self.id = str(uuid.uuid4())[-8:]
self.report_to_ray = report_to_ray
self.is_coordinator = is_coordinator
self.config = config
self.total_steps = 0
self.progress_bar = None
if not self.report_to_ray:
if self.is_coordinator:
self.progress_bar = tqdm.tqdm(**config)
else:
if "file" in self.config:
self.config.pop("file")
# All processes need to call ray.train.report since ray has a lock that blocks
# a process when calling report if there are processes that haven't called it. Similar
# to a distributed checkpoint. Therefore we pass the flag to the driver.
# In Ray 2.x, rt.report() only accepts metrics and checkpoint kwargs,
# so we pass progress_bar data inside the metrics dict.
rt.report(
metrics={
"progress_bar": {
"id": self.id,
"config": self.config,
"action": LudwigProgressBarActions.CREATE,
"is_coordinator": self.is_coordinator,
}
}
)
def set_postfix(self, ordered_dict: dict = None, **kwargs) -> None:
"""Sets the postfix (additional stats) for the progress bar."""
if self.progress_bar:
self.progress_bar.set_postfix(ordered_dict, **kwargs)
def update(self, steps: int) -> None:
"""Updates the progress bar.
# Inputs
:param steps: (int) number of steps to update the progress bar by
# Return
:return: (None) `None`
"""
self.total_steps += steps
if self.progress_bar:
self.progress_bar.update(steps)
elif self.report_to_ray:
rt.report(
metrics={
"progress_bar": {
"id": self.id,
"update_by": steps,
"is_coordinator": self.is_coordinator,
"action": LudwigProgressBarActions.UPDATE,
}
}
)
def close(self) -> None:
"""Closes the progress bar.
# Return
:return: (None) `None`
"""
if self.progress_bar:
self.progress_bar.close()
elif self.report_to_ray:
rt.report(
metrics={
"progress_bar": {
"id": self.id,
"is_coordinator": self.is_coordinator,
"action": LudwigProgressBarActions.CLOSE,
}
}
)
================================================
FILE: ludwig/schema/__init__.py
================================================
# TODO(travis): figure out why we need these imports to avoid circular import error
from ludwig.schema.combiners.utils import get_combiner_jsonschema # noqa
from ludwig.schema.features.utils import get_input_feature_jsonschema, get_output_feature_jsonschema # noqa
from ludwig.schema.hyperopt import get_hyperopt_jsonschema # noqa
from ludwig.schema.trainer import get_model_type_jsonschema, get_trainer_jsonschema # noqa
================================================
FILE: ludwig/schema/combiners/__init__.py
================================================
import ludwig.schema.combiners.comparator # noqa: F401
import ludwig.schema.combiners.concat # noqa: F401
import ludwig.schema.combiners.project_aggregate # noqa: F401
import ludwig.schema.combiners.sequence # noqa: F401
import ludwig.schema.combiners.sequence_concat # noqa: F401
import ludwig.schema.combiners.tab_transformer # noqa: F401
import ludwig.schema.combiners.tabnet # noqa: F401
import ludwig.schema.combiners.transformer # noqa: F401
================================================
FILE: ludwig/schema/combiners/base.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class BaseCombinerConfig(schema_utils.BaseMarshmallowConfig):
"""Base combiner config class."""
type: str
================================================
FILE: ludwig/schema/combiners/common_transformer_options.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class CommonTransformerConfig:
"""Common transformer parameter values."""
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="Dropout rate for the transformer block.",
parameter_metadata=COMBINER_METADATA["transformer"]["dropout"],
)
transformer_output_size: int = schema_utils.NonNegativeInteger(
default=256,
description="Size of the fully connected layer after self attention in the transformer block. This is usually "
"the same as `hidden_size` and `embedding_size`.",
parameter_metadata=COMBINER_METADATA["transformer"]["transformer_output_size"],
)
hidden_size: int = schema_utils.NonNegativeInteger(
default=256,
description="The number of hidden units of the TransformerStack as well as the dimension that each incoming "
"input feature is projected to before feeding to the TransformerStack.",
parameter_metadata=COMBINER_METADATA["transformer"]["hidden_size"],
)
num_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of transformer layers.",
parameter_metadata=COMBINER_METADATA["transformer"]["num_layers"],
)
num_heads: int = schema_utils.NonNegativeInteger(
default=8,
description="Number of heads of the self attention in the transformer block.",
parameter_metadata=COMBINER_METADATA["transformer"]["num_heads"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=COMBINER_METADATA["transformer"]["use_bias"],
)
bias_initializer: str | dict = common_fields.BiasInitializerField()
weights_initializer: str | dict = common_fields.WeightsInitializerField()
# TODO(#1673): Add conditional logic for fields like this one:
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=0,
description="The number of stacked fully connected layers (only applies if `reduce_output` is not null).",
parameter_metadata=COMBINER_METADATA["transformer"]["num_fc_layers"],
)
output_size: int = schema_utils.PositiveInteger(
default=256,
description="Output size of a fully connected layer.",
parameter_metadata=COMBINER_METADATA["transformer"]["output_size"],
)
norm: str | None = common_fields.NormField()
norm_params: dict | None = common_fields.NormParamsField()
fc_layers: list[dict[str, Any]] | None = common_fields.FCLayersField()
fc_dropout: float = common_fields.DropoutField()
fc_activation: str = schema_utils.ActivationOptions(
default="relu",
parameter_metadata=COMBINER_METADATA["transformer"]["fc_activation"],
)
fc_residual: bool = common_fields.ResidualField()
================================================
FILE: ludwig/schema/combiners/comparator.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_combiner_config("comparator")
@ludwig_dataclass
class ComparatorCombinerConfig(BaseCombinerConfig):
"""Parameters for comparator combiner."""
def __post_init__(self):
if self.num_fc_layers == 0 and self.fc_layers is None:
raise ConfigValidationError(
"`combiner.type=comparator` requires at least one fully connected layer. "
"Set `num_fc_layers > 0` or `fc_layers`."
)
if not self.entity_1:
raise ConfigValidationError(
"`combiner.entity_1` is required and must contain at least one input feature name."
)
if not self.entity_2:
raise ConfigValidationError(
"`combiner.entity_2` is required and must contain at least one input feature name."
)
type: str = schema_utils.ProtectedString(
"comparator",
description=COMBINER_METADATA["comparator"]["type"].long_description,
)
entity_1: list[str] = schema_utils.List(
default=None,
description=(
"The list of input feature names `[feature_1, feature_2, ...]` constituting the first entity to compare. "
"*Required*."
),
parameter_metadata=COMBINER_METADATA["comparator"]["entity_1"],
)
entity_2: list[str] = schema_utils.List(
default=None,
description=(
"The list of input feature names `[feature_1, feature_2, ...]` constituting the second entity to compare. "
"*Required*."
),
parameter_metadata=COMBINER_METADATA["comparator"]["entity_2"],
)
dropout: float = common_fields.DropoutField()
activation: str = schema_utils.ActivationOptions(default="relu")
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=COMBINER_METADATA["comparator"]["use_bias"],
)
bias_initializer: str | dict = common_fields.BiasInitializerField()
weights_initializer: str | dict = common_fields.WeightsInitializerField()
num_fc_layers: int = common_fields.NumFCLayersField(default=1)
output_size: int = schema_utils.PositiveInteger(
default=256,
description="Output size of a fully connected layer.",
parameter_metadata=COMBINER_METADATA["comparator"]["output_size"],
)
norm: str | None = common_fields.NormField()
norm_params: dict | None = common_fields.NormParamsField()
fc_layers: list[dict[str, Any]] | None = common_fields.FCLayersField()
================================================
FILE: ludwig/schema/combiners/concat.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_combiner_config("concat")
@ludwig_dataclass
class ConcatCombinerConfig(BaseCombinerConfig):
"""Parameters for concat combiner."""
type: str = schema_utils.ProtectedString(
"concat",
description=COMBINER_METADATA["concat"]["type"].long_description,
)
dropout: float = common_fields.DropoutField()
activation: str = schema_utils.ActivationOptions(default="relu")
flatten_inputs: bool = schema_utils.Boolean(
default=False,
description="Whether to flatten input tensors to a vector.",
parameter_metadata=COMBINER_METADATA["concat"]["flatten_inputs"],
)
residual: bool = common_fields.ResidualField()
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=COMBINER_METADATA["concat"]["use_bias"],
)
bias_initializer: str | dict = common_fields.BiasInitializerField()
weights_initializer: str | dict = common_fields.WeightsInitializerField()
num_fc_layers: int = common_fields.NumFCLayersField()
output_size: int = schema_utils.PositiveInteger(
default=256,
description="Output size of a fully connected layer.",
parameter_metadata=COMBINER_METADATA["concat"]["output_size"],
)
norm: str | None = common_fields.NormField()
norm_params: dict | None = common_fields.NormParamsField()
fc_layers: list[dict[str, Any]] | None = common_fields.FCLayersField()
================================================
FILE: ludwig/schema/combiners/project_aggregate.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_combiner_config("project_aggregate")
@ludwig_dataclass
class ProjectAggregateCombinerConfig(BaseCombinerConfig):
type: str = schema_utils.ProtectedString(
"project_aggregate",
description=COMBINER_METADATA["project_aggregate"]["type"].long_description,
)
projection_size: int = schema_utils.PositiveInteger(
default=128,
description="All combiner inputs are projected to this size before being aggregated.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["projection_size"],
)
residual: bool = schema_utils.Boolean(
default=True,
description="Whether to add residual skip connection between the fully connected layers in the stack.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["residual"],
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout rate to apply to each fully connected layer.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
default="relu",
description="Activation to apply to each fully connected layer.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["activation"],
)
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=2,
description="Number of fully connected layers after aggregation.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["num_fc_layers"],
)
output_size: int = schema_utils.PositiveInteger(
default=128,
description="Output size of each layer of the stack of fully connected layers.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["output_size"],
)
norm: str | None = schema_utils.StringOptions(
["batch", "layer"],
default="layer",
description="Normalization to apply to each projection and fully connected layer.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["norm"],
)
norm_params: dict | None = schema_utils.Dict(
description="Parameters of the normalization to apply to each projection and fully connected layer.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["norm_params"],
)
fc_layers: list[dict[str, Any]] | None = schema_utils.DictList(
description="Full specification of the fully connected layers after the aggregation. It should be a list of "
"dict, each dict representing one layer of the fully connected layer stack. ",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["fc_layers"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layers use a bias vector.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["use_bias"],
)
bias_initializer: str | dict = schema_utils.InitializerOrDict(
default="zeros",
description="Initializer to use for the bias of the projection and for the fully connected layers.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["bias_initializer"],
)
weights_initializer: str | dict = schema_utils.InitializerOrDict(
default="xavier_uniform",
description="Initializer to use for the weights of the projection and for the fully connected layers.",
parameter_metadata=COMBINER_METADATA["project_aggregate"]["weights_initializer"],
)
================================================
FILE: ludwig/schema/combiners/sequence.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, SEQUENCE
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.sequence_concat import MAIN_SEQUENCE_FEATURE_DESCRIPTION
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
"""
SEQUENCE encoders that always return 2D [batch_size, hidden_size] tensors, regardless of how they are parameterized.
These should never be used with modules that expect 3D tensors, such as the SequenceCombiner.
"""
_2D_SEQUENCE_ENCODERS = ["embed"]
@DeveloperAPI
@register_combiner_config("sequence")
@ludwig_dataclass
class SequenceCombinerConfig(BaseCombinerConfig):
"""Parameters for sequence combiner."""
type: str = schema_utils.ProtectedString(
"sequence",
description=COMBINER_METADATA["sequence"]["type"].long_description,
)
main_sequence_feature: str | None = schema_utils.String(
default=None,
allow_none=True,
description=MAIN_SEQUENCE_FEATURE_DESCRIPTION,
parameter_metadata=COMBINER_METADATA["sequence"]["main_sequence_feature"],
)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=SEQUENCE,
default="parallel_cnn",
description="Encoder to apply to `main_sequence_feature`. The encoder must produce"
" a tensor of size [batch_size, sequence_length, hidden_size]",
blocklist=_2D_SEQUENCE_ENCODERS,
)
reduce_output: str | None = schema_utils.ReductionOptions(
default=None,
description="Strategy to use to aggregate the embeddings of the items of the set.",
parameter_metadata=COMBINER_METADATA["sequence"]["reduce_output"],
)
================================================
FILE: ludwig/schema/combiners/sequence_concat.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
MAIN_SEQUENCE_FEATURE_DESCRIPTION = """
Name of a sequence, text, or time series feature to concatenate the outputs
of the other features to. If no `main_sequence_feature` is specified, the combiner will look through all the features in
the order they are defined in the configuration and will look for a feature with a rank 3 tensor output (sequence, text
or time series). If it cannot find one it will raise an exception, otherwise the output of that feature will be used for
concatenating the other features along the sequence `s` dimension. If there are other input features with a rank 3
output tensor, the combiner will concatenate them alongside the `s` dimension. All sequence-like input features must
have identical `s` dimension, otherwise an error will be thrown.
"""
@DeveloperAPI
@register_combiner_config("sequence_concat")
@ludwig_dataclass
class SequenceConcatCombinerConfig(BaseCombinerConfig):
"""Parameters for sequence concat combiner."""
@staticmethod
def module_name():
return "sequence_concat"
type: str = schema_utils.ProtectedString(
"sequence_concat",
description=COMBINER_METADATA["sequence_concat"]["type"].long_description,
)
main_sequence_feature: str | None = schema_utils.String(
default=None,
allow_none=True,
description=MAIN_SEQUENCE_FEATURE_DESCRIPTION,
parameter_metadata=COMBINER_METADATA["sequence_concat"]["main_sequence_feature"],
)
reduce_output: str | None = schema_utils.ReductionOptions(
default=None,
description="Strategy to use to aggregate the embeddings of the items of the set.",
parameter_metadata=COMBINER_METADATA["sequence_concat"]["reduce_output"],
)
================================================
FILE: ludwig/schema/combiners/tab_transformer.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.common_transformer_options import CommonTransformerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_combiner_config("tabtransformer")
@ludwig_dataclass
class TabTransformerCombinerConfig(BaseCombinerConfig, CommonTransformerConfig):
"""Parameters for tab transformer combiner."""
type: str = schema_utils.ProtectedString(
"tabtransformer",
description=COMBINER_METADATA["tabtransformer"]["type"].long_description,
)
embed_input_feature_name: str | int | None = schema_utils.Embed(
description="This value controls the size of the embeddings. Valid values are `add`, which uses the "
"`hidden_size` value, or an integer that sets a specific size. An integer "
"value must be smaller than `hidden_size`.",
parameter_metadata=COMBINER_METADATA["tabtransformer"]["embed_input_feature_name"],
)
reduce_output: str = schema_utils.ReductionOptions(
default="concat",
description="Strategy to use to aggregate the output of the transformer.",
parameter_metadata=COMBINER_METADATA["tabtransformer"]["reduce_output"],
)
================================================
FILE: ludwig/schema/combiners/tabnet.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_combiner_config("tabnet")
@ludwig_dataclass
class TabNetCombinerConfig(BaseCombinerConfig):
"""Parameters for tabnet combiner."""
type: str = schema_utils.ProtectedString(
"tabnet",
description=COMBINER_METADATA["tabnet"]["type"].long_description,
)
size: int = schema_utils.PositiveInteger(
default=32,
description="Size of the hidden layers. `N_a` in (Arik and Pfister, 2019).",
parameter_metadata=COMBINER_METADATA["tabnet"]["size"],
)
dropout: float = schema_utils.FloatRange(
default=0.05,
min=0,
max=1,
description="Dropout rate for the transformer block.",
parameter_metadata=COMBINER_METADATA["tabnet"]["dropout"],
)
output_size: int = schema_utils.PositiveInteger(
default=128,
description="Output size of a fully connected layer. `N_d` in (Arik and Pfister, 2019).",
parameter_metadata=COMBINER_METADATA["tabnet"]["output_size"],
)
num_steps: int = schema_utils.NonNegativeInteger(
default=3,
description="Number of steps / repetitions of the attentive transformer and feature transformer "
"computations. `N_steps` in (Arik and Pfister, 2019).",
parameter_metadata=COMBINER_METADATA["tabnet"]["num_steps"],
)
num_total_blocks: int = schema_utils.NonNegativeInteger(
default=4,
description="Total number of feature transformer blocks at each step.",
parameter_metadata=COMBINER_METADATA["tabnet"]["num_total_blocks"],
)
num_shared_blocks: int = schema_utils.NonNegativeInteger(
default=2,
description="Number of shared feature transformer blocks across the steps.",
parameter_metadata=COMBINER_METADATA["tabnet"]["num_shared_blocks"],
)
relaxation_factor: float = schema_utils.FloatRange(
default=1.5,
description="Factor that influences how many times a feature should be used across the steps of computation. "
"A value of 1 implies each feature should be used once; a higher value allows for multiple "
"usages. `gamma` in (Arik and Pfister, 2019).",
parameter_metadata=COMBINER_METADATA["tabnet"]["relaxation_factor"],
)
bn_epsilon: float = schema_utils.FloatRange(
default=1e-3,
description="Epsilon to be added to the batch norm denominator.",
parameter_metadata=COMBINER_METADATA["tabnet"]["bn_epsilon"],
)
bn_momentum: float = schema_utils.FloatRange(
default=0.05,
description="Momentum of the batch norm. 1 - `m_B` from the TabNet paper.",
parameter_metadata=COMBINER_METADATA["tabnet"]["bn_momentum"],
)
bn_virtual_bs: int | None = schema_utils.PositiveInteger(
default=1024,
allow_none=True,
description="Size of the virtual batch size used by ghost batch norm. If null, regular batch norm is used "
"instead. `B_v` from the TabNet paper.",
parameter_metadata=COMBINER_METADATA["tabnet"]["bn_virtual_bs"],
)
sparsity: float = schema_utils.FloatRange(
default=1e-4,
description="Multiplier of the sparsity inducing loss. `lambda_sparse` in (Arik and Pfister, 2019).",
parameter_metadata=COMBINER_METADATA["tabnet"]["sparsity"],
)
entmax_mode: str = schema_utils.StringOptions(
["entmax15", "sparsemax", "constant", "adaptive"],
default="sparsemax",
description=(
"Entmax is a sparse family of probability mapping which generalizes softmax and sparsemax. "
"`entmax_mode` controls the sparsity."
),
parameter_metadata=COMBINER_METADATA["tabnet"]["entmax_mode"],
)
entmax_alpha: float = schema_utils.FloatRange(
default=1.5,
min=1,
max=2,
description=(
"Must be a number between 1.0 and 2.0. If entmax_mode is `adaptive`, "
"`entmax_alpha` is used as the initial value for the learnable parameter. "
"1 corresponds to softmax, 2 is sparsemax."
),
parameter_metadata=COMBINER_METADATA["tabnet"]["entmax_alpha"],
)
================================================
FILE: ludwig/schema/combiners/transformer.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.common_transformer_options import CommonTransformerConfig
from ludwig.schema.combiners.utils import register_combiner_config
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_combiner_config("transformer")
@ludwig_dataclass
class TransformerCombinerConfig(BaseCombinerConfig, CommonTransformerConfig):
"""Parameters for transformer combiner."""
type: str = schema_utils.ProtectedString(
"transformer",
description=COMBINER_METADATA["transformer"]["type"].long_description,
)
reduce_output: str | None = schema_utils.ReductionOptions(
default="mean",
description="Strategy to use to aggregate the output of the transformer.",
parameter_metadata=COMBINER_METADATA["transformer"]["reduce_output"],
)
================================================
FILE: ludwig/schema/combiners/utils.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import TYPE
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.metadata import COMBINER_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json, ParameterMetadata
from ludwig.utils.registry import Registry
DEFAULT_VALUE = "concat"
DESCRIPTION = "Select the combiner type."
combiner_config_registry = Registry[type[BaseCombinerConfig]]()
@DeveloperAPI
def register_combiner_config(name: str):
def wrap(cls: type[BaseCombinerConfig]):
combiner_config_registry[name] = cls
return cls
return wrap
@DeveloperAPI
def get_combiner_registry():
return combiner_config_registry
@DeveloperAPI
def get_combiner_jsonschema():
"""Returns a JSON schema structured to only require a `type` key and then conditionally apply a corresponding
combiner's field constraints."""
combiner_types = sorted(list(combiner_config_registry.keys()))
parameter_metadata = convert_metadata_to_json(
ParameterMetadata.from_dict(
{
"commonly_used": True,
"expected_impact": 3,
"ui_display_name": "Combiner Type",
}
)
)
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": combiner_types,
"enumDescriptions": get_combiner_descriptions(),
"default": DEFAULT_VALUE,
"title": "combiner_options",
"description": DESCRIPTION,
"parameter_metadata": parameter_metadata,
},
},
"allOf": get_combiner_conds(),
"required": ["type"],
}
@DeveloperAPI
def get_combiner_descriptions():
"""This function returns a dictionary of combiner descriptions available at the type selection.
The combiner metadata is filtered down to the combiner types registered in `combiner_config_registry`. For each
registered type, the metadata stored under its `type` key is converted with `convert_metadata_to_json` and added
to the output dictionary, keyed by combiner type.
Returns:
dict: A dictionary of combiner descriptions.
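Example:
A minimal sketch of the filtering step. The `metadata` and `registry` dicts are hypothetical stand-ins: the
real `COMBINER_METADATA` values are `ParameterMetadata` objects, the real registry maps names to config
classes, and the conversion through `convert_metadata_to_json` is omitted here.

```python
# Hypothetical stand-in data (assumption): plain strings replace ParameterMetadata objects.
metadata = {
    "concat": {"type": "Concatenates encoder outputs."},
    "tabnet": {"type": "TabNet combiner."},
    "unregistered": {"type": "No registered config class."},
}
registry = {"concat": object, "tabnet": object}

# Keep only the entries whose key is also a registered combiner type.
descriptions = {name: meta["type"] for name, meta in metadata.items() if name in registry}
```

Metadata entries without a registered config class are dropped.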
"""
return {k: convert_metadata_to_json(v[TYPE]) for k, v in COMBINER_METADATA.items() if k in combiner_config_registry}
@DeveloperAPI
def get_combiner_conds() -> list[dict[str, Any]]:
"""Returns a list of if-then JSON clauses for each combiner type in `combiner_config_registry` and its properties'
constraints."""
combiner_types = sorted(list(combiner_config_registry.keys()))
conds = []
for combiner_type in combiner_types:
combiner_cls = combiner_config_registry[combiner_type]
schema_cls = combiner_cls
combiner_schema = schema_utils.unload_jsonschema_from_marshmallow_class(schema_cls)
combiner_props = combiner_schema["properties"]
schema_utils.remove_duplicate_fields(combiner_props)
combiner_cond = schema_utils.create_cond({"type": combiner_type}, combiner_props)
conds.append(combiner_cond)
return conds
class CombinerSelection(schema_utils.TypeSelection):
def __init__(self):
# For registration of all combiners
import ludwig.combiners.combiners # noqa
super().__init__(registry=combiner_config_registry, default_value=DEFAULT_VALUE, description=DESCRIPTION)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return self.registry[key]
def _jsonschema_type_mapping(self):
return get_combiner_jsonschema()
================================================
FILE: ludwig/schema/common_fields.py
================================================
from dataclasses import Field
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import COMMON_METADATA
from ludwig.schema.metadata.parameter_metadata import ParameterMetadata
from ludwig.utils.torch_utils import initializer_registry
def DropoutField(default: float = 0.0, description: str = None, parameter_metadata: ParameterMetadata = None) -> Field:
description = description or "Default dropout rate applied to fully connected layers."
full_description = description + (
" Increasing dropout is a common form of regularization to combat overfitting. "
"The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout)."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["dropout"]
return schema_utils.FloatRange(
default=default,
min=0,
max=1,
description=full_description,
parameter_metadata=parameter_metadata,
)
def ResidualField(
default: bool = False, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"Whether to add a residual connection to each fully connected layer block. "
"Requires all fully connected layers to have the same `output_size`."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["residual"]
return schema_utils.Boolean(
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
def NumFCLayersField(
default: int = 0, description: str = None, parameter_metadata: ParameterMetadata = None, non_zero=False
) -> Field:
assert not non_zero or default > 0, "`default` must be positive when `non_zero` is set."
description = description or "Number of stacked fully connected layers to apply."
full_description = description + (
" Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["num_fc_layers"]
# When using a dense encoder, the number of fully connected layers must be strictly greater than 0.
if non_zero:
return schema_utils.PositiveInteger(
default=default, allow_none=False, description=full_description, parameter_metadata=parameter_metadata
)
return schema_utils.NonNegativeInteger(
default=default,
allow_none=False,
description=full_description,
parameter_metadata=parameter_metadata,
)
def NormField(
default: str | None = None, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or "Default normalization applied at the beginning of fully connected layers."
parameter_metadata = parameter_metadata or COMMON_METADATA["norm"]
return schema_utils.StringOptions(
["batch", "layer", "ghost"],
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
def NormParamsField(description: str = None, parameter_metadata: ParameterMetadata = None) -> Field:
description = description or "Default parameters passed to the `norm` module."
parameter_metadata = parameter_metadata or COMMON_METADATA["norm_params"]
return schema_utils.Dict(
description=description,
parameter_metadata=parameter_metadata,
)
def FCLayersField(description: str = None, parameter_metadata: ParameterMetadata = None) -> Field:
description = description or (
"List of dictionaries containing the parameters of all the fully connected layers. "
"The length of the list determines the number of stacked fully connected layers "
"and the content of each dictionary determines the parameters for a specific layer. "
"The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, "
"`output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values "
"is missing from the dictionary, the default one provided as a standalone parameter will be used instead."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["fc_layers"]
return schema_utils.DictList(
description=description,
parameter_metadata=parameter_metadata,
)
INITIALIZER_SUFFIX = """
Alternatively it is possible to specify a dictionary with a key `type` that identifies the type of initializer and
other keys for its parameters, e.g. `{type: normal, mean: 0, stddev: 0}`. For a description of the parameters of each
initializer, see [torch.nn.init](https://pytorch.org/docs/stable/nn.init.html).
"""
def BiasInitializerField(
default: str = "zeros", description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
initializers_str = ", ".join([f"`{i}`" for i in initializer_registry.keys()])
description = description or "Initializer for the bias vector."
full_description = f"{description} Options: {initializers_str}. {INITIALIZER_SUFFIX}"
parameter_metadata = parameter_metadata or COMMON_METADATA["bias_initializer"]
return schema_utils.InitializerOrDict(
default=default,
description=full_description,
parameter_metadata=parameter_metadata,
)
def WeightsInitializerField(
default: str = "xavier_uniform", description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
initializers_str = ", ".join([f"`{i}`" for i in initializer_registry.keys()])
description = description or "Initializer for the weight matrix."
full_description = f"{description} Options: {initializers_str}. {INITIALIZER_SUFFIX}"
parameter_metadata = parameter_metadata or COMMON_METADATA["weights_initializer"]
return schema_utils.InitializerOrDict(
default=default,
description=full_description,
parameter_metadata=parameter_metadata,
)
def EmbeddingInitializerField(
default: str | None = None, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or "Initializer for the embedding matrix."
parameter_metadata = parameter_metadata or COMMON_METADATA["embedding_initializer"]
return schema_utils.StringOptions(
list(initializer_registry.keys()),
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
def EmbeddingSizeField(
default: int = 256, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for "
"`dense` representations and exactly `vocabulary_size` for the `sparse` encoding, where `vocabulary_size` "
"is the number of unique strings appearing in the training set input column plus the number of "
"special tokens (`<UNK>`, `<PAD>`, `<SOS>`, `<EOS>`)."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["embedding_size"]
return schema_utils.PositiveInteger(
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
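# Illustration only (not part of the Ludwig source): a minimal sketch of the sizing rule stated in the
# `EmbeddingSizeField` description. The helper name `effective_embedding_size` is hypothetical.

```python
def effective_embedding_size(vocabulary_size: int, embedding_size: int, representation: str = "dense") -> int:
    """Hypothetical helper mirroring the rule from the field description."""
    if representation == "dense":
        # Dense embeddings are capped by the vocabulary size.
        return min(vocabulary_size, embedding_size)
    if representation == "sparse":
        # Sparse (one-hot) encodings always have one dimension per vocabulary entry.
        return vocabulary_size
    raise ValueError(f"Unknown representation: {representation!r}")


print(effective_embedding_size(100, 256))  # small vocabulary, dense
print(effective_embedding_size(10_000, 256))  # large vocabulary, dense
print(effective_embedding_size(10_000, 256, "sparse"))  # sparse ignores embedding_size
```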
def EmbeddingsOnCPUField(
default: bool = False, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"Whether to force the placement of the embedding matrix in regular memory and have the CPU resolve them. "
"By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster access, "
"but in some cases the embedding matrix may be too large. This parameter forces the placement of the "
"embedding matrix in regular memory and the CPU is used for embedding lookup, slightly slowing down the "
"process as a result of data transfer between CPU and GPU memory."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["embeddings_on_cpu"]
return schema_utils.Boolean(
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
def EmbeddingsTrainableField(
default: bool = True, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. "
"This can be useful when loading pretrained embeddings to avoid fine-tuning them. This parameter "
"has an effect only when `representation` is `dense`; `sparse` one-hot encodings are not trainable."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["embeddings_trainable"]
return schema_utils.Boolean(
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
def PretrainedEmbeddingsField(
default: str | None = None, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"Path to a file containing pretrained embeddings. By default `dense` embeddings are initialized "
"randomly, but this parameter lets you specify a path to a file containing embeddings in the "
"[GloVe format](https://nlp.stanford.edu/projects/glove/). When the file containing the embeddings is "
"loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. "
"If the vocabulary contains strings that have no match in the embeddings file, their embeddings are "
"initialized with the average of all other embeddings plus some random noise to make them different "
"from each other. This parameter has an effect only if `representation` is `dense`."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["pretrained_embeddings"]
return schema_utils.String(
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
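# Illustration only (not part of the Ludwig source): a rough sketch of the behavior the
# `PretrainedEmbeddingsField` description outlines, using plain lists. The function name, the
# `noise_scale` and `seed` parameters, and the choice to average over the kept embeddings are
# assumptions of this sketch, not Ludwig's actual loader.

```python
import random


def fill_missing_embeddings(vocab, pretrained, noise_scale=0.01, seed=0):
    """Keep pretrained vectors for in-vocabulary tokens; fill the rest with mean + noise."""
    rng = random.Random(seed)
    # Only embeddings whose label appears in the vocabulary are kept.
    kept = {tok: pretrained[tok] for tok in vocab if tok in pretrained}
    if not kept:
        raise ValueError("No vocabulary token has a pretrained embedding.")
    dim = len(next(iter(kept.values())))
    mean = [sum(vec[i] for vec in kept.values()) / len(kept) for i in range(dim)]
    result = {}
    for tok in vocab:
        if tok in kept:
            result[tok] = kept[tok]
        else:
            # Small random noise keeps the filled-in vectors distinct from each other.
            result[tok] = [m + rng.gauss(0.0, noise_scale) for m in mean]
    return result


embeddings = fill_missing_embeddings(
    vocab=["cat", "dog", "bird"],
    pretrained={"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [5.0, 5.0]},
)
print(sorted(embeddings))
```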
def MaxSequenceLengthField(
default: int | None = None, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or "[internal] Maximum sequence length from preprocessing."
parameter_metadata = parameter_metadata or COMMON_METADATA["max_sequence_length"]
return schema_utils.PositiveInteger(
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
def VocabField(
default: list | None = None, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or "[internal] Vocabulary for the encoder from preprocessing."
parameter_metadata = parameter_metadata or COMMON_METADATA["vocab"]
return schema_utils.List(
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
def VocabSizeField(
default: int | None = None, description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or "[internal] Size of the vocabulary from preprocessing."
parameter_metadata = parameter_metadata or COMMON_METADATA["vocab_size"]
return schema_utils.PositiveInteger(
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
def RepresentationField(
default: str = "dense", description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"Representation of the embedding. `dense` means the embeddings are initialized randomly, "
"`sparse` means they are initialized to be one-hot encodings."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["representation"]
return schema_utils.StringOptions(
["dense", "sparse"],
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
def ReduceOutputField(
default: str | None = "sum", description: str = None, parameter_metadata: ParameterMetadata = None
) -> Field:
description = description or (
"How to reduce the output tensor along the `s` sequence length dimension if the rank of the "
"tensor is greater than 2."
)
parameter_metadata = parameter_metadata or COMMON_METADATA["reduce_output"]
return schema_utils.ReductionOptions(
default=default,
description=description,
parameter_metadata=parameter_metadata,
)
================================================
FILE: ludwig/schema/decoders/__init__.py
================================================
# Register all decoders
import ludwig.schema.decoders.base  # noqa
import ludwig.schema.decoders.image_decoders # noqa
import ludwig.schema.decoders.llm_decoders # noqa
import ludwig.schema.decoders.sequence_decoders # noqa
================================================
FILE: ludwig/schema/decoders/base.py
================================================
from abc import ABC
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, CATEGORY, MODEL_ECD, MODEL_LLM, NUMBER, SET, TIMESERIES, VECTOR
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.utils import register_decoder_config
from ludwig.schema.metadata import DECODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class BaseDecoderConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Base class for decoders."""
type: str = schema_utils.StringOptions(
["regressor", "classifier", "projector", "generator", "tagger"],
default=None,
allow_none=True,
description="The type of decoder to use.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["type"],
)
fc_layers: list[dict] = common_fields.FCLayersField()
num_fc_layers: int = common_fields.NumFCLayersField(
description="Number of fully-connected layers if `fc_layers` not specified."
)
fc_output_size: int = schema_utils.PositiveInteger(
default=256,
description="Output size of fully connected stack.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_output_size"],
)
fc_use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector in the fc_stack.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_use_bias"],
)
fc_weights_initializer: str | dict = schema_utils.OneOfOptionsField(
default="xavier_uniform",
allow_none=True,
description="The weights initializer to use for the layers in the fc_stack.",
field_options=[
schema_utils.InitializerOptions(
description="Preconfigured initializer to use for the layers in the fc_stack.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_weights_initializer"],
),
schema_utils.Dict(
description="Custom initializer to use for the layers in the fc_stack.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_weights_initializer"],
),
],
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_weights_initializer"],
)
fc_bias_initializer: str | dict = schema_utils.OneOfOptionsField(
default="zeros",
allow_none=True,
description="The bias initializer to use for the layers in the fc_stack.",
field_options=[
schema_utils.InitializerOptions(
description="Preconfigured bias initializer to use for the layers in the fc_stack.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_bias_initializer"],
),
schema_utils.Dict(
description="Custom bias initializer to use for the layers in the fc_stack.",
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_bias_initializer"],
),
],
parameter_metadata=DECODER_METADATA["BaseDecoder"]["fc_bias_initializer"],
)
fc_norm: str = common_fields.NormField()
fc_norm_params: dict = common_fields.NormParamsField()
fc_activation: str = schema_utils.ActivationOptions(default="relu")
fc_dropout: float = common_fields.DropoutField()
@DeveloperAPI
@ludwig_dataclass
class PassthroughDecoderConfig(BaseDecoderConfig):
"""PassthroughDecoderConfig is a dataclass that configures the parameters used for a passthrough decoder."""
@classmethod
def module_name(cls):
return "PassthroughDecoder"
type: str = schema_utils.ProtectedString(
"passthrough",
description="The passthrough decoder simply returns the raw numerical values coming from the combiner as "
"outputs.",
parameter_metadata=DECODER_METADATA["PassthroughDecoder"]["type"],
)
input_size: int = schema_utils.PositiveInteger(
default=1,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["PassthroughDecoder"]["input_size"],
)
@DeveloperAPI
@register_decoder_config("regressor", [BINARY, NUMBER], model_types=[MODEL_ECD])
@ludwig_dataclass
class RegressorConfig(BaseDecoderConfig):
"""RegressorConfig is a dataclass that configures the parameters used for a regressor decoder."""
@classmethod
def module_name(cls):
return "Regressor"
type: str = schema_utils.ProtectedString(
"regressor",
description=DECODER_METADATA["Regressor"]["type"].long_description,
)
input_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["Regressor"]["input_size"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=DECODER_METADATA["Regressor"]["use_bias"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer for the weight matrix.",
parameter_metadata=DECODER_METADATA["Regressor"]["weights_initializer"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer for the bias vector.",
parameter_metadata=DECODER_METADATA["Regressor"]["bias_initializer"],
)
@DeveloperAPI
@register_decoder_config("projector", [VECTOR, TIMESERIES], model_types=[MODEL_ECD])
@ludwig_dataclass
class ProjectorConfig(BaseDecoderConfig):
"""ProjectorConfig is a dataclass that configures the parameters used for a projector decoder."""
@classmethod
def module_name(cls):
return "Projector"
type: str = schema_utils.ProtectedString(
"projector",
description=DECODER_METADATA["Projector"]["type"].long_description,
)
input_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["Projector"]["input_size"],
)
output_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the output of the decoder.",
parameter_metadata=DECODER_METADATA["Projector"]["output_size"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=DECODER_METADATA["Projector"]["use_bias"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer for the weight matrix.",
parameter_metadata=DECODER_METADATA["Projector"]["weights_initializer"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer for the bias vector.",
parameter_metadata=DECODER_METADATA["Projector"]["bias_initializer"],
)
activation: str = schema_utils.ActivationOptions(
default=None,
description="Indicates the activation function applied to the output.",
parameter_metadata=DECODER_METADATA["Projector"]["activation"],
)
multiplier: float = schema_utils.FloatRange(
default=1.0,
min=0,
min_inclusive=False,
description=(
"Multiplier to scale the activated outputs by. Useful when setting `activation` to something "
"that outputs a value between [-1, 1] like tanh to re-scale values back to order of magnitude of "
"the data you're trying to predict. A good rule of thumb in such cases is to pick a value like "
"`x * (max - min)` where x is a scalar in the range [1, 2]. For example, if you're trying to predict "
"something like temperature, it might make sense to pick a multiplier on the order of `100`."
),
)
clip: list[int] | tuple[int] = schema_utils.FloatRangeTupleDataclassField(
n=2,
default=None,
allow_none=True,
min=0,
max=999999999,
description="Clip the output of the decoder to be within the given range.",
parameter_metadata=DECODER_METADATA["Projector"]["clip"],
)
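# Illustration only (not part of the Ludwig source): a worked example of the `multiplier` rule of
# thumb, `x * (max - min)` with `x` in [1, 2]. The helper `scale_and_clip`, the sample temperature
# range, and the order of operations (activation, then multiplier, then clip) are assumptions here.

```python
from __future__ import annotations

import math


def scale_and_clip(raw: float, multiplier: float = 1.0, clip: tuple[float, float] | None = None) -> float:
    """Apply tanh, rescale by `multiplier`, then optionally clamp to the `clip` range."""
    scaled = math.tanh(raw) * multiplier
    if clip is not None:
        low, high = clip
        scaled = min(max(scaled, low), high)
    return scaled


# Hypothetical target range: temperatures between -20 and 45 degrees.
data_min, data_max = -20.0, 45.0
multiplier = 1.5 * (data_max - data_min)  # x = 1.5, so 1.5 * 65

print(multiplier)
print(scale_and_clip(0.3, multiplier))
print(scale_and_clip(10.0, multiplier, clip=(0.0, 50.0)))
```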
@DeveloperAPI
@register_decoder_config("classifier", [CATEGORY, SET], model_types=[MODEL_ECD, MODEL_LLM])
@ludwig_dataclass
class ClassifierConfig(BaseDecoderConfig):
@classmethod
def module_name(cls):
return "Classifier"
type: str = schema_utils.ProtectedString(
"classifier",
description=DECODER_METADATA["Classifier"]["type"].long_description,
)
input_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["Classifier"]["input_size"],
)
num_classes: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of classes to predict.",
parameter_metadata=DECODER_METADATA["Classifier"]["num_classes"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=DECODER_METADATA["Classifier"]["use_bias"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer for the weight matrix.",
parameter_metadata=DECODER_METADATA["Classifier"]["weights_initializer"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer for the bias vector.",
parameter_metadata=DECODER_METADATA["Classifier"]["bias_initializer"],
)
================================================
FILE: ludwig/schema/decoders/image_decoders.py
================================================
from typing import TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import IMAGE, MODEL_ECD
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import register_decoder_config
from ludwig.schema.metadata import DECODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
if TYPE_CHECKING:
from ludwig.schema.features.preprocessing.image import ImagePreprocessingConfig
class ImageDecoderConfig(BaseDecoderConfig):
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "ImagePreprocessingConfig"):
preprocessing.requires_equal_dimensions = False
preprocessing.height = None
preprocessing.width = None
@DeveloperAPI
@register_decoder_config("unet", [IMAGE], model_types=[MODEL_ECD])
@ludwig_dataclass
class UNetDecoderConfig(ImageDecoderConfig):
@staticmethod
def module_name():
return "UNetDecoder"
type: str = schema_utils.ProtectedString(
"unet",
description=DECODER_METADATA["UNetDecoder"]["type"].long_description,
)
input_size: int = schema_utils.PositiveInteger(
default=1024,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["UNetDecoder"]["input_size"],
)
height: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Height of the output image.",
parameter_metadata=DECODER_METADATA["UNetDecoder"]["height"],
)
width: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Width of the output image.",
parameter_metadata=DECODER_METADATA["UNetDecoder"]["width"],
)
num_channels: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of channels in the output image.",
parameter_metadata=DECODER_METADATA["UNetDecoder"]["num_channels"],
)
conv_norm: str | None = schema_utils.StringOptions(
["batch"],
default="batch",
allow_none=True,
description="This is the default norm that will be used for each double conv layer. It can be null or batch.",
parameter_metadata=DECODER_METADATA["UNetDecoder"]["conv_norm"],
)
num_classes: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of classes to predict in the output.",
parameter_metadata=DECODER_METADATA["UNetDecoder"]["num_classes"],
)
================================================
FILE: ludwig/schema/decoders/llm_decoders.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import CATEGORY, MODEL_LLM, TEXT
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import register_decoder_config
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class BaseExtractorDecoderConfig(BaseMarshmallowConfig):
tokenizer: str = "hf_tokenizer"
input_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the input to the decoder.",
)
pretrained_model_name_or_path: str = schema_utils.String(
default="",
allow_none=True,
description="Path to the pretrained model or model identifier from huggingface.co/models.",
)
vocab_file: str = schema_utils.String(
default="",
allow_none=True,
description="Path to the vocabulary file.",
)
max_new_tokens: int = schema_utils.Integer(
default=None,
allow_none=True,
description="Maximum number of new tokens that will be generated.",
)
@DeveloperAPI
@register_decoder_config("text_extractor", [TEXT], model_types=[MODEL_LLM])
@ludwig_dataclass
class TextExtractorDecoderConfig(BaseExtractorDecoderConfig, BaseDecoderConfig):
@classmethod
def module_name(cls):
return "TextExtractorDecoder"
type: str = schema_utils.ProtectedString("text_extractor")
@DeveloperAPI
@register_decoder_config("category_extractor", [CATEGORY], model_types=[MODEL_LLM])
@ludwig_dataclass
class CategoryExtractorDecoderConfig(BaseExtractorDecoderConfig, BaseDecoderConfig):
@classmethod
def module_name(cls):
return "CategoryExtractorDecoder"
type: str = schema_utils.ProtectedString("category_extractor")
# `match` maps each label class to the match-pattern definition used to recognize it in the LLM output.
match: dict[str, dict[str, Any]] = schema_utils.Dict(
default=None,
allow_none=False,
description="A dictionary of label classes and their corresponding "
"match patterns definitions that will be used to parse the output "
"of the LLM.",
)
str2idx: dict[str, int] = schema_utils.Dict(
default=None,
allow_none=True,
description="A dictionary of label classes and their corresponding "
"indices that will be used to parse the output of the LLM.",
)
fallback_label: str = schema_utils.String(
default="",
allow_none=True,
description="The label to use if the parser fails to parse the input.",
)
================================================
FILE: ludwig/schema/decoders/sequence_decoders.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, SEQUENCE, TEXT
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import register_decoder_config
from ludwig.schema.metadata import DECODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_decoder_config("generator", [SEQUENCE, TEXT], model_types=[MODEL_ECD])
@ludwig_dataclass
class SequenceGeneratorDecoderConfig(BaseDecoderConfig):
@staticmethod
def module_name():
return "SequenceGeneratorDecoder"
type: str = schema_utils.ProtectedString(
"generator",
description=DECODER_METADATA["SequenceGeneratorDecoder"]["type"].long_description,
)
vocab_size: int = common_fields.VocabSizeField()
max_sequence_length: int = common_fields.MaxSequenceLengthField()
cell_type: str = schema_utils.StringOptions(
["rnn", "lstm", "gru"],
default="gru",
description="Type of recurrent cell to use.",
parameter_metadata=DECODER_METADATA["SequenceGeneratorDecoder"]["cell_type"],
)
input_size: int = schema_utils.PositiveInteger(
default=256,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["SequenceGeneratorDecoder"]["input_size"],
)
reduce_input: str = schema_utils.StringOptions(
["sum", "mean", "avg", "max", "concat", "last"],
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension).",
parameter_metadata=DECODER_METADATA["SequenceGeneratorDecoder"]["reduce_input"],
)
num_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of stacked recurrent layers.",
parameter_metadata=DECODER_METADATA["SequenceGeneratorDecoder"]["num_layers"],
)
@DeveloperAPI
@register_decoder_config("tagger", [SEQUENCE, TEXT], model_types=[MODEL_ECD])
@ludwig_dataclass
class SequenceTaggerDecoderConfig(BaseDecoderConfig):
@classmethod
def module_name(cls):
return "SequenceTaggerDecoder"
type: str = schema_utils.ProtectedString(
"tagger",
description=DECODER_METADATA["SequenceTaggerDecoder"]["type"].long_description,
)
input_size: int = schema_utils.PositiveInteger(
default=256,
description="Size of the input to the decoder.",
parameter_metadata=DECODER_METADATA["SequenceTaggerDecoder"]["input_size"],
)
vocab_size: int = common_fields.VocabSizeField()
max_sequence_length: int = common_fields.MaxSequenceLengthField()
use_attention: bool = schema_utils.Boolean(
default=False,
description="Whether to apply a multi-head self attention layer before prediction.",
parameter_metadata=DECODER_METADATA["SequenceTaggerDecoder"]["use_attention"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=DECODER_METADATA["SequenceTaggerDecoder"]["use_bias"],
)
attention_embedding_size: int = schema_utils.PositiveInteger(
default=256,
description="The embedding size of the multi-head self attention layer.",
parameter_metadata=DECODER_METADATA["SequenceTaggerDecoder"]["attention_embedding_size"],
)
attention_num_heads: int = schema_utils.PositiveInteger(
default=8,
description="The number of attention heads in the multi-head self attention layer.",
parameter_metadata=DECODER_METADATA["SequenceTaggerDecoder"]["attention_num_heads"],
)
================================================
FILE: ludwig/schema/decoders/utils.py
================================================
from dataclasses import Field
from typing import Any, TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, TYPE
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import DECODER_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json
from ludwig.utils.registry import Registry
if TYPE_CHECKING:
from ludwig.schema.decoders.base import BaseDecoderConfig
decoder_config_registry = Registry()
@DeveloperAPI
def register_decoder_config(name: str, features: str | list[str], model_types: list[str] | None = None):
if model_types is None:
model_types = [MODEL_ECD]
if isinstance(features, str):
features = [features]
def wrap(cls):
for model_type in model_types:
for feature in features:
key = (model_type, feature)
feature_registry = decoder_config_registry.get(key, {})
feature_registry[name] = cls
decoder_config_registry[key] = feature_registry
return cls
return wrap
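# Illustration only (not part of the Ludwig source): a self-contained sketch of the registration
# pattern used by `register_decoder_config`. A plain dict stands in for `Registry`, and the strings
# "ecd" and "llm", the `register` function, and the demo classes are placeholders for this sketch.

```python
from __future__ import annotations

# Maps (model_type, feature_type) -> {decoder name -> config class}.
registry: dict[tuple[str, str], dict[str, type]] = {}


def register(name, features, model_types=None):
    if model_types is None:
        model_types = ["ecd"]
    if isinstance(features, str):
        features = [features]

    def wrap(cls):
        # One entry per (model_type, feature) pair, so lookups need no scanning.
        for model_type in model_types:
            for feature in features:
                registry.setdefault((model_type, feature), {})[name] = cls
        return cls

    return wrap


@register("classifier", ["category", "set"], model_types=["ecd", "llm"])
class DemoClassifierConfig:
    pass


@register("regressor", "number")
class DemoRegressorConfig:
    pass


print(sorted(registry))
```

# Registering a class under several feature types stores the same class object under each key, so a
# lookup such as `registry[("llm", "set")]["classifier"]` returns `DemoClassifierConfig` directly.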
@DeveloperAPI
def get_decoder_cls(model_type: str, feature: str, name: str):
return decoder_config_registry[(model_type, feature)][name]
@DeveloperAPI
def get_decoder_classes(model_type: str, feature: str) -> dict[str, type["BaseDecoderConfig"]]:
return decoder_config_registry[(model_type, feature)]
@DeveloperAPI
def get_decoder_descriptions(model_type: str, feature_type: str):
"""This function returns a dictionary of decoder descriptions available at the type selection.
The process works as follows: 1) Get a dictionary of valid decoders from the decoder config registry,
but invert the key/value pairs, since `valid_decoders` is later indexed by each class's `module_name()`
rather than by its registered name. 2) Loop through the decoder metadata entries; if a metadata entry has a
decoder name that matches a valid decoder, add the description metadata to the output dictionary.
Args:
model_type (str): The model type to get decoder descriptions for
feature_type (str): The feature type to get decoder descriptions for
Returns:
dict: A dictionary of decoder descriptions
"""
output = {}
valid_decoders = {
cls.module_name() if hasattr(cls, "module_name") else None: registered_name
for registered_name, cls in get_decoder_classes(model_type, feature_type).items()
}
for k, v in DECODER_METADATA.items():
if k in valid_decoders:
output[valid_decoders[k]] = convert_metadata_to_json(v[TYPE])
return output
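# Illustration only (not part of the Ludwig source): the two steps from the docstring of
# `get_decoder_descriptions`, run on toy data. The demo classes and the `metadata` dict are
# hypothetical, and plain strings stand in for the converted metadata.

```python
class DemoRegressor:
    @classmethod
    def module_name(cls):
        return "Regressor"


class DemoNoName:
    """A class without `module_name`, which ends up under the `None` key."""


decoder_classes = {"regressor": DemoRegressor, "custom": DemoNoName}
metadata = {
    "Regressor": {"type": "Predicts a number."},
    "Classifier": {"type": "Predicts a class."},
}

# Step 1: invert the registry so it is keyed by module name instead of registered name.
valid = {
    (cls.module_name() if hasattr(cls, "module_name") else None): registered_name
    for registered_name, cls in decoder_classes.items()
}

# Step 2: keep only the metadata entries whose key matches a valid decoder.
descriptions = {valid[key]: entry["type"] for key, entry in metadata.items() if key in valid}

print(descriptions)
```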
@DeveloperAPI
def get_decoder_conds(decoder_classes: dict[str, type["BaseDecoderConfig"]]) -> list[dict[str, Any]]:
"""Returns a JSON schema of conditionals to validate against decoder types for specific feature types."""
conds = []
for decoder_type, decoder_cls in decoder_classes.items():
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(decoder_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
decoder_cond = schema_utils.create_cond(
{"type": decoder_type},
other_props,
)
conds.append(decoder_cond)
return conds
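# Illustration only (not part of the Ludwig source): a sketch of the conditional list built by
# `get_decoder_conds`, assuming `create_cond` produces a standard JSON Schema `if`/`then` pair.
# `make_cond`, `props_for`, and the toy property schemas are hypothetical, and the matching logic is
# a simplification of real JSON Schema evaluation.

```python
from __future__ import annotations

from typing import Any


def make_cond(if_pred: dict[str, Any], then_props: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for `schema_utils.create_cond`."""
    return {
        "if": {"properties": {key: {"const": value} for key, value in if_pred.items()}},
        "then": {"properties": then_props},
    }


# Toy property schemas keyed by decoder type.
decoder_props = {
    "regressor": {"use_bias": {"type": "boolean"}},
    "classifier": {"num_classes": {"type": ["integer", "null"]}},
}

conds = [make_cond({"type": name}, props) for name, props in decoder_props.items()]


def props_for(config: dict[str, Any]) -> dict[str, Any]:
    """Collect the `then` properties of every conditional whose `if` constants match `config`."""
    matched: dict[str, Any] = {}
    for cond in conds:
        wanted = cond["if"]["properties"]
        if all(config.get(key) == spec["const"] for key, spec in wanted.items()):
            matched.update(cond["then"]["properties"])
    return matched


print(props_for({"type": "classifier"}))
```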
@DeveloperAPI
def DecoderDataclassField(model_type: str, feature_type: str, default: str) -> Field:
"""Custom dataclass field that when used inside a dataclass will allow the user to specify a decoder config.
Returns: Initialized dataclass field that converts an untyped dict with params to a decoder config.
"""
decoder_registry = get_decoder_classes(model_type, feature_type)
class DecoderSelection(schema_utils.TypeSelection):
def __init__(self):
super().__init__(registry=decoder_registry, default_value=default, allow_str_value=True)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return decoder_registry[key]
def _jsonschema_type_mapping(self):
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": list(decoder_registry.keys()),
"enumDescriptions": get_decoder_descriptions(model_type, feature_type),
"default": default,
},
},
"title": "decoder_options",
"allOf": get_decoder_conds(decoder_registry),
}
return DecoderSelection().get_default_field()
================================================
FILE: ludwig/schema/defaults/__init__.py
================================================
================================================
FILE: ludwig/schema/defaults/base.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class BaseDefaultsConfig(schema_utils.BaseMarshmallowConfig):
"""Base defaults config class."""
================================================
FILE: ludwig/schema/defaults/ecd.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUDIO,
BAG,
BINARY,
CATEGORY,
DATE,
H3,
IMAGE,
NUMBER,
SEQUENCE,
SET,
TEXT,
TIMESERIES,
VECTOR,
)
from ludwig.schema import utils as schema_utils
from ludwig.schema.defaults.base import BaseDefaultsConfig
from ludwig.schema.defaults.utils import DefaultsDataclassField
from ludwig.schema.features.base import BaseFeatureConfig
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class ECDDefaultsConfig(BaseDefaultsConfig):
audio: BaseFeatureConfig = DefaultsDataclassField(feature_type=AUDIO)
bag: BaseFeatureConfig = DefaultsDataclassField(feature_type=BAG)
binary: BaseFeatureConfig = DefaultsDataclassField(feature_type=BINARY)
category: BaseFeatureConfig = DefaultsDataclassField(feature_type=CATEGORY)
date: BaseFeatureConfig = DefaultsDataclassField(feature_type=DATE)
h3: BaseFeatureConfig = DefaultsDataclassField(feature_type=H3)
image: BaseFeatureConfig = DefaultsDataclassField(feature_type=IMAGE)
number: BaseFeatureConfig = DefaultsDataclassField(feature_type=NUMBER)
sequence: BaseFeatureConfig = DefaultsDataclassField(feature_type=SEQUENCE)
set: BaseFeatureConfig = DefaultsDataclassField(feature_type=SET)
text: BaseFeatureConfig = DefaultsDataclassField(feature_type=TEXT)
timeseries: BaseFeatureConfig = DefaultsDataclassField(feature_type=TIMESERIES)
vector: BaseFeatureConfig = DefaultsDataclassField(feature_type=VECTOR)
@DeveloperAPI
class ECDDefaultsField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(ECDDefaultsConfig)
================================================
FILE: ludwig/schema/defaults/llm.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import TEXT
from ludwig.schema import utils as schema_utils
from ludwig.schema.defaults.base import BaseDefaultsConfig
from ludwig.schema.defaults.utils import DefaultsDataclassField
from ludwig.schema.features.base import BaseFeatureConfig
from ludwig.schema.features.utils import llm_defaults_config_registry
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class LLMDefaultsConfig(BaseDefaultsConfig):
text: BaseFeatureConfig = DefaultsDataclassField(feature_type=TEXT, defaults_registry=llm_defaults_config_registry)
@DeveloperAPI
class LLMDefaultsField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(LLMDefaultsConfig)
================================================
FILE: ludwig/schema/defaults/utils.py
================================================
from dataclasses import field
import ludwig.schema.utils as schema_utils
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema.features.utils import ecd_defaults_config_registry
from ludwig.utils.registry import Registry
@DeveloperAPI
def DefaultsDataclassField(feature_type: str, defaults_registry: Registry = ecd_defaults_config_registry):
"""Custom dataclass field that when used inside a dataclass will allow the user to specify a nested default
config for a specific feature type.
Returns: Initialized dataclass field that converts an untyped dict with params to a defaults config.
"""
class DefaultMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field that deserializes a dict for a valid defaults config from the defaults registry and creates
a corresponding JSON schema for external usage."""
def _deserialize(self, value, attr, data, **kwargs):
if value is None:
return None
if isinstance(value, dict):
defaults_class = defaults_registry[feature_type]
try:
return defaults_class.Schema().load(value)
except (TypeError, ConfigValidationError) as error:
raise ConfigValidationError(f"Invalid params: {value}, see `{attr}` definition. Error: {error}") from error
raise ConfigValidationError(f"Invalid params: {value}")
def _jsonschema_type_mapping(self):
defaults_cls = defaults_registry[feature_type]
props = schema_utils.unload_jsonschema_from_marshmallow_class(defaults_cls)["properties"]
return {
"type": "object",
"properties": props,
"additionalProperties": False,
"title": "defaults_options",
}
try:
defaults_cls = defaults_registry[feature_type]
dump_default = defaults_cls.Schema().dump({})
load_default = lambda: defaults_cls.Schema().load({})
return field(
metadata={
"marshmallow_field": DefaultMarshmallowField(
allow_none=False,
dump_default=dump_default,
load_default=load_default,
)
},
default_factory=load_default,
)
except Exception as e:
raise ConfigValidationError(
f"Unsupported feature type: {feature_type}. Allowed: {defaults_registry.keys()}. Details: {e}"
) from e
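# Illustration only, not part of Ludwig: a minimal stdlib sketch of the pattern used by
# DefaultsDataclassField, where a registry lookup picks the defaults class and a
# default_factory builds a fresh instance per parent config. NumberDefaults,
# DEFAULTS_REGISTRY, defaults_field and the "zscore" value are hypothetical names
# chosen for this example; marshmallow (de)serialization is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class NumberDefaults:
    """Hypothetical per-feature defaults; not a Ludwig class."""

    normalization: str = "zscore"


# Hypothetical registry mapping a feature type to its defaults class.
DEFAULTS_REGISTRY = {"number": NumberDefaults}


def defaults_field(feature_type: str):
    """Return a dataclass field whose default is a fresh defaults object."""
    try:
        defaults_cls = DEFAULTS_REGISTRY[feature_type]
    except KeyError as error:
        allowed = sorted(DEFAULTS_REGISTRY)
        raise ValueError(f"Unsupported feature type: {feature_type}. Allowed: {allowed}") from error
    # default_factory is called once per parent instance, so defaults are never shared.
    return field(default_factory=defaults_cls)


@dataclass
class ModelDefaults:
    number: NumberDefaults = defaults_field("number")


first, second = ModelDefaults(), ModelDefaults()
```

# Using a factory rather than a single shared default object means mutating
# first.number leaves second.number untouched.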
================================================
FILE: ludwig/schema/encoders/__init__.py
================================================
# Register all encoder schemas
import ludwig.schema.encoders.bag_encoders
import ludwig.schema.encoders.category_encoders
import ludwig.schema.encoders.date_encoders
import ludwig.schema.encoders.h3_encoders
import ludwig.schema.encoders.image
import ludwig.schema.encoders.sequence_encoders
import ludwig.schema.encoders.set_encoders
import ludwig.schema.encoders.text_encoders # noqa
================================================
FILE: ludwig/schema/encoders/bag_encoders.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BAG
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_encoder_config("embed", BAG)
@ludwig_dataclass
class BagEmbedWeightedConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "BagEmbedWeighted"
type: str = schema_utils.ProtectedString(
"embed",
description=ENCODER_METADATA["BagEmbedWeighted"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout probability for the embedding.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["activation"],
)
vocab: list[str] = schema_utils.List(
default=None,
description="Vocabulary of the encoder",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["vocab"],
)
representation: str = schema_utils.StringOptions(
["dense", "sparse"],
default="dense",
description="The representation of the embedding. Either dense or sparse.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["representation"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=50,
description="The maximum embedding size, the actual size will be min(vocabulary_size, embedding_size) for "
"dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size "
"is the number of different strings appearing in the training set in the input column (plus 1 for "
"the unknown token placeholder).",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["embedding_size"],
)
force_embedding_size: bool = schema_utils.Boolean(
default=False,
description="Force the embedding size to be equal to the vocabulary size. This parameter has effect only if "
"representation is dense.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["force_embedding_size"],
)
embeddings_on_cpu: bool = schema_utils.Boolean(
default=False,
description="By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster "
"access, but in some cases the embedding matrix may be too large. This parameter forces the "
"placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, "
"slightly slowing down the process as a result of data transfer between CPU and GPU memory.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["embeddings_on_cpu"],
)
embeddings_trainable: bool = schema_utils.Boolean(
default=True,
description="If true, embeddings are trained during the training process; if false, embeddings are fixed. This "
"may be useful when loading pretrained embeddings to avoid fine-tuning them. This parameter "
"has effect only when representation is dense, as sparse one-hot encodings are not trainable.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["embeddings_trainable"],
)
pretrained_embeddings: str = schema_utils.String(
default=None,
allow_none=True,
description="By default dense embeddings are initialized randomly, but this parameter allows specifying a "
"path to a file containing embeddings in the GloVe format. When the file containing the "
"embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, "
"the others are discarded. If the vocabulary contains strings that have no match in the "
"embeddings file, their embeddings are initialized with the average of all other embeddings plus "
"some random noise to make them different from each other. This parameter has effect only if "
"representation is dense.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["pretrained_embeddings"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["use_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["weights_initializer"],
)
output_size: int = schema_utils.PositiveInteger(
default=10,
description="If output_size is not already specified in fc_layers this is the default output_size that will "
"be used for each layer. It indicates the size of the output of a fully connected layer.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["norm_params"],
)
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=0,
description="This is the number of stacked fully connected layers that the input to the feature passes "
"through. Their output is projected in the feature's output space.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["BagEmbedWeighted"]["fc_layers"],
)
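# Illustration only, not part of Ludwig: a small sketch of the sizing rule stated in the
# embedding_size description above. The function name and its arguments are hypothetical,
# and force_embedding_size is intentionally not modeled.

```python
def effective_embedding_size(column_values: list[str], embedding_size: int, representation: str = "dense") -> int:
    """Resolve the embedding width from the training column, per the description text."""
    # Distinct strings in the column, plus one slot for the unknown-token placeholder.
    vocabulary_size = len(set(column_values)) + 1
    if representation == "sparse":
        # Sparse encodings always use exactly vocabulary_size dimensions.
        return vocabulary_size
    if representation == "dense":
        # Dense embeddings are capped at the vocabulary size.
        return min(vocabulary_size, embedding_size)
    raise ValueError(f"Unknown representation: {representation}")


values = ["a", "b", "a", "c"]  # three distinct strings, so vocabulary_size is 4
dense_default = effective_embedding_size(values, 50)
dense_small = effective_embedding_size(values, 2)
sparse = effective_embedding_size(values, 50, representation="sparse")
```

# With this tiny column the default embedding_size of 50 is reduced to 4, a requested
# size of 2 is kept as-is, and the sparse representation ignores embedding_size entirely.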
================================================
FILE: ludwig/schema/encoders/base.py
================================================
from abc import ABC
from typing import TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, MODEL_ECD, MODEL_LLM, NUMBER, TEXT, TIMESERIES, VECTOR
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
if TYPE_CHECKING:
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
@DeveloperAPI
@ludwig_dataclass
class BaseEncoderConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Base class for encoders."""
type: str
skip: bool = schema_utils.Boolean(
False,
"[internal] Whether to skip encoder and use input as output.",
parameter_metadata=ENCODER_METADATA["BaseEncoder"]["skip"],
)
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "BasePreprocessingConfig"):
pass
def is_pretrained(self) -> bool:
return False
def can_cache_embeddings(self) -> bool:
return False
@DeveloperAPI
@register_encoder_config("passthrough", [TEXT], model_types=[MODEL_LLM])
@register_encoder_config("passthrough", [BINARY, NUMBER, VECTOR], model_types=[MODEL_ECD])
@ludwig_dataclass
class PassthroughEncoderConfig(BaseEncoderConfig):
"""PassthroughEncoderConfig is a dataclass that configures the parameters used for a passthrough encoder."""
@staticmethod
def module_name():
return "PassthroughEncoder"
type: str = schema_utils.ProtectedString(
"passthrough",
description=ENCODER_METADATA["PassthroughEncoder"]["type"].long_description,
)
@DeveloperAPI
@register_encoder_config("dense", [BINARY, NUMBER, VECTOR, TIMESERIES])
@ludwig_dataclass
class DenseEncoderConfig(BaseEncoderConfig):
"""DenseEncoderConfig is a dataclass that configures the parameters used for a dense encoder."""
@staticmethod
def module_name():
return "DenseEncoder"
type: str = schema_utils.ProtectedString(
"dense",
description=ENCODER_METADATA["DenseEncoder"]["type"].long_description,
)
dropout: float = common_fields.DropoutField()
activation: str = schema_utils.ActivationOptions()
input_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the input to the dense encoder.",
parameter_metadata=ENCODER_METADATA["DenseEncoder"]["input_size"],
)
output_size: int = schema_utils.PositiveInteger(
default=256,
description="Size of the output of the feature.",
parameter_metadata=ENCODER_METADATA["DenseEncoder"]["output_size"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["DenseEncoder"]["use_bias"],
)
bias_initializer: str | dict = common_fields.BiasInitializerField()
weights_initializer: str | dict = common_fields.WeightsInitializerField()
norm: str = common_fields.NormField()
norm_params: dict = common_fields.NormParamsField()
num_layers: int = common_fields.NumFCLayersField(default=1, non_zero=True)
fc_layers: list[dict] = common_fields.FCLayersField()
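# Illustration only, not part of Ludwig: a sketch of how stacked registration decorators,
# as used on PassthroughEncoderConfig above, can file one class under several
# (model type, feature type) pairs. register_encoder_sketch and ENCODER_REGISTRY are
# hypothetical stand-ins for register_encoder_config and its registry; the string values
# "ecd" and "llm" and the default of model_types are assumptions.

```python
from collections import defaultdict

# Hypothetical registry: model type -> feature type -> encoder name -> config class.
ENCODER_REGISTRY: dict = defaultdict(lambda: defaultdict(dict))


def register_encoder_sketch(name, features, model_types=("ecd",)):
    """Hypothetical stand-in for register_encoder_config; the real decorator may differ."""
    if isinstance(features, str):
        # Accept a single feature type as well as a list of them.
        features = [features]

    def wrap(cls):
        for model_type in model_types:
            for feature in features:
                ENCODER_REGISTRY[model_type][feature][name] = cls
        # Return the class unchanged so decorators can be stacked.
        return cls

    return wrap


@register_encoder_sketch("passthrough", ["text"], model_types=["llm"])
@register_encoder_sketch("passthrough", ["binary", "number", "vector"], model_types=["ecd"])
class PassthroughSketch:
    """Placeholder config class used only for this example."""
```

# Because each decorator returns the class it received, the two registrations compose:
# the same class is reachable for text features under "llm" and for binary, number and
# vector features under "ecd".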
================================================
FILE: ludwig/schema/encoders/category_encoders.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import CATEGORY, MODEL_ECD
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_encoder_config("passthrough", CATEGORY, model_types=[MODEL_ECD])
@ludwig_dataclass
class CategoricalPassthroughEncoderConfig(BaseEncoderConfig):
"""CategoricalPassthroughEncoderConfig is a dataclass that configures the parameters used for a categorical
passthrough encoder."""
@staticmethod
def module_name():
return "CategoricalPassthroughEncoder"
type: str = schema_utils.ProtectedString(
"passthrough",
description=ENCODER_METADATA["PassthroughEncoder"]["type"].long_description,
)
@DeveloperAPI
@register_encoder_config("dense", CATEGORY)
@ludwig_dataclass
class CategoricalEmbedConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "CategoricalEmbed"
type: str = schema_utils.ProtectedString(
"dense",
description=ENCODER_METADATA["CategoricalEmbed"]["type"].long_description,
)
dropout: float = common_fields.DropoutField()
vocab: list[str] = common_fields.VocabField()
embedding_initializer: str = common_fields.EmbeddingInitializerField()
embedding_size: int = common_fields.EmbeddingSizeField(
default=50,
description=(
"The maximum embedding size, the actual size will be min(vocabulary_size, embedding_size) for "
"dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size "
"is the number of different strings appearing in the training set in the column the feature is "
"named after (plus 1 for the unknown token placeholder)."
),
)
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
@DeveloperAPI
@register_encoder_config("sparse", CATEGORY)
@ludwig_dataclass
class CategoricalSparseConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "CategorySparse"
type: str = schema_utils.ProtectedString(
"sparse",
description=ENCODER_METADATA["CategoricalSparse"]["type"].long_description,
)
dropout: float = common_fields.DropoutField()
vocab: list[str] = common_fields.VocabField()
embedding_initializer: str = common_fields.EmbeddingInitializerField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
# TODO(travis): seems like this is not really a valid user option. We should probably just remove these
# params entirely and update the encoder implementation.
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField(default=False)
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
@DeveloperAPI
@register_encoder_config("onehot", CATEGORY, model_types=[MODEL_ECD])
@ludwig_dataclass
class CategoricalOneHotEncoderConfig(BaseEncoderConfig):
"""CategoricalOneHotEncoderConfig is a dataclass that configures the parameters used for a categorical onehot
encoder."""
type: str = schema_utils.ProtectedString(
"onehot",
description="Type of encoder.",
)
vocab: list[str] = common_fields.VocabField()
def can_cache_embeddings(self) -> bool:
return True
================================================
FILE: ludwig/schema/encoders/date_encoders.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DATE
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_encoder_config("embed", DATE)
@ludwig_dataclass
class DateEmbedConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "DateEmbed"
type: str = schema_utils.ProtectedString(
"embed",
description=ENCODER_METADATA["DateEmbed"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout probability for the embedding.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["activation"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["use_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["weights_initializer"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=10,
description="The maximum embedding size adopted.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["embedding_size"],
)
embeddings_on_cpu: bool = schema_utils.Boolean(
default=False,
description="Whether to force the placement of the embedding matrix in regular memory and have the CPU "
"resolve them.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["embeddings_on_cpu"],
)
output_size: int = schema_utils.PositiveInteger(
default=10,
description="If an output_size is not already specified in fc_layers this is the default output_size that "
"will be used for each layer. It indicates the size of the output of a fully connected layer.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["norm_params"],
)
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=0,
description="The number of stacked fully connected layers.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["num_fc_layers"],
)
# TODO (Connor): Add nesting logic for fc_layers, see fully_connected_module.py
fc_layers: list[dict] = schema_utils.DictList(
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["DateEmbed"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("wave", DATE)
@ludwig_dataclass
class DateWaveConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "DateWave"
type: str = schema_utils.ProtectedString(
"wave",
description=ENCODER_METADATA["DateWave"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout probability for the embedding.",
parameter_metadata=ENCODER_METADATA["DateWave"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["DateWave"]["activation"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["DateWave"]["use_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["DateWave"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["DateWave"]["weights_initializer"],
)
output_size: int = schema_utils.PositiveInteger(
default=10,
description="If an output_size is not already specified in fc_layers this is the default output_size that "
"will be used for each layer. It indicates the size of the output of a fully connected layer.",
parameter_metadata=ENCODER_METADATA["DateWave"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["DateWave"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["DateWave"]["norm_params"],
)
num_fc_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of stacked fully connected layers.",
parameter_metadata=ENCODER_METADATA["DateWave"]["num_fc_layers"],
)
# TODO (Connor): Add nesting logic for fc_layers, see fully_connected_module.py
fc_layers: list[dict] = schema_utils.DictList(
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["DateWave"]["fc_layers"],
)
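# Illustration only, not part of Ludwig: a sketch that checks a date encoder dict against
# a few constraints visible in the two configs above (dropout in [0, 1]; num_fc_layers
# non-negative for `embed` but strictly positive for `wave`). check_date_encoder is a
# hypothetical helper; real validation goes through the marshmallow schema.

```python
def check_date_encoder(encoder: dict) -> list[str]:
    """Return a list of constraint violations for a date encoder dict (empty if none)."""
    problems = []
    encoder_type = encoder.get("type")
    if encoder_type not in ("embed", "wave"):
        problems.append(f"unknown encoder type: {encoder_type!r}")
        return problems
    dropout = encoder.get("dropout", 0.0)
    if not 0 <= dropout <= 1:
        problems.append("dropout must be in [0, 1]")
    # embed allows zero fully connected layers; wave requires at least one.
    # In both configs the default equals this minimum (0 for embed, 1 for wave).
    minimum_layers = 0 if encoder_type == "embed" else 1
    if encoder.get("num_fc_layers", minimum_layers) < minimum_layers:
        problems.append(f"num_fc_layers must be >= {minimum_layers} for {encoder_type}")
    return problems


ok_embed = check_date_encoder({"type": "embed", "num_fc_layers": 0})
bad_wave = check_date_encoder({"type": "wave", "num_fc_layers": 0})
bad_dropout = check_date_encoder({"type": "embed", "dropout": 1.5})
```

# The same num_fc_layers value of 0 passes for `embed` and fails for `wave`, which is the
# practical difference between the NonNegativeInteger and PositiveInteger fields.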
================================================
FILE: ludwig/schema/encoders/h3_encoders.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import H3
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_encoder_config("embed", H3)
@ludwig_dataclass
class H3EmbedConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "H3Embed"
type: str = schema_utils.ProtectedString(
"embed",
description=ENCODER_METADATA["H3Embed"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout probability for the embedding.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["activation"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["use_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["weights_initializer"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=10,
description="The maximum embedding size adopted.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["embedding_size"],
)
embeddings_on_cpu: bool = schema_utils.Boolean(
default=False,
description="Whether to force the placement of the embedding matrix in regular memory and have the CPU "
"resolve them.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["embeddings_on_cpu"],
)
reduce_output: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the output tensor along the `s` sequence length dimension if the rank of the "
"tensor is greater than 2.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["reduce_output"],
)
output_size: int = schema_utils.PositiveInteger(
default=10,
description="If an output_size is not already specified in fc_layers this is the default output_size that "
"will be used for each layer. It indicates the size of the output of a fully connected layer.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["norm_params"],
)
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=0,
description="The number of stacked fully connected layers.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["H3Embed"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("weighted_sum", H3)
@ludwig_dataclass
class H3WeightedSumConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "H3WeightedSum"
type: str = schema_utils.ProtectedString(
"weighted_sum",
description=ENCODER_METADATA["H3WeightedSum"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout probability for the embedding.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["activation"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["use_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["weights_initializer"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=10,
description="The maximum embedding size adopted.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["embedding_size"],
)
embeddings_on_cpu: bool = schema_utils.Boolean(
default=False,
description="Whether to force the placement of the embedding matrix in regular memory and have the CPU "
"resolve them.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["embeddings_on_cpu"],
)
should_softmax: bool = schema_utils.Boolean(
default=False,
description="Determines if the weights of the weighted sum should be passed through a softmax layer before "
"being used.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["should_softmax"],
)
output_size: int = schema_utils.PositiveInteger(
default=10,
description="If an output_size is not already specified in fc_layers this is the default output_size that "
"will be used for each layer. It indicates the size of the output of a fully connected layer.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["norm_params"],
)
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=0,
description="The number of stacked fully connected layers.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["H3WeightedSum"]["fc_layers"],
)
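# Illustration only, not part of Ludwig: a plain-Python sketch of what should_softmax
# controls in a weighted sum. The real H3WeightedSum encoder operates on tensors and
# learned weights; the function names and list-based inputs here are hypothetical.

```python
import math


def softmax(weights: list[float]) -> list[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors: list[list[float]], weights: list[float], should_softmax: bool = False) -> list[float]:
    """Combine equal-length vectors into one, optionally normalizing the weights first."""
    if len(vectors) != len(weights):
        raise ValueError("one weight per vector is required")
    if should_softmax:
        # After softmax the weights are positive and sum to 1.
        weights = softmax(weights)
    width = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(width)]


vectors = [[1.0, 0.0], [0.0, 1.0]]
raw = weighted_sum(vectors, [2.0, 2.0])
normalized = weighted_sum(vectors, [2.0, 2.0], should_softmax=True)
```

# With equal raw weights of 2.0 the plain sum scales each component by 2, while the
# softmax variant turns the weights into 0.5 each, so the result is an average.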
@DeveloperAPI
@register_encoder_config("rnn", H3)
@ludwig_dataclass
class H3RNNConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "H3RNN"
type: str = schema_utils.ProtectedString(
"rnn",
description=ENCODER_METADATA["H3RNN"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout rate",
parameter_metadata=ENCODER_METADATA["H3RNN"]["dropout"],
)
recurrent_dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout rate for the recurrent state",
parameter_metadata=ENCODER_METADATA["H3RNN"]["recurrent_dropout"],
)
activation: str = schema_utils.ActivationOptions(
default="tanh",
description="The activation function to use",
parameter_metadata=ENCODER_METADATA["H3RNN"]["activation"],
)
recurrent_activation: str = schema_utils.ActivationOptions(
default="sigmoid",
description="The activation function to use in the recurrent step",
parameter_metadata=ENCODER_METADATA["H3RNN"]["recurrent_activation"],
)
cell_type: str = schema_utils.StringOptions(
["rnn", "lstm", "lstm_block", "ln", "lstm_cudnn", "gru", "gru_block", "gru_cudnn"],
default="rnn",
description="The type of recurrent cell to use. Available values are: `rnn`, `lstm`, `lstm_block`, "
"`ln`, `lstm_cudnn`, `gru`, `gru_block`, `gru_cudnn`. For details on the differences between "
"the cells, please refer to PyTorch's documentation. We suggest using the `block` variants on "
"CPU and the `cudnn` variants on GPU because of their increased speed.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["cell_type"],
)
num_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of stacked recurrent layers.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["num_layers"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=10,
description="The size of the hidden state of the recurrent layers.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["hidden_size"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["use_bias"],
)
unit_forget_bias: bool = schema_utils.Boolean(
default=True,
description="If true, add 1 to the bias of the forget gate at initialization",
parameter_metadata=ENCODER_METADATA["H3RNN"]["unit_forget_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["weights_initializer"],
)
recurrent_initializer: str = schema_utils.InitializerOptions(
default="orthogonal",
description="The initializer for recurrent matrix weights",
parameter_metadata=ENCODER_METADATA["H3RNN"]["recurrent_initializer"],
)
reduce_output: str = schema_utils.ReductionOptions(
default="last",
description="How to reduce the output tensor along the `s` sequence length dimension if the rank of the "
"tensor is greater than 2.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["reduce_output"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=10,
description="The maximum embedding size adopted.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["embedding_size"],
)
embeddings_on_cpu: bool = schema_utils.Boolean(
default=False,
description="Whether to force the placement of the embedding matrix in regular memory and have the CPU "
"resolve them.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["embeddings_on_cpu"],
)
bidirectional: bool = schema_utils.Boolean(
default=False,
description="If true, two recurrent networks will perform encoding in the forward and backward direction and "
"their outputs will be concatenated.",
parameter_metadata=ENCODER_METADATA["H3RNN"]["bidirectional"],
)
================================================
FILE: ludwig/schema/encoders/image/__init__.py
================================================
import ludwig.schema.encoders.image.base
import ludwig.schema.encoders.image.timm # noqa
import ludwig.schema.encoders.image.torchvision # noqa
================================================
FILE: ludwig/schema/encoders/image/base.py
================================================
from typing import Any, TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import IMAGE
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.torch_utils import initializer_registry
if TYPE_CHECKING:
from ludwig.schema.features.preprocessing.image import ImagePreprocessingConfig
class ImageEncoderConfig(BaseEncoderConfig):
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "ImagePreprocessingConfig"):
preprocessing.requires_equal_dimensions = False
preprocessing.height = None
preprocessing.width = None
@DeveloperAPI
@register_encoder_config("stacked_cnn", IMAGE)
@ludwig_dataclass
class Stacked2DCNNConfig(ImageEncoderConfig):
@staticmethod
def module_name():
return "Stacked2DCNN"
type: str = schema_utils.ProtectedString(
"stacked_cnn",
description=ENCODER_METADATA["Stacked2DCNN"]["type"].long_description,
)
conv_dropout: float | None = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout rate",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["conv_dropout"],
)
conv_activation: str = schema_utils.ActivationOptions(
description="If an activation is not already specified in conv_layers this is the default activation that "
"will be used for each layer. It indicates the activation function applied to the output.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["conv_activation"],
)
height: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Height of the input image.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["height"],
)
width: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Width of the input image.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["width"],
)
num_channels: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of channels to use in the encoder. ",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["num_channels"],
)
out_channels: int | None = schema_utils.NonNegativeInteger(
default=32,
description="Indicates the number of filters, and by consequence the output channels of the 2d convolution. "
"If out_channels is not already specified in conv_layers this is the default out_channels that "
"will be used for each layer. ",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["out_channels"],
)
kernel_size: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=3,
description="An integer or pair of integers specifying the kernel size. A single integer specifies a square "
"kernel, while a pair of integers specifies the height and width of the kernel in that order (h, "
"w). If a kernel_size is not specified in conv_layers, this is the kernel_size that will be used for "
"each layer.",
field_options=[
schema_utils.PositiveInteger(allow_none=False, description="", default=3),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["kernel_size"],
)
stride: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=1,
description="An integer or pair of integers specifying the stride of the convolution along the height and "
"width. If a stride is not already specified in conv_layers, specifies the default stride of the "
"2D convolutional kernel that will be used for each layer.",
field_options=[
schema_utils.PositiveInteger(allow_none=False, description="", default=1),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["stride"],
)
padding_mode: str | None = schema_utils.StringOptions(
options=["zeros", "reflect", "replicate", "circular"],
default="zeros",
description="If padding_mode is not already specified in conv_layers, specifies the default padding_mode of "
"the 2D convolutional kernel that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["padding_mode"],
)
padding: int | tuple[int] | str | None = schema_utils.OneOfOptionsField(
default="valid",
allow_none=True,
description="An int, pair of ints (h, w), or one of ['valid', 'same'] specifying the padding used for "
"convolution kernels.",
field_options=[
schema_utils.NonNegativeInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
schema_utils.StringOptions(options=["valid", "same"], default="valid", allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["padding"],
)
dilation: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=1,
allow_none=True,
description="An int or pair of ints specifying the dilation rate to use for dilated convolution. If dilation "
"is not already specified in conv_layers, specifies the default dilation of the 2D convolutional "
"kernel that will be used for each layer.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["dilation"],
)
groups: int | None = schema_utils.PositiveInteger(
default=1,
description="Groups controls the connectivity between convolution inputs and outputs. When groups = 1, each "
"output channel depends on every input channel. When groups > 1, input and output channels are "
"divided into `groups` separate groups, where each output channel depends only on the inputs in its "
"respective input channel group. in_channels and out_channels must both be divisible by groups.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["groups"],
)
pool_function: str | None = schema_utils.StringOptions(
["max", "average", "avg", "mean"],
default="max",
description="Pooling function to use.",
parameter_metadata=ENCODER_METADATA["conv_params"]["pool_function"],
)
pool_kernel_size: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=2,
allow_none=True,
description="An integer or pair of integers specifying the pooling size. If pool_kernel_size is not specified "
"in conv_layers this is the default value that will be used for each layer.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["pool_kernel_size"],
)
pool_stride: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
description="An integer or pair of integers specifying the pooling stride, which is the factor by which the "
"pooling layer downsamples the feature map. Defaults to pool_kernel_size.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["pool_stride"],
)
pool_padding: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=0,
allow_none=True,
description="An integer or pair of ints specifying pooling padding (h, w).",
field_options=[
schema_utils.NonNegativeInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["pool_padding"],
)
pool_dilation: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=1,
allow_none=True,
description="An integer or pair of ints specifying pooling dilation rate (h, w).",
field_options=[
schema_utils.PositiveInteger(default=None, allow_none=True, description=""),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["pool_dilation"],
)
output_size: int | None = schema_utils.PositiveInteger(
default=128,
description="If output_size is not already specified in fc_layers this is the default output_size that will "
"be used for each layer. It indicates the size of the output of a fully connected layer. ",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["output_size"],
)
conv_use_bias: bool | None = schema_utils.Boolean(
default=True,
description="If bias is not already specified in conv_layers, specifies if the 2D convolutional kernel should "
"have a bias term.",
)
conv_norm: str | None = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="If a norm is not already specified in conv_layers this is the default norm that will be used for "
"each layer. It indicates the normalization applied to the activations and can be null, "
"batch or layer.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["conv_norm"],
)
conv_norm_params: dict[str, Any] | None = schema_utils.Dict(
default=None,
description="Parameters used if conv_norm is either batch or layer. ",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["conv_norm_params"],
)
num_conv_layers: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of convolutional layers to use in the encoder. ",
parameter_metadata=ENCODER_METADATA["conv_params"]["num_conv_layers"],
)
conv_layers: list[dict] | None = schema_utils.DictList(
default=None,
description="List of convolutional layers to use in the encoder. ",
parameter_metadata=ENCODER_METADATA["conv_params"]["conv_layers"],
)
fc_dropout: float | None = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout rate",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_dropout"],
)
fc_activation: str | None = schema_utils.ActivationOptions(
description="If an activation is not already specified in fc_layers this is the default activation that will "
"be used for each layer. It indicates the activation function applied to the output.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_activation"],
)
fc_use_bias: bool | None = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_use_bias"],
)
fc_bias_initializer: str | None = schema_utils.StringOptions(
sorted(list(initializer_registry.keys())),
default="zeros",
description="Initializer for the bias vector.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_bias_initializer"],
)
fc_weights_initializer: str | None = schema_utils.StringOptions(
sorted(list(initializer_registry.keys())),
default="xavier_uniform",
description="Initializer for the weights matrix.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_weights_initializer"],
)
fc_norm: str | None = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="If a norm is not already specified in fc_layers this is the default norm that will be used for "
"each layer. It indicates the norm of the output and can be null, batch or layer.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_norm"],
)
fc_norm_params: dict[str, Any] | None = schema_utils.Dict(
default=None,
description="Parameters used if norm is either batch or layer. For information on parameters used with batch "
"see Torch's documentation on batch normalization or for layer see Torch's documentation on layer "
"normalization.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_norm_params"],
)
num_fc_layers: int | None = schema_utils.PositiveInteger(
default=1,
description="The number of stacked fully connected layers.",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["num_fc_layers"],
)
fc_layers: list[dict] | None = schema_utils.DictList(
default=None,
description="A list of dictionaries containing the parameters of all the fully connected layers. The length "
"of the list determines the number of stacked fully connected layers and the content of each "
"dictionary determines the parameters for a specific layer. The available parameters for each "
"layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and "
"weights_initializer. If any of those values is missing from the dictionary, the default one "
"specified as a parameter of the encoder will be used instead. ",
parameter_metadata=ENCODER_METADATA["Stacked2DCNN"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("_resnet_legacy", IMAGE)
@ludwig_dataclass
class ResNetConfig(ImageEncoderConfig):
@staticmethod
def module_name():
return "ResNet"
type: str = schema_utils.ProtectedString(
"_resnet_legacy",
description=ENCODER_METADATA["ResNet"]["type"].long_description,
)
dropout: float | None = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout rate",
parameter_metadata=ENCODER_METADATA["ResNet"]["dropout"],
)
activation: str | None = schema_utils.ActivationOptions(
description="If an activation is not already specified in fc_layers this is the default activation that will "
"be used for each layer. It indicates the activation function applied to the output.",
parameter_metadata=ENCODER_METADATA["ResNet"]["activation"],
)
height: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Height of the input image.",
parameter_metadata=ENCODER_METADATA["ResNet"]["height"],
)
width: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Width of the input image.",
parameter_metadata=ENCODER_METADATA["ResNet"]["width"],
)
resnet_size: int | None = schema_utils.PositiveInteger(
default=50,
description="The size of the ResNet model to use.",
parameter_metadata=ENCODER_METADATA["ResNet"]["resnet_size"],
)
num_channels: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of channels to use in the encoder. ",
parameter_metadata=ENCODER_METADATA["ResNet"]["num_channels"],
)
out_channels: int | None = schema_utils.NonNegativeInteger(
default=32,
description="Indicates the number of filters, and by consequence the output channels of the 2d convolution. "
"If out_channels is not already specified in conv_layers this is the default out_channels that "
"will be used for each layer. ",
parameter_metadata=ENCODER_METADATA["ResNet"]["out_channels"],
)
kernel_size: int | tuple[int] | None = schema_utils.OneOfOptionsField(
default=3,
allow_none=True,
description="An integer or pair of integers specifying the kernel size. A single integer specifies a square "
"kernel, while a pair of integers specifies the height and width of the kernel in that order (h, "
"w). If a kernel_size is not specified in conv_layers, this is the kernel_size that will be used for "
"each layer.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["ResNet"]["kernel_size"],
)
conv_stride: int | tuple[int] = schema_utils.OneOfOptionsField(
default=1,
allow_none=True,
description="An integer or pair of integers specifying the stride of the initial convolutional layer.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["ResNet"]["conv_stride"],
)
first_pool_kernel_size: int | tuple[int] = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
description="Pool size to be used for the first pooling layer. If none, the first pooling layer is skipped.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["ResNet"]["first_pool_kernel_size"],
)
first_pool_stride: int | tuple[int] = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
description="Stride for first pooling layer. If null, defaults to first_pool_kernel_size.",
field_options=[
schema_utils.PositiveInteger(allow_none=True, description="", default=None),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["ResNet"]["first_pool_stride"],
)
batch_norm_momentum: float = schema_utils.NonNegativeFloat(
default=0.9,
description="Momentum of the batch norm running statistics.",
parameter_metadata=ENCODER_METADATA["ResNet"]["batch_norm_momentum"],
)
batch_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=0.001,
description="Epsilon of the batch norm.",
parameter_metadata=ENCODER_METADATA["ResNet"]["batch_norm_epsilon"],
)
use_bias: bool | None = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["ResNet"]["use_bias"],
)
bias_initializer: str | None = schema_utils.StringOptions(
sorted(list(initializer_registry.keys())),
default="zeros",
description="Initializer for the bias vector.",
parameter_metadata=ENCODER_METADATA["ResNet"]["bias_initializer"],
)
weights_initializer: str | None = schema_utils.StringOptions(
sorted(list(initializer_registry.keys())),
default="xavier_uniform",
description="Initializer for the weights matrix.",
parameter_metadata=ENCODER_METADATA["ResNet"]["weights_initializer"],
)
output_size: int | None = schema_utils.PositiveInteger(
default=128,
description="If output_size is not already specified in fc_layers this is the default output_size that will "
"be used for each layer. It indicates the size of the output of a fully connected layer. ",
parameter_metadata=ENCODER_METADATA["ResNet"]["output_size"],
)
norm: str | None = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="If a norm is not already specified in fc_layers this is the default norm that will be used for "
"each layer. It indicates the norm of the output and can be null, batch or layer.",
parameter_metadata=ENCODER_METADATA["ResNet"]["norm"],
)
norm_params: dict[str, Any] | None = schema_utils.Dict(
default=None,
description="Parameters used if norm is either batch or layer. For information on parameters used with batch "
"see Torch's documentation on batch normalization or for layer see Torch's documentation on layer "
"normalization.",
parameter_metadata=ENCODER_METADATA["ResNet"]["norm_params"],
)
num_fc_layers: int | None = schema_utils.PositiveInteger(
default=1,
description="The number of stacked fully connected layers.",
parameter_metadata=ENCODER_METADATA["ResNet"]["num_fc_layers"],
)
fc_layers: list[dict] | None = schema_utils.DictList(
default=None,
description="A list of dictionaries containing the parameters of all the fully connected layers. The length "
"of the list determines the number of stacked fully connected layers and the content of each "
"dictionary determines the parameters for a specific layer. The available parameters for each "
"layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and "
"weights_initializer. If any of those values is missing from the dictionary, the default one "
"specified as a parameter of the encoder will be used instead. ",
parameter_metadata=ENCODER_METADATA["ResNet"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("mlp_mixer", IMAGE)
@ludwig_dataclass
class MLPMixerConfig(ImageEncoderConfig):
@staticmethod
def module_name():
return "MLPMixer"
type: str = schema_utils.ProtectedString(
"mlp_mixer",
description=ENCODER_METADATA["MLPMixer"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout rate.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["dropout"],
)
height: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Height of the input image.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["height"],
)
width: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Width of the input image.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["width"],
)
num_channels: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of channels to use in the encoder. ",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["num_channels"],
)
patch_size: int = schema_utils.PositiveInteger(
default=16,
description="The image patch size. Each patch is patch_size² pixels. Must evenly divide the image width and "
"height.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["patch_size"],
)
embed_size: int = schema_utils.PositiveInteger(
default=512,
description="The patch embedding size, the output size of the mixer if avg_pool is true.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["embed_size"],
)
token_size: int = schema_utils.PositiveInteger(
default=2048,
description="The per-patch embedding size.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["token_size"],
)
channel_dim: int = schema_utils.PositiveInteger(
default=256,
description="Number of channels in hidden layer.",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["channel_dim"],
)
num_layers: int = schema_utils.PositiveInteger(
default=8,
description="The depth of the network (the number of Mixer blocks).",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["num_layers"],
)
avg_pool: bool = schema_utils.Boolean(
default=True,
description="If true, pools output over patch dimension, outputs a vector of shape (embed_size). If false, "
"the output tensor is of shape (n_patches, embed_size), where n_patches is img_height x img_width "
"/ patch_size².",
parameter_metadata=ENCODER_METADATA["MLPMixer"]["avg_pool"],
)
@DeveloperAPI
@register_encoder_config("_vit_legacy", IMAGE)
@ludwig_dataclass
class ViTConfig(ImageEncoderConfig):
@staticmethod
def module_name():
return "ViT"
type: str = schema_utils.ProtectedString(
"_vit_legacy",
description=ENCODER_METADATA["ViT"]["type"].long_description,
)
height: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Height of the input image.",
parameter_metadata=ENCODER_METADATA["ViT"]["height"],
)
width: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Width of the input image.",
parameter_metadata=ENCODER_METADATA["ViT"]["width"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ViT"]["num_hidden_layers"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooling layer.",
parameter_metadata=ENCODER_METADATA["ViT"]["hidden_size"],
)
hidden_act: str = schema_utils.StringOptions(
["relu", "gelu", "selu", "gelu_new"],
default="gelu",
description="Hidden layer activation, one of gelu, relu, selu or gelu_new.",
parameter_metadata=ENCODER_METADATA["ViT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout rate for all fully connected layers in the embeddings, encoder, and pooling.",
parameter_metadata=ENCODER_METADATA["ViT"]["hidden_dropout_prob"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads in each attention layer.",
parameter_metadata=ENCODER_METADATA["ViT"]["num_attention_heads"],
)
attention_probs_dropout_prob: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout rate for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["ViT"]["attention_probs_dropout_prob"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the intermediate (i.e., feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ViT"]["intermediate_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["ViT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["ViT"]["layer_norm_eps"],
)
gradient_checkpointing: bool = schema_utils.Boolean(
default=False,
description="",
parameter_metadata=ENCODER_METADATA["ViT"]["gradient_checkpointing"],
)
patch_size: int = schema_utils.PositiveInteger(
default=16,
description="The image patch size. Each patch is patch_size² pixels. Must evenly divide the image width and "
"height.",
parameter_metadata=ENCODER_METADATA["ViT"]["patch_size"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["ViT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=True,
description="Whether the encoder is trainable.",
parameter_metadata=ENCODER_METADATA["ViT"]["trainable"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Use pre-trained model weights from Hugging Face.",
parameter_metadata=ENCODER_METADATA["ViT"]["use_pretrained"],
)
pretrained_model: str = schema_utils.String(
default="google/vit-base-patch16-224",
description="The name of the pre-trained model to use.",
parameter_metadata=ENCODER_METADATA["ViT"]["pretrained_model"],
)
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "ImagePreprocessingConfig"):
"""If the encoder is not trainable or uses pretrained weights, override the image width and height to be
compatible with the pretrained encoder image dimension requirements."""
if self.requires_equal_dimensions() and self.required_width() != self.required_height():
raise ValueError("Invalid definition. `required_width` and `required_height` are not equal")
preprocessing.requires_equal_dimensions = self.requires_equal_dimensions()
if not self.trainable or self.use_pretrained:
preprocessing.height = self.required_height()
preprocessing.width = self.required_width()
@classmethod
def requires_equal_dimensions(cls) -> bool:
return True
@classmethod
def required_width(cls) -> int | None:
return 224
@classmethod
def required_height(cls) -> int | None:
return 224
def is_pretrained(self) -> bool:
return self.use_pretrained
@DeveloperAPI
@register_encoder_config("unet", IMAGE)
@ludwig_dataclass
class UNetEncoderConfig(ImageEncoderConfig):
@staticmethod
def module_name():
return "UNetEncoder"
type: str = schema_utils.ProtectedString(
"unet",
description=ENCODER_METADATA["UNetEncoder"]["type"].long_description,
)
height: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Height of the input image.",
parameter_metadata=ENCODER_METADATA["UNetEncoder"]["height"],
)
width: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Width of the input image.",
parameter_metadata=ENCODER_METADATA["UNetEncoder"]["width"],
)
num_channels: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="Number of channels in the input image. ",
parameter_metadata=ENCODER_METADATA["UNetEncoder"]["num_channels"],
)
conv_norm: str | None = schema_utils.StringOptions(
["batch"],
default="batch",
allow_none=True,
description="This is the default norm that will be used for each double conv layer. It can be null or batch.",
parameter_metadata=ENCODER_METADATA["UNetEncoder"]["conv_norm"],
)
================================================
FILE: ludwig/schema/encoders/image/timm.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import IMAGE
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@ludwig_dataclass
class TimmBaseConfig(BaseEncoderConfig):
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Download model weights from pretrained model.",
parameter_metadata=ENCODER_METADATA["TimmEncoder"]["use_pretrained"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Whether to use weights saved in the Ludwig checkpoint instead of pretrained weights.",
parameter_metadata=ENCODER_METADATA["TimmEncoder"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=True,
description="Whether the encoder parameters are trainable.",
parameter_metadata=ENCODER_METADATA["TimmEncoder"]["trainable"],
)
def is_pretrained(self) -> bool:
return self.use_pretrained
@DeveloperAPI
@register_encoder_config("timm", IMAGE)
@ludwig_dataclass
class TimmEncoderConfig(TimmBaseConfig):
type: str = schema_utils.ProtectedString("timm", description="Type of encoder.")
model_name: str = schema_utils.String(
default="caformer_s18",
description=(
"Name of the timm model to use. Any model from the timm library is supported. "
"See https://huggingface.co/docs/timm for available models."
),
parameter_metadata=ENCODER_METADATA["TimmEncoder"]["model_name"],
)
# Convenience aliases for MetaFormer variants with curated model_name options
CAFORMER_MODELS = [
"caformer_s18",
"caformer_s36",
"caformer_m36",
"caformer_b36",
"caformer_s18.sail_in22k_ft_in1k",
"caformer_s18.sail_in22k_ft_in1k_384",
"caformer_s36.sail_in22k_ft_in1k",
"caformer_s36.sail_in22k_ft_in1k_384",
"caformer_m36.sail_in22k_ft_in1k",
"caformer_m36.sail_in22k_ft_in1k_384",
"caformer_b36.sail_in22k_ft_in1k",
"caformer_b36.sail_in22k_ft_in1k_384",
]
CONVFORMER_MODELS = [
"convformer_s18",
"convformer_s36",
"convformer_m36",
"convformer_b36",
"convformer_s18.sail_in22k_ft_in1k",
"convformer_s18.sail_in22k_ft_in1k_384",
"convformer_s36.sail_in22k_ft_in1k",
"convformer_s36.sail_in22k_ft_in1k_384",
"convformer_m36.sail_in22k_ft_in1k",
"convformer_m36.sail_in22k_ft_in1k_384",
"convformer_b36.sail_in22k_ft_in1k",
"convformer_b36.sail_in22k_ft_in1k_384",
]
POOLFORMER_MODELS = [
"poolformerv2_s12",
"poolformerv2_s24",
"poolformerv2_s36",
"poolformerv2_m36",
"poolformerv2_m48",
"poolformer_s12",
"poolformer_s24",
"poolformer_s36",
"poolformer_m36",
"poolformer_m48",
]
@DeveloperAPI
@register_encoder_config("caformer", IMAGE)
@ludwig_dataclass
class TimmCAFormerEncoderConfig(TimmBaseConfig):
type: str = schema_utils.ProtectedString("caformer", description="Type of encoder.")
model_name: str = schema_utils.StringOptions(
CAFORMER_MODELS,
default="caformer_s18",
allow_none=False,
description=(
"CAFormer model variant. Hybrid Conv+Attention MetaFormer achieving SOTA accuracy. "
"Variants with '.sail_in22k_ft_in1k' are pretrained on ImageNet-21K and finetuned on ImageNet-1K. "
"Variants with '_384' use 384x384 input resolution."
),
parameter_metadata=ENCODER_METADATA["TimmCAFormerEncoder"]["model_name"],
)
@DeveloperAPI
@register_encoder_config("convformer", IMAGE)
@ludwig_dataclass
class TimmConvFormerEncoderConfig(TimmBaseConfig):
type: str = schema_utils.ProtectedString("convformer", description="Type of encoder.")
model_name: str = schema_utils.StringOptions(
CONVFORMER_MODELS,
default="convformer_s18",
allow_none=False,
description=(
"ConvFormer model variant. Pure CNN MetaFormer that outperforms ConvNeXt. "
"Variants with '.sail_in22k_ft_in1k' are pretrained on ImageNet-21K and finetuned on ImageNet-1K."
),
parameter_metadata=ENCODER_METADATA["TimmConvFormerEncoder"]["model_name"],
)
@DeveloperAPI
@register_encoder_config("poolformer", IMAGE)
@ludwig_dataclass
class TimmPoolFormerEncoderConfig(TimmBaseConfig):
type: str = schema_utils.ProtectedString("poolformer", description="Type of encoder.")
model_name: str = schema_utils.StringOptions(
POOLFORMER_MODELS,
default="poolformerv2_s12",
allow_none=False,
description=(
"PoolFormer model variant. MetaFormer using simple average pooling as token mixer. "
"V2 variants use StarReLU activation and improved training recipe."
),
parameter_metadata=ENCODER_METADATA["TimmPoolFormerEncoder"]["model_name"],
)
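# Illustrative sketch (not part of Ludwig): the model_name descriptions above encode two facts in the
# variant name, the `.sail_in22k_ft_in1k` pretraining tag and the `_384` resolution suffix. The helper
# name `describe_metaformer_variant` and the keys it returns are hypothetical and only show how those
# naming conventions could be read back out of a model name.

```python
def describe_metaformer_variant(model_name: str) -> dict:
    """Split a curated MetaFormer model name into its architecture and tag flags.

    Hypothetical helper for illustration; Ludwig does not ship this function.
    """
    # Everything before the first "." is the architecture, the rest is the pretraining tag.
    architecture, _, tag = model_name.partition(".")
    return {
        "architecture": architecture,
        # Tagged variants are pretrained on ImageNet-21K and finetuned on ImageNet-1K.
        "in22k_pretrained": tag.startswith("sail_in22k_ft_in1k"),
        # A trailing "_384" on the tag marks the 384x384 input resolution variants.
        "input_384": tag.endswith("_384"),
    }


if __name__ == "__main__":
    for name in ("caformer_s18", "convformer_m36.sail_in22k_ft_in1k_384"):
        print(name, describe_metaformer_variant(name))
```

# Untagged names such as "poolformerv2_s12" yield False for both flags, since their tag is empty.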
================================================
FILE: ludwig/schema/encoders/image/torchvision.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import IMAGE
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@ludwig_dataclass
class TVBaseEncoderConfig(BaseEncoderConfig):
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Download model weights from pre-trained model.",
parameter_metadata=ENCODER_METADATA["TVBaseEncoder"]["use_pretrained"],
)
model_cache_dir: str | None = schema_utils.String(
default=None,
allow_none=True,
description="Directory path to cache pretrained model weights.",
parameter_metadata=ENCODER_METADATA["TVBaseEncoder"]["model_cache_dir"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Whether to save the weights in the checkpoint.",
parameter_metadata=ENCODER_METADATA["TVBaseEncoder"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=True,
description="Whether the encoder parameters are trainable.",
parameter_metadata=ENCODER_METADATA["TVBaseEncoder"]["trainable"],
)
def is_pretrained(self) -> bool:
return self.use_pretrained
@DeveloperAPI
@register_encoder_config("alexnet", IMAGE)
@ludwig_dataclass
class TVAlexNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("alexnet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["base"],
default="base",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVAlexNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("convnext", IMAGE)
@ludwig_dataclass
class TVConvNeXtEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("convnext", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["tiny", "small", "base", "large"],
default="base",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVConvNeXtEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("densenet", IMAGE)
@ludwig_dataclass
class TVDenseNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("densenet", description="Type of encoder.")
model_variant: int = schema_utils.IntegerOptions(
[121, 161, 169, 201],
default=121,
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVDenseNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("efficientnet", IMAGE)
@ludwig_dataclass
class TVEfficientNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("efficientnet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"b0",
"b1",
"b2",
"b3",
"b4",
"b5",
"b6",
"b7",
"v2_s",
"v2_m",
"v2_l",
],
default="b0",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVEfficientNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("googlenet", IMAGE)
@ludwig_dataclass
class TVGoogLeNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("googlenet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["base"],
default="base",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVGoogLeNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("inceptionv3", IMAGE)
@ludwig_dataclass
class TVInceptionV3EncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("inceptionv3", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["base"],
default="base",
allow_none=False,
description="Pretrained model variant to use.",
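# TODO: this reuses the GoogLeNet metadata entry; point it at an InceptionV3 entry if one is defined.
parameter_metadata=ENCODER_METADATA["TVGoogLeNetEncoder"]["model_variant"],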
)
@DeveloperAPI
@register_encoder_config("maxvit", IMAGE)
@ludwig_dataclass
class TVMaxVitEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("maxvit", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["t"],
default="t",
allow_none=False,
description="Pretrained model variant to use.",
# TODO: this reuses the MNASNet metadata entry; point it at a MaxVit entry if one is defined.
parameter_metadata=ENCODER_METADATA["TVMNASNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("mnasnet", IMAGE)
@ludwig_dataclass
class TVMNASNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("mnasnet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["0_5", "0_75", "1_0", "1_3"],
default="0_5",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVMNASNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("mobilenetv2", IMAGE)
@ludwig_dataclass
class TVMobileNetV2EncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("mobilenetv2", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["base"],
default="base",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVMobileNetV2Encoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("mobilenetv3", IMAGE)
@ludwig_dataclass
class TVMobileNetV3EncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("mobilenetv3", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"small",
"large",
],
default="small",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVMobileNetV3Encoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("regnet", IMAGE)
@ludwig_dataclass
class TVRegNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("regnet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"x_1_6gf",
"x_16gf",
"x_32gf",
"x_3_2gf",
"x_400mf",
"x_800mf",
"x_8gf",
"y_128gf",
"y_16gf",
"y_1_6gf",
"y_32gf",
"y_3_2gf",
"y_400mf",
"y_800mf",
"y_8gf",
],
default="x_1_6gf",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVRegNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("resnet", IMAGE)
@ludwig_dataclass
class TVResNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("resnet", description="Type of encoder.")
model_variant: int = schema_utils.IntegerOptions(
[18, 34, 50, 101, 152],
default=50,
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVResNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("resnext", IMAGE)
@ludwig_dataclass
class TVResNeXtEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("resnext", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
["50_32x4d", "101_32x8d", "101_64x4d"],
default="50_32x4d",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVResNeXtEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("shufflenet_v2", IMAGE)
@ludwig_dataclass
class TVShuffleNetV2EncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("shufflenet_v2", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"x0_5",
"x1_0",
"x1_5",
"x2_0",
],
default="x0_5",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVShuffleNetV2Encoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("squeezenet", IMAGE)
@ludwig_dataclass
class TVSqueezeNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("squeezenet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"1_0",
"1_1",
],
default="1_0",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVSqueezeNetEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("swin_transformer", IMAGE)
@ludwig_dataclass
class TVSwinTransformerEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("swin_transformer", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"t",
"s",
"b",
],
default="t",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVSwinTransformerEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("vit", IMAGE)
@ludwig_dataclass
class TVViTEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("vit", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"b_16",
"b_32",
"l_16",
"l_32",
"h_14",
],
default="b_16",
allow_none=False,
description="Pretrained model variant to use.",
parameter_metadata=ENCODER_METADATA["TVViTEncoder"]["model_variant"],
)
@DeveloperAPI
@register_encoder_config("vgg", IMAGE)
@ludwig_dataclass
class TVVGGEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("vgg", description="Type of encoder.")
model_variant: int | str = schema_utils.OneOfOptionsField(
default=11,
description="Pretrained model variant to use.",
field_options=[
schema_utils.IntegerOptions(
[
11,
13,
16,
19,
],
default=11,
allow_none=False,
),
schema_utils.StringOptions(
[
"11_bn",
"13_bn",
"16_bn",
"19_bn",
],
default="11_bn",
allow_none=False,
),
],
allow_none=False,
parameter_metadata=ENCODER_METADATA["TVVGGEncoder"]["model_variant"],
)
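# Illustrative sketch (not part of Ludwig): the VGG `model_variant` field above combines an integer
# option list and a string option list. The standalone check below mirrors those two lists; the name
# `is_valid_vgg_variant` is hypothetical, and the real `OneOfOptionsField` validation may behave
# differently (for example in how it treats the string "11").

```python
# Option lists copied from the two field_options of the VGG model_variant field.
VGG_INT_VARIANTS = (11, 13, 16, 19)
VGG_STR_VARIANTS = ("11_bn", "13_bn", "16_bn", "19_bn")


def is_valid_vgg_variant(value) -> bool:
    """Return True if `value` matches one of the two option lists above."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value in VGG_INT_VARIANTS
    if isinstance(value, str):
        return value in VGG_STR_VARIANTS
    return False


if __name__ == "__main__":
    for candidate in (16, "16_bn", "16", 17, None):
        print(repr(candidate), is_valid_vgg_variant(candidate))
```

# Plain depths are given as integers and batch-norm variants as strings with a `_bn` suffix.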
@DeveloperAPI
@register_encoder_config("wide_resnet", IMAGE)
@ludwig_dataclass
class TVWideResNetEncoderConfig(TVBaseEncoderConfig):
type: str = schema_utils.ProtectedString("wide_resnet", description="Type of encoder.")
model_variant: str = schema_utils.StringOptions(
[
"50_2",
"101_2",
],
default="50_2",
allow_none=False,
description="Pretrained model variant to use.",
# TODO: this reuses the ViT metadata entry; point it at a Wide ResNet entry if one is defined.
parameter_metadata=ENCODER_METADATA["TVViTEncoder"]["model_variant"],
)
================================================
FILE: ludwig/schema/encoders/sequence_encoders.py
================================================
from dataclasses import Field
from typing import TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUDIO, SEQUENCE, TEXT, TIMESERIES
from ludwig.schema import common_fields
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
if TYPE_CHECKING:
from ludwig.schema.features.preprocessing.sequence import SequencePreprocessingConfig
CONV_LAYERS_DESCRIPTION = """
A list of dictionaries containing the parameters of all the convolutional layers.
The length of the list determines the number of stacked convolutional layers and the content of each dictionary
determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`,
`norm`, `norm_params`, `num_filters`, `filter_size`, `strides`, `padding`, `dilation_rate`, `use_bias`, `pool_function`,
`pool_padding`, `pool_size`, `pool_strides`, `bias_initializer`, `weights_initializer`. If any of those values is
missing from the dictionary, the default one specified as a parameter of the encoder will be used instead. If both
`conv_layers` and `num_conv_layers` are `null`, a default list will be assigned to `conv_layers` with the value
`[{filter_size: 7, pool_size: 3}, {filter_size: 7, pool_size: 3}, {filter_size: 3, pool_size: null},
{filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: null}, {filter_size: 3, pool_size: 3}]`.
"""
NUM_CONV_LAYERS_DESCRIPTION = "The number of stacked convolutional layers when `conv_layers` is `null`."
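# Illustrative sketch (not part of Ludwig): the two descriptions above explain how `conv_layers` and
# `num_conv_layers` interact and how per-layer dictionaries fall back to encoder-level defaults. The
# function `resolve_conv_layers` and the constant `DEFAULT_CONV_LAYERS` are hypothetical names; the
# actual resolution happens in the encoder modules and may differ in detail.

```python
# Default list quoted in CONV_LAYERS_DESCRIPTION, used when both settings are null.
DEFAULT_CONV_LAYERS = [
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 7, "pool_size": 3},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": None},
    {"filter_size": 3, "pool_size": 3},
]


def resolve_conv_layers(conv_layers, num_conv_layers, encoder_defaults):
    """Return one fully specified parameter dict per convolutional layer."""
    if conv_layers is None and num_conv_layers is None:
        # Neither setting given: fall back to the documented default stack.
        conv_layers = DEFAULT_CONV_LAYERS
    elif conv_layers is None:
        # Only a count given: every layer uses the encoder-level defaults.
        conv_layers = [{} for _ in range(num_conv_layers)]
    # Keys present in a layer dict override the encoder-level defaults.
    return [{**encoder_defaults, **layer} for layer in conv_layers]


if __name__ == "__main__":
    defaults = {"num_filters": 256, "filter_size": 3, "pool_size": None}
    for layer in resolve_conv_layers([{"filter_size": 5}, {}], None, defaults):
        print(layer)
```

# In this sketch an explicit `conv_layers` list takes precedence over `num_conv_layers`.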
def NumFiltersField(default: int = 256) -> Field:
return schema_utils.PositiveInteger(
default=default,
description="Number of filters, and consequently the number of output channels, of the 1D convolution.",
parameter_metadata=ENCODER_METADATA["conv_params"]["num_filters"],
)
def FilterSizeField(default: int = 3) -> Field:
return schema_utils.PositiveInteger(
default=default,
description="Size of the 1D convolutional filter, i.e. how wide the filter is.",
parameter_metadata=ENCODER_METADATA["conv_params"]["filter_size"],
)
def PoolFunctionField(default: str = "max") -> Field:
return schema_utils.ReductionOptions(
default=default,
description=(
"Pooling function to use. `max` will select the maximum value. Any of `average`, `avg`, or "
"`mean` will compute the mean value."
),
parameter_metadata=ENCODER_METADATA["conv_params"]["pool_function"],
)
def PoolSizeField(default: int | None = None) -> Field:
return schema_utils.PositiveInteger(
default=default,
allow_none=True,
description=(
"The default pool_size used for each layer that does not specify its own pool_size in conv_layers. "
"It indicates the size of the max pooling that will be performed along the `s` sequence dimension "
"after the convolution operation."
),
parameter_metadata=ENCODER_METADATA["conv_params"]["pool_size"],
)
@DeveloperAPI
@ludwig_dataclass
class SequenceEncoderConfig(BaseEncoderConfig):
"""Base class for sequence encoders."""
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "SequencePreprocessingConfig"):
if isinstance(preprocessing, dict):
preprocessing["cache_encoder_embeddings"] = False
else:
preprocessing.cache_encoder_embeddings = False
@DeveloperAPI
@register_encoder_config("passthrough", [TIMESERIES])
@ludwig_dataclass
class SequencePassthroughConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "SequencePassthrough"
type: str = schema_utils.ProtectedString(
"passthrough",
description=ENCODER_METADATA["SequencePassthrough"]["type"].long_description,
)
max_sequence_length: int = common_fields.MaxSequenceLengthField()
encoding_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The size of the encoding vector, or None if sequence elements are scalars.",
parameter_metadata=ENCODER_METADATA["SequencePassthrough"]["encoding_size"],
)
reduce_output: str = common_fields.ReduceOutputField(default=None)
@DeveloperAPI
@register_encoder_config("embed", [SEQUENCE, TEXT])
@ludwig_dataclass
class SequenceEmbedConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "SequenceEmbed"
type: str = schema_utils.ProtectedString(
"embed",
description=ENCODER_METADATA["SequenceEmbed"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(description="Dropout rate applied to the embedding.")
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
weights_initializer: str = common_fields.WeightsInitializerField(default="uniform")
reduce_output: str = common_fields.ReduceOutputField()
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
@DeveloperAPI
@register_encoder_config("parallel_cnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
@ludwig_dataclass
class ParallelCNNConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "ParallelCNN"
type: str = schema_utils.ProtectedString(
"parallel_cnn",
description=ENCODER_METADATA["ParallelCNN"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(description="Dropout rate applied to the embedding.")
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer."
)
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["use_bias"],
)
bias_initializer: str = common_fields.BiasInitializerField()
weights_initializer: str = common_fields.WeightsInitializerField()
should_embed: bool = schema_utils.Boolean(
default=True,
description="Whether to embed the input sequence.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["should_embed"],
)
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
reduce_output: str = common_fields.ReduceOutputField()
num_conv_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=NUM_CONV_LAYERS_DESCRIPTION,
parameter_metadata=ENCODER_METADATA["conv_params"]["num_conv_layers"],
)
conv_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for conv_layers
default=None,
description=CONV_LAYERS_DESCRIPTION,
parameter_metadata=ENCODER_METADATA["conv_params"]["conv_layers"],
)
num_filters: int = NumFiltersField()
filter_size: int = FilterSizeField()
pool_function: str = PoolFunctionField()
pool_size: int = PoolSizeField()
output_size: int = schema_utils.PositiveInteger(
default=256,
description="The default output_size that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["norm_params"],
)
num_fc_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of parallel fully connected layers to use.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["ParallelCNN"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("stacked_cnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
@ludwig_dataclass
class StackedCNNConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "StackedCNN"
type: str = schema_utils.ProtectedString(
"stacked_cnn",
description=ENCODER_METADATA["StackedCNN"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(description="Dropout rate applied to the embedding.")
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer."
)
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
num_conv_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=NUM_CONV_LAYERS_DESCRIPTION,
parameter_metadata=ENCODER_METADATA["conv_params"]["num_conv_layers"],
)
conv_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for conv_layers
default=None,
description=CONV_LAYERS_DESCRIPTION,
parameter_metadata=ENCODER_METADATA["conv_params"]["conv_layers"],
)
num_filters: int = NumFiltersField()
filter_size: int = FilterSizeField()
pool_function: str = PoolFunctionField()
pool_size: int = PoolSizeField()
strides: int = schema_utils.PositiveInteger(
default=1,
description="Stride length of the convolution.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["strides"],
)
padding: str = schema_utils.StringOptions(
["valid", "same"],
default="same",
description="Padding to use for the convolution.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["padding"],
)
dilation_rate: int = schema_utils.PositiveInteger(
default=1,
description="Dilation rate to use for dilated convolution.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["dilation_rate"],
)
pool_strides: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Stride of the pooling operation, i.e. the factor by which the sequence is scaled down.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["pool_strides"],
)
pool_padding: str = schema_utils.StringOptions(
["valid", "same"],
default="same",
description="Padding to use for the pooling operation.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["pool_padding"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["use_bias"],
)
bias_initializer: str = common_fields.BiasInitializerField()
weights_initializer: str = common_fields.WeightsInitializerField()
should_embed: bool = schema_utils.Boolean(
default=True,
description="Whether to embed the input sequence.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["should_embed"],
)
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
reduce_output: str = common_fields.ReduceOutputField()
output_size: int = schema_utils.PositiveInteger(
default=256,
description="The default output_size that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["norm_params"],
)
num_fc_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of parallel fully connected layers to use.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["StackedCNN"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("stacked_parallel_cnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
@ludwig_dataclass
class StackedParallelCNNConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "StackedParallelCNN"
type: str = schema_utils.ProtectedString(
"stacked_parallel_cnn",
description=ENCODER_METADATA["StackedParallelCNN"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(description="Dropout rate applied to the embedding.")
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer."
)
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
num_stacked_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="If stacked_layers is null, this is the number of elements in the stack of parallel convolutional "
"layers.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["num_stacked_layers"],
)
stacked_layers: list[dict] = schema_utils.DictList(
default=None,
description="A nested list of lists of dictionaries containing the parameters of the stack of parallel "
"convolutional layers. The length of the list determines the number of stacked parallel "
"convolutional layers, the length of the sub-lists determines the number of parallel conv layers and "
"the content of each dictionary determines the parameters for a specific layer.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["stacked_layers"],
)
num_filters: int = NumFiltersField()
filter_size: int = FilterSizeField()
pool_function: str = PoolFunctionField()
pool_size: int = PoolSizeField()
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["use_bias"],
)
bias_initializer: str = common_fields.BiasInitializerField()
weights_initializer: str = common_fields.WeightsInitializerField()
should_embed: bool = schema_utils.Boolean(
default=True,
description="If true, the input sequence is expected to be made of integers and will be mapped into embeddings.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["should_embed"],
)
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
reduce_output: str = common_fields.ReduceOutputField()
output_size: int = schema_utils.PositiveInteger(
default=256,
description="The default output_size that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["norm_params"],
)
num_fc_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of parallel fully connected layers to use.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["StackedParallelCNN"]["fc_layers"],
)
@DeveloperAPI
@register_encoder_config("rnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
@ludwig_dataclass
class StackedRNNConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "StackedRNN"
type: str = schema_utils.ProtectedString(
"rnn",
description=ENCODER_METADATA["StackedRNN"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(description="Dropout rate.")
recurrent_dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout rate for the recurrent state.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["recurrent_dropout"],
)
activation: str = schema_utils.ActivationOptions(default="tanh", description="The default activation function.")
recurrent_activation: str = schema_utils.ActivationOptions(
default="sigmoid",
description="The activation function to use in the recurrent step.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["recurrent_activation"],
)
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
cell_type: str = schema_utils.StringOptions(
["rnn", "lstm", "gru"],
default="rnn",
description="The type of recurrent cell to use. Available values are: `rnn`, `lstm`, `gru`. For reference "
"about the differences between the cells please refer to "
"[torch.nn Recurrent Layers](https://pytorch.org/docs/stable/nn.html#recurrent-layers).",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["cell_type"],
)
num_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of stacked recurrent layers.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["num_layers"],
)
state_size: int = schema_utils.PositiveInteger(
default=256,
description="The size of the state of the RNN.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["state_size"],
)
bidirectional: bool = schema_utils.Boolean(
default=False,
description="If true, two recurrent networks will perform encoding in the forward and backward direction and "
"their outputs will be concatenated.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["bidirectional"],
)
unit_forget_bias: bool = schema_utils.Boolean(
default=True,
description="If true, add 1 to the bias of the forget gate at initialization",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["unit_forget_bias"],
)
recurrent_initializer: str = schema_utils.InitializerOptions(
default="orthogonal",
description="The initializer for recurrent matrix weights",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["recurrent_initializer"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["use_bias"],
)
bias_initializer: str = common_fields.BiasInitializerField()
weights_initializer: str = common_fields.WeightsInitializerField()
should_embed: bool = schema_utils.Boolean(
default=True,
description="If True the input sequence is expected to be made of integers and will be mapped into embeddings",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["should_embed"],
)
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
reduce_output: str = common_fields.ReduceOutputField(default="last")
output_size: int = schema_utils.PositiveInteger(
default=256,
description="The default output_size that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedRNN"]["output_size"],
)
norm: str = common_fields.NormField(description="The default norm that will be used for each layer.")
norm_params: dict = common_fields.NormParamsField()
num_fc_layers: int = common_fields.NumFCLayersField(description="Number of parallel fully connected layers to use.")
fc_activation: str = schema_utils.ActivationOptions()
fc_dropout: float = common_fields.DropoutField()
fc_layers: list[dict] = common_fields.FCLayersField()
@DeveloperAPI
@register_encoder_config("cnnrnn", [AUDIO, SEQUENCE, TEXT, TIMESERIES])
@ludwig_dataclass
class StackedCNNRNNConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "StackedCNNRNN"
type: str = schema_utils.ProtectedString(
"cnnrnn",
description=ENCODER_METADATA["StackedCNNRNN"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(description="Dropout rate.")
recurrent_dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout rate for the recurrent state",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["recurrent_dropout"],
)
conv_dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout rate for the convolutional layers",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["conv_dropout"],
)
activation: str = schema_utils.ActivationOptions(
default="tanh", description="The default activation function to use."
)
recurrent_activation: str = schema_utils.ActivationOptions(
default="sigmoid",
description="The activation function to use in the recurrent step",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["recurrent_activation"],
)
conv_activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each convolutional layer.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["conv_activation"],
)
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
cell_type: str = schema_utils.StringOptions(
["rnn", "lstm", "gru"],
default="rnn",
description="The type of recurrent cell to use. Available values are: `rnn`, `lstm`, `gru`. For reference "
"about the differences between the cells please refer to "
"[torch.nn Recurrent Layers](https://pytorch.org/docs/stable/nn.html#recurrent-layers).",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["cell_type"],
)
num_conv_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=NUM_CONV_LAYERS_DESCRIPTION,
parameter_metadata=ENCODER_METADATA["conv_params"]["num_conv_layers"],
)
conv_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for conv_layers
default=None,
description=CONV_LAYERS_DESCRIPTION,
parameter_metadata=ENCODER_METADATA["conv_params"]["conv_layers"],
)
num_filters: int = NumFiltersField()
filter_size: int = FilterSizeField(default=5)
pool_function: str = PoolFunctionField()
pool_size: int = PoolSizeField(default=2)
strides: int = schema_utils.PositiveInteger(
default=1,
description="Stride length of the convolution.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["strides"],
)
padding: str = schema_utils.StringOptions(
["valid", "same"],
default="same",
description="Padding to use.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["padding"],
)
dilation_rate: int = schema_utils.PositiveInteger(
default=1,
description="Dilation rate to use for dilated convolution.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["dilation_rate"],
)
pool_strides: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Factor to scale down.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["pool_strides"],
)
pool_padding: str = schema_utils.StringOptions(
["valid", "same"],
default="same",
description="Padding to use.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["pool_padding"],
)
num_rec_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of stacked recurrent layers.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["num_rec_layers"],
)
state_size: int = schema_utils.PositiveInteger(
default=256,
description="The size of the state of the rnn.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["state_size"],
)
bidirectional: bool = schema_utils.Boolean(
default=False,
description="If true, two recurrent networks will perform encoding in the forward and backward direction and "
"their outputs will be concatenated.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["bidirectional"],
)
unit_forget_bias: bool = schema_utils.Boolean(
default=True,
description="If true, add 1 to the bias of the forget gate at initialization",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["unit_forget_bias"],
)
recurrent_initializer: str = schema_utils.InitializerOptions(
default="orthogonal",
description="The initializer for recurrent matrix weights",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["recurrent_initializer"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["use_bias"],
)
bias_initializer: str = common_fields.BiasInitializerField()
weights_initializer: str = common_fields.WeightsInitializerField()
should_embed: bool = schema_utils.Boolean(
default=True,
description="If True the input sequence is expected to be made of integers and will be mapped into embeddings",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["should_embed"],
)
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
reduce_output: str = common_fields.ReduceOutputField(default="last")
output_size: int = schema_utils.PositiveInteger(
default=256,
description="The default output_size that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedCNNRNN"]["output_size"],
)
norm: str = common_fields.NormField(description="The default norm that will be used for each layer.")
norm_params: dict = common_fields.NormParamsField()
num_fc_layers: int = common_fields.NumFCLayersField(description="Number of parallel fully connected layers to use.")
fc_activation: str = schema_utils.ActivationOptions()
fc_dropout: float = common_fields.DropoutField()
fc_layers: list[dict] = common_fields.FCLayersField()
@DeveloperAPI
@register_encoder_config("transformer", [SEQUENCE, TEXT, TIMESERIES])
@ludwig_dataclass
class StackedTransformerConfig(SequenceEncoderConfig):
@staticmethod
def module_name():
return "StackedTransformer"
type: str = schema_utils.ProtectedString(
"transformer",
description=ENCODER_METADATA["StackedTransformer"]["type"].long_description,
)
dropout: float = common_fields.DropoutField(default=0.1, description="The dropout rate for the transformer block.")
max_sequence_length: int = common_fields.MaxSequenceLengthField()
representation: str = common_fields.RepresentationField()
vocab: list = common_fields.VocabField()
num_layers: int = schema_utils.PositiveInteger(
default=1,
description="The number of transformer layers.",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["num_layers"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=256,
description="The size of the hidden representation within the transformer block. It is usually the same as "
"the embedding_size, but if the two values are different, a projection layer will be added before "
"the first transformer block.",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["hidden_size"],
)
num_heads: int = schema_utils.PositiveInteger(
default=8,
description="Number of attention heads in each transformer block.",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["num_heads"],
)
transformer_output_size: int = schema_utils.PositiveInteger(
default=256,
description="Size of the fully connected layer after self attention in the transformer block. This is usually "
"the same as hidden_size and embedding_size.",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["transformer_output_size"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether to use a bias vector.",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["use_bias"],
)
bias_initializer: str = common_fields.BiasInitializerField()
weights_initializer: str = common_fields.WeightsInitializerField()
should_embed: bool = schema_utils.Boolean(
default=True,
description="If True the input sequence is expected to be made of integers and will be mapped into embeddings",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["should_embed"],
)
embedding_size: int = common_fields.EmbeddingSizeField()
embeddings_on_cpu: bool = common_fields.EmbeddingsOnCPUField()
embeddings_trainable: bool = common_fields.EmbeddingsTrainableField()
pretrained_embeddings: str = common_fields.PretrainedEmbeddingsField()
reduce_output: str = common_fields.ReduceOutputField(default="last")
output_size: int = schema_utils.PositiveInteger(
default=256,
description="The default output_size that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["StackedTransformer"]["output_size"],
)
norm: str = common_fields.NormField(description="The default norm that will be used for each layer.")
norm_params: dict = common_fields.NormParamsField()
num_fc_layers: int = common_fields.NumFCLayersField(description="Number of parallel fully connected layers to use.")
fc_activation: str = schema_utils.ActivationOptions()
fc_dropout: float = common_fields.DropoutField()
fc_layers: list[dict] = common_fields.FCLayersField()
================================================
FILE: ludwig/schema/encoders/set_encoders.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import SET
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_encoder_config("embed", SET)
@ludwig_dataclass
class SetSparseEncoderConfig(BaseEncoderConfig):
@staticmethod
def module_name():
return "SetSparseEncoder"
type: str = schema_utils.ProtectedString(
"embed",
description=ENCODER_METADATA["SetSparseEncoder"]["type"].long_description,
)
dropout: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Dropout probability for the embedding.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["dropout"],
)
activation: str = schema_utils.ActivationOptions(
description="The default activation function that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["activation"],
)
representation: str = schema_utils.StringOptions(
["dense", "sparse"],
default="dense",
description="The representation of the embedding. Either dense or sparse.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["representation"],
)
vocab: list[str] = schema_utils.List(
default=None,
description="Vocabulary of the encoder",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["vocab"],
)
use_bias: bool = schema_utils.Boolean(
default=True,
description="Whether the layer uses a bias vector.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["use_bias"],
)
bias_initializer: str = schema_utils.InitializerOptions(
default="zeros",
description="Initializer to use for the bias vector.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["bias_initializer"],
)
weights_initializer: str = schema_utils.InitializerOptions(
description="Initializer to use for the weights matrix.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["weights_initializer"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=50,
description="The maximum embedding size. The actual size will be min(vocabulary_size, embedding_size) for "
"dense representations and exactly vocabulary_size for the sparse encoding, where vocabulary_size "
"is the number of different strings appearing in the training set in the input column (plus 1 for "
"the unknown token placeholder).",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["embedding_size"],
)
embeddings_on_cpu: bool = schema_utils.Boolean(
default=False,
description="By default embedding matrices are stored on GPU memory if a GPU is used, as it allows for faster "
"access, but in some cases the embedding matrix may be too large. This parameter forces the "
"placement of the embedding matrix in regular memory and the CPU is used for embedding lookup, "
"slightly slowing down the process as a result of data transfer between CPU and GPU memory.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["embeddings_on_cpu"],
)
embeddings_trainable: bool = schema_utils.Boolean(
default=True,
description="If true, embeddings are trained during the training process; if false, embeddings are fixed. This "
"may be useful when loading pretrained embeddings to avoid finetuning them. This parameter "
"has effect only when representation is dense, as sparse one-hot encodings are not trainable.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["embeddings_trainable"],
)
pretrained_embeddings: str = schema_utils.String(
default=None,
allow_none=True,
description="By default dense embeddings are initialized randomly, but this parameter lets you specify a "
"path to a file containing embeddings in the GloVe format. When the file containing the "
"embeddings is loaded, only the embeddings with labels present in the vocabulary are kept, "
"the others are discarded. If the vocabulary contains strings that have no match in the "
"embeddings file, their embeddings are initialized with the average of all other embeddings plus "
"some random noise to make them different from each other. This parameter has effect only if "
"representation is dense.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["pretrained_embeddings"],
)
output_size: int = schema_utils.PositiveInteger(
default=10,
description="If output_size is not already specified in fc_layers this is the default output_size that will "
"be used for each layer. It indicates the size of the output of a fully connected layer.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["output_size"],
)
norm: str = schema_utils.StringOptions(
["batch", "layer"],
default=None,
allow_none=True,
description="The default norm that will be used for each layer.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["norm"],
)
norm_params: dict = schema_utils.Dict(
default=None,
description="Parameters used if norm is either `batch` or `layer`.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["norm_params"],
)
num_fc_layers: int = schema_utils.NonNegativeInteger(
default=0,
description="This is the number of stacked fully connected layers that the input to the feature passes "
"through. Their output is projected in the feature's output space.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["num_fc_layers"],
)
fc_layers: list[dict] = schema_utils.DictList( # TODO (Connor): Add nesting logic for fc_layers
default=None,
description="List of dictionaries containing the parameters for each fully connected layer.",
parameter_metadata=ENCODER_METADATA["SetSparseEncoder"]["fc_layers"],
)
================================================
FILE: ludwig/schema/encoders/text/__init__.py
================================================
================================================
FILE: ludwig/schema/encoders/text/encoders.py
================================================
from collections.abc import Callable
from typing import TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, TEXT
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.sequence_encoders import SequenceEncoderConfig
from ludwig.schema.encoders.text.hf_model_params import DebertaModelParams
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.llms.base_model import BaseModelDataclassField
from ludwig.schema.llms.model_parameters import ModelParametersConfig, ModelParametersConfigField
from ludwig.schema.llms.peft import AdapterDataclassField, BaseAdapterConfig
from ludwig.schema.llms.quantization import QuantizationConfig, QuantizationConfigField
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY, ParameterMetadata
from ludwig.schema.utils import ludwig_dataclass
if TYPE_CHECKING:
from ludwig.schema.features.preprocessing.text import TextPreprocessingConfig
class HFEncoderConfig(SequenceEncoderConfig):
trainable: bool
use_pretrained: bool
pretrained_model_name_or_path: str
reduce_output: str
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "TextPreprocessingConfig"):
model_name = self.pretrained_model_name_or_path
if model_name is None and self.use_pretrained:
# no default model name, so model name is required by the subclass
raise ValueError(
f"Missing required parameter for `{self.type}` encoder: `pretrained_model_name_or_path` when "
"`use_pretrained` is True."
)
preprocessing.tokenizer = "hf_tokenizer"
preprocessing.pretrained_model_name_or_path = model_name
if not self.can_cache_embeddings():
preprocessing.cache_encoder_embeddings = False
def is_pretrained(self) -> bool:
return self.use_pretrained
def can_cache_embeddings(self) -> bool:
"""Returns true if the encoder's output embeddings will not change during training."""
return not self.trainable and self.reduce_output != "attention"
@DeveloperAPI
@ludwig_dataclass
class HFEncoderImplConfig(HFEncoderConfig):
"""This dataclass configures the base HF encoder implementation."""
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["HFEncoder"]["use_pretrained"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["HFEncoder"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
)
# Internal params set based on preprocessing metadata
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="",
parameter_metadata=INTERNAL_ONLY,
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
description="",
parameter_metadata=INTERNAL_ONLY,
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description=(
"Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub."
),
parameter_metadata=INTERNAL_ONLY,
)
@DeveloperAPI
@register_encoder_config("albert", TEXT)
@ludwig_dataclass
class ALBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an ALBERT encoder."""
@staticmethod
def module_name():
return "ALBERT"
type: str = schema_utils.ProtectedString(
"albert",
description=ENCODER_METADATA["ALBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="albert-base-v2",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["ALBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30000,
description="Vocabulary size of the ALBERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["vocab_size"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=128,
description="Dimensionality of vocabulary embeddings.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["embedding_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["num_hidden_layers"],
)
num_hidden_groups: int = schema_utils.PositiveInteger(
default=1,
description="Number of groups for the hidden layers, parameters in the same group are shared.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["num_hidden_groups"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer "
"encoder.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["intermediate_size"],
)
inner_group_num: int = schema_utils.PositiveInteger(
default=1,
description="The number of inner repetitions of attention and ffn.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["inner_group_num"],
)
hidden_act: str = schema_utils.StringOptions(
["gelu", "relu", "silu", "gelu_new"],
default="gelu_new",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["ALBERT"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=2,
description="The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["layer_norm_eps"],
)
classifier_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for attached classifiers.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["classifier_dropout_prob"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="",
parameter_metadata=ENCODER_METADATA["ALBERT"]["position_embedding_type"],
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=2,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=3,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["pretrained_kwargs"],
)
# TODO: uncomment when sentencepiece doesn't cause segfaults: https://github.com/ludwig-ai/ludwig/issues/2983
@DeveloperAPI
# @register_encoder_config("mt5", TEXT)
@ludwig_dataclass
class MT5Config(HFEncoderConfig):
"""This dataclass configures the schema used for an MT5 encoder."""
@staticmethod
def module_name():
return "MT5"
type: str = schema_utils.ProtectedString(
"mt5",
description=ENCODER_METADATA["MT5"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["MT5"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["MT5"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="google/mt5-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["MT5"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["MT5"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["MT5"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["MT5"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["MT5"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=250112,
description="Vocabulary size of the T5 model. Defines the number of different tokens that can be represented "
"by the input_ids passed when calling T5Model or TFT5Model.",
parameter_metadata=ENCODER_METADATA["MT5"]["vocab_size"],
)
d_model: int = schema_utils.PositiveInteger(
default=512,
description="Size of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["MT5"]["d_model"],
)
d_kv: int = schema_utils.PositiveInteger(
default=64,
description="Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // "
"num_heads.",
parameter_metadata=ENCODER_METADATA["MT5"]["d_kv"],
)
d_ff: int = schema_utils.PositiveInteger(
default=1024,
description="Size of the intermediate feed forward layer in each T5Block.",
parameter_metadata=ENCODER_METADATA["MT5"]["d_ff"],
)
num_layers: int = schema_utils.PositiveInteger(
default=8,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["MT5"]["num_layers"],
)
num_decoder_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not "
"set.",
parameter_metadata=ENCODER_METADATA["MT5"]["num_decoder_layers"],
)
num_heads: int = schema_utils.PositiveInteger(
default=6,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["MT5"]["num_heads"],
)
relative_attention_num_buckets: int = schema_utils.PositiveInteger(
default=32,
description="The number of buckets to use for each attention layer.",
parameter_metadata=ENCODER_METADATA["MT5"]["relative_attention_num_buckets"],
)
dropout_rate: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The ratio for all dropout layers.",
parameter_metadata=ENCODER_METADATA["MT5"]["dropout_rate"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-06,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["MT5"]["layer_norm_epsilon"],
)
initializer_factor: float = schema_utils.NonNegativeFloat(
default=1.0,
description="A factor for initializing all weight matrices (should be kept to 1, used internally for "
"initialization testing).",
parameter_metadata=ENCODER_METADATA["MT5"]["initializer_factor"],
)
feed_forward_proj: str = schema_utils.StringOptions(
["relu", "gated-gelu"],
default="gated-gelu",
description="Type of feed forward layer to be used.",
parameter_metadata=ENCODER_METADATA["MT5"]["feed_forward_proj"],
)
is_encoder_decoder: bool = schema_utils.Boolean(
default=True,
description="Whether the model is used as an encoder/decoder or not.",
parameter_metadata=ENCODER_METADATA["MT5"]["is_encoder_decoder"],
)
use_cache: bool = schema_utils.Boolean(
default=True,
description="Whether or not the model should return the last key/values attentions (not used by all models).",
parameter_metadata=ENCODER_METADATA["MT5"]["use_cache"],
)
tokenizer_class: str = schema_utils.String(
default="T5Tokenizer",
description="The name of the associated tokenizer class to use.",
parameter_metadata=ENCODER_METADATA["MT5"]["tokenizer_class"],
)
tie_word_embeddings: bool = schema_utils.Boolean(
default=False,
description="Whether the model's input and output word embeddings should be tied.",
parameter_metadata=ENCODER_METADATA["MT5"]["tie_word_embeddings"],
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["MT5"]["pad_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=1,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["MT5"]["eos_token_id"],
)
decoder_start_token_id: int = schema_utils.Integer(
default=0,
description="If an encoder-decoder model starts decoding with a different token than _bos_, the id of that "
"token.",
parameter_metadata=ENCODER_METADATA["MT5"]["decoder_start_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["MT5"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("xlmroberta", TEXT)
@ludwig_dataclass
class XLMRoBERTaConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an XLMRoBERTa encoder."""
@staticmethod
def module_name():
return "XLMRoBERTa"
type: str = schema_utils.ProtectedString(
"xlmroberta",
description=ENCODER_METADATA["XLMRoBERTa"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="xlm-roberta-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Vocabulary size of the XLMRoBERTa model.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["vocab_size"],
)
pad_token_id: int = schema_utils.Integer(
default=1,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=0,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=2,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["eos_token_id"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=514,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=1,
description="The vocabulary size of the token_type_ids passed in.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["type_vocab_size"],
)
add_pooling_layer: bool = schema_utils.Boolean(
default=True,
description="Whether to add a pooling layer to the encoder.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["add_pooling_layer"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("bert", TEXT)
@ludwig_dataclass
class BERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a BERT encoder."""
@staticmethod
def module_name():
return "BERT"
type: str = schema_utils.ProtectedString(
"bert",
description=ENCODER_METADATA["BERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["BERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["BERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="bert-base-uncased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["BERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["BERT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["BERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["BERT"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["BERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30522,
description="Vocabulary size of the BERT model. Defines the number of different tokens that can be "
"represented by the inputs_ids passed when calling BertModel or TFBertModel.",
parameter_metadata=ENCODER_METADATA["BERT"]["vocab_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["BERT"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["BERT"]["num_hidden_layers"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["BERT"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["BERT"]["intermediate_size"],
)
hidden_act: str | Callable = schema_utils.StringOptions( # TODO: add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["BERT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["BERT"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["BERT"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["BERT"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=2,
description="The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.",
parameter_metadata=ENCODER_METADATA["BERT"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["BERT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["BERT"]["layer_norm_eps"],
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["BERT"]["pad_token_id"],
)
gradient_checkpointing: bool = schema_utils.Boolean(
default=False,
description="Whether to use gradient checkpointing.",
parameter_metadata=ENCODER_METADATA["BERT"]["gradient_checkpointing"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["BERT"]["position_embedding_type"],
)
classifier_dropout: float = schema_utils.FloatRange(
default=None,
allow_none=True,
min=0,
max=1,
description="The dropout ratio for the classification head.",
parameter_metadata=ENCODER_METADATA["BERT"]["classifier_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["BERT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("deberta", TEXT)
@ludwig_dataclass
class DebertaV2Config(HFEncoderImplConfig, DebertaModelParams):
"""This dataclass configures the schema used for a DeBERTa-v2 / v3 encoder."""
@staticmethod
def module_name():
return "DeBERTa"
type: str = schema_utils.ProtectedString(
"deberta",
description=ENCODER_METADATA["DeBERTa"]["type"].long_description,
)
pretrained_model_name_or_path: str = schema_utils.String(
default="sileod/deberta-v3-base-tasksource-nli",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["DeBERTa"]["pretrained_model_name_or_path"],
)
reduce_output: str = schema_utils.StringOptions(
["cls_pooled", "last", "sum", "mean", "max", "concat", "attention"],
default="sum",
allow_none=True,
description="The method used to reduce a sequence of tensors down to a single tensor.",
)
# TODO: uncomment once we figure out host memory issue: https://github.com/ludwig-ai/ludwig/issues/3107
@DeveloperAPI
# @register_encoder_config("xlm", TEXT)
@ludwig_dataclass
class XLMConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an XLM encoder."""
@staticmethod
def module_name():
return "XLM"
type: str = schema_utils.ProtectedString(
"xlm",
description=ENCODER_METADATA["XLM"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["XLM"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["XLM"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="xlm-mlm-en-2048",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLM"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["XLM"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["XLM"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["XLM"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["XLM"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30145,
description="Vocabulary size of the XLM model. Defines the number of different tokens that can be "
"represented by the inputs_ids passed when calling XLMModel or TFXLMModel.",
parameter_metadata=ENCODER_METADATA["XLM"]["vocab_size"],
)
emb_dim: int = schema_utils.PositiveInteger(
default=2048,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["XLM"]["emb_dim"],
)
n_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLM"]["n_layers"],
)
n_heads: int = schema_utils.PositiveInteger(
default=16,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLM"]["n_heads"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["XLM"]["dropout"],
)
attention_dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for the attention mechanism.",
parameter_metadata=ENCODER_METADATA["XLM"]["attention_dropout"],
)
gelu_activation: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use gelu for the activations instead of relu.",
parameter_metadata=ENCODER_METADATA["XLM"]["gelu_activation"],
)
sinusoidal_embeddings: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.",
parameter_metadata=ENCODER_METADATA["XLM"]["sinusoidal_embeddings"],
)
causal: bool = schema_utils.Boolean(
default=False,
description="Whether or not the model should behave in a causal manner. Causal models use a triangular "
"attention mask in order to only attend to the left-side context instead of a bidirectional "
"context.",
parameter_metadata=ENCODER_METADATA["XLM"]["causal"],
)
asm: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the "
"prediction layer.",
parameter_metadata=ENCODER_METADATA["XLM"]["asm"],
)
n_langs: int = schema_utils.PositiveInteger(
default=1,
description="The number of languages the model handles. Set to 1 for monolingual models.",
parameter_metadata=ENCODER_METADATA["XLM"]["n_langs"],
)
use_lang_emb: bool = schema_utils.Boolean(
default=True,
description="Whether to use language embeddings. Some models use additional language embeddings, "
"see the multilingual models page for information on how to use them.",
parameter_metadata=ENCODER_METADATA["XLM"]["use_lang_emb"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["XLM"]["max_position_embeddings"],
)
embed_init_std: float = schema_utils.NonNegativeFloat(
default=2048**-0.5,
description="The standard deviation of the truncated_normal_initializer for initializing the embedding "
"matrices.",
parameter_metadata=ENCODER_METADATA["XLM"]["embed_init_std"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["XLM"]["layer_norm_eps"],
)
init_std: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices "
"except the embedding matrices.",
parameter_metadata=ENCODER_METADATA["XLM"]["init_std"],
)
bos_index: int = schema_utils.NonNegativeInteger(
default=0,
description="The index of the beginning of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["bos_index"],
)
eos_index: int = schema_utils.NonNegativeInteger(
default=1,
description="The index of the end of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["eos_index"],
)
pad_index: int = schema_utils.NonNegativeInteger(
default=2,
description="The index of the padding token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["pad_index"],
)
unk_index: int = schema_utils.NonNegativeInteger(
default=3,
description="The index of the unknown token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["unk_index"],
)
mask_index: int = schema_utils.NonNegativeInteger(
default=5,
description="The index of the masking token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["mask_index"],
)
is_encoder: bool = schema_utils.Boolean(
default=True,
description="Whether or not the initialized model should be a transformer encoder or decoder as seen in "
"Vaswani et al.",
parameter_metadata=ENCODER_METADATA["XLM"]["is_encoder"],
)
start_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLM"]["start_n_top"],
)
end_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLM"]["end_n_top"],
)
mask_token_id: int = schema_utils.Integer(
default=0,
description="Model agnostic parameter to identify masked tokens when generating text in an MLM context.",
parameter_metadata=ENCODER_METADATA["XLM"]["mask_token_id"],
)
lang_id: int = schema_utils.Integer(
default=0,
description="The ID of the language used by the model. This parameter is used when generating text in a given "
"language.",
parameter_metadata=ENCODER_METADATA["XLM"]["lang_id"],
)
pad_token_id: int = schema_utils.Integer(
default=2,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["XLM"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=0,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLM"]["bos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLM"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("gpt", TEXT)
@ludwig_dataclass
class GPTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a GPT encoder."""
@staticmethod
def module_name():
return "GPT"
type: str = schema_utils.ProtectedString(
"gpt",
description=ENCODER_METADATA["GPT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["GPT"]["max_sequence_length"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["GPT"]["reduce_output"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["GPT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="openai-gpt",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["GPT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["GPT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["GPT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=40478,  # Matches the default vocabulary size of Hugging Face's OpenAIGPTConfig.
description="Vocabulary size of the GPT model. Defines the number of different tokens that can be "
"represented by the inputs_ids passed when calling OpenAIGPTModel or TFOpenAIGPTModel.",
parameter_metadata=ENCODER_METADATA["GPT"]["vocab_size"],
)
n_positions: int = schema_utils.PositiveInteger(
default=512,  # Maximum sequence length; matches n_ctx and Hugging Face's OpenAIGPTConfig default.
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["GPT"]["n_positions"],
)
n_ctx: int = schema_utils.PositiveInteger(
default=512,
description="Dimensionality of the causal mask (usually the same as n_positions).",
parameter_metadata=ENCODER_METADATA["GPT"]["n_ctx"],
)
n_embd: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the embeddings and hidden states.",
parameter_metadata=ENCODER_METADATA["GPT"]["n_embd"],
)
n_layer: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT"]["n_head"],
)
afn: str = schema_utils.StringOptions(
["gelu", "relu", "silu"], # gelu_new results in a KeyError.
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["GPT"]["afn"],
)
resid_pdrop: float = schema_utils.FloatRange(
default=0.1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["GPT"]["resid_pdrop"],
)
embd_pdrop: float = schema_utils.FloatRange(
default=0.1,
description="The dropout ratio for the embeddings.",
parameter_metadata=ENCODER_METADATA["GPT"]["embd_pdrop"],
)
attn_pdrop: float = schema_utils.FloatRange(
default=0.1,
description="The dropout ratio for the attention.",
parameter_metadata=ENCODER_METADATA["GPT"]["attn_pdrop"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-5,
description="The epsilon to use in the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["GPT"]["layer_norm_epsilon"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["GPT"]["initializer_range"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("gpt2", TEXT)
@ludwig_dataclass
class GPT2Config(HFEncoderConfig):
"""This dataclass configures the schema used for a GPT2 encoder."""
@staticmethod
def module_name():
return "GPT2"
type: str = schema_utils.ProtectedString(
"gpt2",
description=ENCODER_METADATA["GPT2"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["GPT2"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["GPT2"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="gpt2",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT2"]["pretrained_model_name_or_path"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["GPT2"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["GPT2"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["GPT2"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=50257,
description="Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be "
"represented by the inputs_ids passed when calling GPT2Model or TFGPT2Model.",
parameter_metadata=ENCODER_METADATA["GPT2"]["vocab_size"],
)
n_positions: int = schema_utils.PositiveInteger(
default=1024,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_positions"],
)
n_ctx: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the causal mask (usually the same as n_positions).",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_ctx"],
)
n_embd: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the embeddings and hidden states.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_embd"],
)
n_layer: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_head"],
)
n_inner: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Dimensionality of the inner feed-forward layers. If None, it is set to 4 times n_embd.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_inner"],
)
activation_function: str = schema_utils.StringOptions(
["relu", "silu", "gelu", "tanh", "gelu_new"],
default="gelu_new",
description="Activation function, to be selected in the list ['relu', 'silu', 'gelu', 'tanh', 'gelu_new'].",
parameter_metadata=ENCODER_METADATA["GPT2"]["activation_function"],
)
resid_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["GPT2"]["resid_pdrop"],
)
embd_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the embeddings.",
parameter_metadata=ENCODER_METADATA["GPT2"]["embd_pdrop"],
)
attn_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention.",
parameter_metadata=ENCODER_METADATA["GPT2"]["attn_pdrop"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-5,
description="The epsilon to use in the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["GPT2"]["layer_norm_epsilon"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["GPT2"]["initializer_range"],
)
scale_attn_weights: bool = schema_utils.Boolean(
default=True,
description="Scale attention weights by dividing by sqrt(hidden_size).",
parameter_metadata=ENCODER_METADATA["GPT2"]["scale_attn_weights"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT2"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("roberta", TEXT)
@ludwig_dataclass
class RoBERTaConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a RoBERTa encoder."""
@staticmethod
def module_name():
return "RoBERTa"
type: str = schema_utils.ProtectedString(
"roberta",
description=ENCODER_METADATA["RoBERTa"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="roberta-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Vocabulary size of the RoBERTa model.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["vocab_size"],
)
pad_token_id: int = schema_utils.Integer(
default=1,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=0,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=2,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("transformer_xl", TEXT)
@ludwig_dataclass
class TransformerXLConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a TransformerXL encoder."""
@staticmethod
def module_name():
return "TransformerXL"
type: str = schema_utils.ProtectedString(
"transformer_xl",
description=ENCODER_METADATA["TransformerXL"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="transfo-xl-wt103",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=267735,
description="Vocabulary size of the TransfoXL model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling TransfoXLModel or TFTransfoXLModel.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["vocab_size"],
)
cutoffs: list[int] = schema_utils.List(
int,
default=[20000, 40000, 200000],
description="Cutoffs for the adaptive softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["cutoffs"],
)
d_model: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the model’s hidden states.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_model"],
)
d_embed: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the embeddings",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_embed"],
)
n_head: int = schema_utils.PositiveInteger(
default=16,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["n_head"],
)
d_head: int = schema_utils.PositiveInteger(
default=64,
description="Dimensionality of the model’s heads.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_head"],
)
d_inner: int = schema_utils.PositiveInteger(
default=4096,
description="Inner dimension of the feed-forward (FF) layer.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_inner"],
)
div_val: int = schema_utils.PositiveInteger(
default=4,
description="Divisor value for adaptive input and softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["div_val"],
)
pre_lnorm: bool = schema_utils.Boolean(
default=False,
description="Whether or not to apply LayerNorm to the input instead of the output in the blocks.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["pre_lnorm"],
)
n_layer: int = schema_utils.PositiveInteger(
default=18,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["n_layer"],
)
mem_len: int = schema_utils.PositiveInteger(
default=1600,
description="Length of the retained previous heads.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["mem_len"],
)
clamp_len: int = schema_utils.PositiveInteger(
default=1000,
description="Use the same pos embeddings after clamp_len.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["clamp_len"],
)
same_length: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use the same attn length for all tokens",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["same_length"],
)
proj_share_all_but_first: bool = schema_utils.Boolean(
default=True,
description="True to share all but first projs, False not to share.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["proj_share_all_but_first"],
)
attn_type: int = schema_utils.IntegerRange(
default=0,
min=0,
max=3,
description="Attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["attn_type"],
)
sample_softmax: int = schema_utils.Integer(
default=-1,
description="Number of samples in the sampled softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["sample_softmax"],
)
adaptive: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use adaptive softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["adaptive"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["dropout"],
)
dropatt: float = schema_utils.NonNegativeFloat(
default=0.0,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["dropatt"],
)
untie_r: bool = schema_utils.Boolean(
default=True,
description="Whether or not to untie relative position biases.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["untie_r"],
)
init: str = schema_utils.String(
default="normal",
description="Parameter initializer to use.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["init"],
)
init_range: float = schema_utils.NonNegativeFloat(
default=0.01,
description="Parameters initialized by U(-init_range, init_range).",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["init_range"],
)
proj_init_std: float = schema_utils.NonNegativeFloat(
default=0.01,
description="Projection parameters initialized by N(0, proj_init_std).",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["proj_init_std"],
)
init_std: float = schema_utils.NonNegativeFloat(
default=0.02,
description="Parameters initialized by N(0, init_std)",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["init_std"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-5,
description="The epsilon to use in the layer normalization layers",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["layer_norm_epsilon"],
)
eos_token_id: int = schema_utils.Integer(
default=0,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("xlnet", TEXT)
@ludwig_dataclass
class XLNetConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an XLNet encoder."""
@staticmethod
def module_name():
return "XLNet"
type: str = schema_utils.ProtectedString(
"xlnet",
description=ENCODER_METADATA["XLNet"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["XLNet"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["XLNet"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="xlnet-base-cased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLNet"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["XLNet"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["XLNet"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["XLNet"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["XLNet"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=32000,
description="Vocabulary size of the XLNet model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling XLNetModel or TFXLNetModel.",
parameter_metadata=ENCODER_METADATA["XLNet"]["vocab_size"],
)
d_model: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["XLNet"]["d_model"],
)
n_layer: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLNet"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLNet"]["n_head"],
)
d_inner: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLNet"]["d_inner"],
)
ff_activation: str = schema_utils.StringOptions(
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler. If string, "
"'gelu', 'relu', 'silu' and 'gelu_new' are supported.",
parameter_metadata=ENCODER_METADATA["XLNet"]["ff_activation"],
)
untie_r: bool = schema_utils.Boolean(
default=True,
description="Whether or not to untie relative position biases",
parameter_metadata=ENCODER_METADATA["XLNet"]["untie_r"],
)
attn_type: str = schema_utils.StringOptions(
["bi"],
default="bi",
description="The attention type used by the model. Currently only 'bi' is supported.",
parameter_metadata=ENCODER_METADATA["XLNet"]["attn_type"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["XLNet"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["XLNet"]["layer_norm_eps"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["XLNet"]["dropout"],
)
mem_len: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of tokens to cache. The key/value pairs that have already been pre-computed in a "
"previous forward pass won’t be re-computed.",
parameter_metadata=ENCODER_METADATA["XLNet"]["mem_len"],
)
reuse_len: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of tokens in the current batch to be cached and reused in the future.",
parameter_metadata=ENCODER_METADATA["XLNet"]["reuse_len"],
)
use_mems_eval: bool = schema_utils.Boolean(
default=True,
description="Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.",
parameter_metadata=ENCODER_METADATA["XLNet"]["use_mems_eval"],
)
use_mems_train: bool = schema_utils.Boolean(
default=False,
description="Whether or not the model should make use of the recurrent memory mechanism in train mode.",
parameter_metadata=ENCODER_METADATA["XLNet"]["use_mems_train"],
)
bi_data: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use bidirectional input pipeline. Usually set to True during pretraining and "
"False during finetuning.",
parameter_metadata=ENCODER_METADATA["XLNet"]["bi_data"],
)
clamp_len: int = schema_utils.Integer(
default=-1,
description="Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no "
"clamping.",
parameter_metadata=ENCODER_METADATA["XLNet"]["clamp_len"],
)
same_length: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use the same attention length for each token.",
parameter_metadata=ENCODER_METADATA["XLNet"]["same_length"],
)
summary_type: str = schema_utils.StringOptions(
["last", "first", "mean", "cls_index", "attn"],
default="last",
description="Argument used when doing sequence summary. Used in the sequence classification and multiple "
"choice models.",
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_type"],
)
summary_use_proj: bool = schema_utils.Boolean(
default=True,
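description="Whether or not to add a projection after the vector extraction when doing sequence summary. Used "
"in the sequence classification and multiple choice models.",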
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_use_proj"],
)
summary_activation: str = schema_utils.String(
default="tanh",
description="Argument used when doing sequence summary. Used in the sequence classification and multiple "
"choice models.",
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_activation"],
)
summary_last_dropout: float = schema_utils.FloatRange(
default=0.1,
description="Used in the sequence classification and multiple choice models.",
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_last_dropout"],
)
start_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLNet"]["start_n_top"],
)
end_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLNet"]["end_n_top"],
)
pad_token_id: int = schema_utils.Integer(
default=5,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["XLNet"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=1,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLNet"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=2,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLNet"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLNet"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("distilbert", TEXT)
@ludwig_dataclass
class DistilBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a DistilBERT encoder."""
@staticmethod
def module_name():
return "DistilBERT"
type: str = schema_utils.ProtectedString(
"distilbert",
description=ENCODER_METADATA["DistilBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="distilbert-base-uncased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30522,
description="Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling DistilBertModel or TFDistilBertModel.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["vocab_size"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["dropout"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["max_position_embeddings"],
)
sinusoidal_pos_embds: bool = schema_utils.Boolean(
default=False,
description="Whether to use sinusoidal positional embeddings.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["sinusoidal_pos_embds"],
)
n_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["n_layers"],
)
n_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["n_heads"],
)
dim: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["dim"],
)
hidden_dim: int = schema_utils.PositiveInteger(
default=3072,
description="The size of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["hidden_dim"],
)
attention_dropout: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["attention_dropout"],
)
activation: str | Callable = schema_utils.StringOptions( # TODO: Add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler. If string, "
"'gelu', 'relu', 'silu' and 'gelu_new' are supported.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["activation"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["initializer_range"],
)
qa_dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["qa_dropout"],
)
seq_classif_dropout: float = schema_utils.FloatRange(
default=0.2,
min=0,
max=1,
description="The dropout probabilities used in the sequence classification and the multiple choice model "
"DistilBertForSequenceClassification.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["seq_classif_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["pretrained_kwargs"],
)
# TODO: uncomment when CTRL bug (https://github.com/ludwig-ai/ludwig/issues/2977) has been fixed to add back in
@DeveloperAPI
# @register_encoder_config("ctrl", TEXT)
@ludwig_dataclass
class CTRLConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a CTRL encoder."""
@staticmethod
def module_name():
return "CTRL"
type: str = schema_utils.ProtectedString(
"ctrl",
description=ENCODER_METADATA["CTRL"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["CTRL"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["CTRL"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="ctrl",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["CTRL"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["CTRL"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["CTRL"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["CTRL"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["CTRL"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=246534,
description="Vocabulary size of the CTRL model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling CTRLModel or TFCTRLModel.",
parameter_metadata=ENCODER_METADATA["CTRL"]["vocab_size"],
)
n_positions: int = schema_utils.PositiveInteger(
default=256,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_positions"],
)
n_ctx: int = schema_utils.PositiveInteger(
default=256,
description="Dimensionality of the causal mask (usually same as n_positions)",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_ctx"],
)
n_embd: int = schema_utils.PositiveInteger(
default=1280,
description="Dimensionality of the embeddings and hidden states.",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_embd"],
)
dff: int = schema_utils.PositiveInteger(
default=8192,
description="Dimensionality of the inner dimension of the feed forward networks (FFN).",
parameter_metadata=ENCODER_METADATA["CTRL"]["dff"],
)
n_layer: int = schema_utils.PositiveInteger(
default=48,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=16,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_head"],
)
resid_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["CTRL"]["resid_pdrop"],
)
embd_pdrop: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the embeddings.",
parameter_metadata=ENCODER_METADATA["CTRL"]["embd_pdrop"],
)
attn_pdrop: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the attention.",
parameter_metadata=ENCODER_METADATA["CTRL"]["attn_pdrop"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-6,
description="The epsilon to use in the layer normalization layers",
parameter_metadata=ENCODER_METADATA["CTRL"]["layer_norm_epsilon"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["CTRL"]["initializer_range"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["CTRL"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("camembert", TEXT)
@ludwig_dataclass
class CamemBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a CamemBERT encoder."""
@staticmethod
def module_name():
return "CamemBERT"
type: str = schema_utils.ProtectedString(
"camembert",
description=ENCODER_METADATA["CamemBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["use_pretrained"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["saved_weights_in_checkpoint"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="camembert-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["pretrained_model_name_or_path"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=32005,
description="Vocabulary size of the CamemBERT model.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["vocab_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["num_hidden_layers"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["intermediate_size"],
)
hidden_act: str | Callable = schema_utils.StringOptions( # TODO: add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=514,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=1,
description="The vocabulary size of the token_type_ids passed when calling CamembertModel or TFCamembertModel.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-05,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["layer_norm_eps"],
)
pad_token_id: int = schema_utils.Integer(
default=1,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["pad_token_id"],
)
gradient_checkpointing: bool = schema_utils.Boolean(
default=False,
description="Whether to use gradient checkpointing.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["gradient_checkpointing"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["position_embedding_type"],
)
classifier_dropout: float = schema_utils.FloatRange(
default=None,
allow_none=True,
min=0,
max=1,
description="The dropout ratio for the classification head.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["classifier_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("t5", TEXT)
@ludwig_dataclass
class T5Config(HFEncoderConfig):
"""This dataclass configures the schema used for a T5 encoder."""
@staticmethod
def module_name():
return "T5"
type: str = schema_utils.ProtectedString(
"t5",
description=ENCODER_METADATA["T5"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["T5"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["T5"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="t5-small",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["T5"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["T5"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["T5"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["T5"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["T5"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=32128,
description="Vocabulary size of the T5 model. Defines the number of different tokens that can be represented "
"by the input_ids passed when calling T5Model or TFT5Model.",
parameter_metadata=ENCODER_METADATA["T5"]["vocab_size"],
)
d_model: int = schema_utils.PositiveInteger(
default=512,
description="Size of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["T5"]["d_model"],
)
d_kv: int = schema_utils.PositiveInteger(
default=64,
description="Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // "
"num_heads.",
parameter_metadata=ENCODER_METADATA["T5"]["d_kv"],
)
d_ff: int = schema_utils.PositiveInteger(
default=2048,
description="Size of the intermediate feed forward layer in each T5Block.",
parameter_metadata=ENCODER_METADATA["T5"]["d_ff"],
)
num_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["T5"]["num_layers"],
)
num_decoder_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not "
"set.",
parameter_metadata=ENCODER_METADATA["T5"]["num_decoder_layers"],
)
num_heads: int = schema_utils.PositiveInteger(
default=8,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["T5"]["num_heads"],
)
relative_attention_num_buckets: int = schema_utils.PositiveInteger(
default=32,
description="The number of buckets to use for each attention layer.",
parameter_metadata=ENCODER_METADATA["T5"]["relative_attention_num_buckets"],
)
dropout_rate: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The ratio for all dropout layers.",
parameter_metadata=ENCODER_METADATA["T5"]["dropout_rate"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-6,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["T5"]["layer_norm_eps"],
)
initializer_factor: float = schema_utils.NonNegativeFloat(
default=1,
description="A factor for initializing all weight matrices (should be kept to 1, used internally for "
"initialization testing).",
parameter_metadata=ENCODER_METADATA["T5"]["initializer_factor"],
)
feed_forward_proj: str = schema_utils.StringOptions(
["relu", "gated-gelu"],
default="relu",
description="Type of feed forward layer to be used. Should be one of 'relu' or 'gated-gelu'. T5v1.1 uses the "
"'gated-gelu' feed forward projection. Original T5 uses 'relu'.",
parameter_metadata=ENCODER_METADATA["T5"]["feed_forward_proj"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["T5"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("flaubert", TEXT)
@ludwig_dataclass
class FlauBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a FlauBERT encoder."""
@staticmethod
def module_name():
return "FlauBERT"
type: str = schema_utils.ProtectedString(
"flaubert",
description=ENCODER_METADATA["FlauBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="flaubert/flaubert_small_cased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30145,
description="Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling FlaubertModel or TFFlaubertModel.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["vocab_size"],
)
pre_norm: bool = schema_utils.Boolean(
default=True,
description="Whether to apply the layer normalization before or after the feed forward layer following the "
"attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pre_norm"],
)
layerdrop: float = schema_utils.FloatRange(
default=0.2,
min=0,
max=1,
description="Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand "
"with Structured Dropout. ICLR 2020)",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["layerdrop"],
)
emb_dim: int = schema_utils.PositiveInteger(
default=512,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["emb_dim"],
)
n_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["n_layers"],
)
n_heads: int = schema_utils.PositiveInteger(
default=8,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["n_heads"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["dropout"],
)
attention_dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for the attention mechanism",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["attention_dropout"],
)
gelu_activation: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use a gelu activation instead of relu.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["gelu_activation"],
)
sinusoidal_embeddings: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["sinusoidal_embeddings"],
)
causal: bool = schema_utils.Boolean(
default=False,
description="Whether or not the model should behave in a causal manner. Causal models use a triangular "
"attention mask in order to only attend to the left-side context instead of a bidirectional "
"context.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["causal"],
)
asm: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the "
"prediction layer.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["asm"],
)
n_langs: int = schema_utils.PositiveInteger(
default=1,
description="The number of languages the model handles. Set to 1 for monolingual models.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["n_langs"],
)
use_lang_emb: bool = schema_utils.Boolean(
default=True,
description="Whether to use language embeddings. Some models use additional language embeddings, "
"see the multilingual models page for information on how to use them.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["use_lang_emb"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["max_position_embeddings"],
)
embed_init_std: float = schema_utils.NonNegativeFloat(
default=2048**-0.5,
description="The standard deviation of the truncated_normal_initializer for initializing the embedding "
"matrices.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["embed_init_std"],
)
init_std: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices "
"except the embedding matrices.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["init_std"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-06,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["layer_norm_eps"],
)
bos_index: int = schema_utils.NonNegativeInteger(
default=0,
description="The index of the beginning of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["bos_index"],
)
eos_index: int = schema_utils.NonNegativeInteger(
default=1,
description="The index of the end of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["eos_index"],
)
pad_index: int = schema_utils.NonNegativeInteger(
default=2,
description="The index of the padding token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pad_index"],
)
unk_index: int = schema_utils.NonNegativeInteger(
default=3,
description="The index of the unknown token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["unk_index"],
)
mask_index: int = schema_utils.NonNegativeInteger(
default=5,
description="The index of the masking token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["mask_index"],
)
is_encoder: bool = schema_utils.Boolean(
default=True,
description="Whether or not the initialized model should be a transformer encoder or decoder as seen in "
"Vaswani et al.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["is_encoder"],
)
mask_token_id: int = schema_utils.Integer(
default=0,
description="Model agnostic parameter to identify masked tokens when generating text in an MLM context.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["mask_token_id"],
)
lang_id: int = schema_utils.Integer(
default=0,
description="The ID of the language used by the model. This parameter is used when generating text in a given "
"language.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["lang_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("electra", TEXT)
@ludwig_dataclass
class ELECTRAConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an ELECTRA encoder."""
@staticmethod
def module_name():
return "ELECTRA"
type: str = schema_utils.ProtectedString(
"electra",
description=ENCODER_METADATA["ELECTRA"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="google/electra-small-discriminator",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30522,
description="Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling ElectraModel or TFElectraModel.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["vocab_size"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=128,
description="Dimensionality of the token embeddings.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["embedding_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=256,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["num_hidden_layers"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=4,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["intermediate_size"],
)
hidden_act: str | Callable = schema_utils.StringOptions( # TODO: add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=2,
description="The vocabulary size of the token_type_ids passed when calling ElectraModel or TFElectraModel.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["layer_norm_eps"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["position_embedding_type"],
)
classifier_dropout: float = schema_utils.FloatRange(
default=None,
allow_none=True,
min=0,
max=1,
description="The dropout ratio for the classification head.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["classifier_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("longformer", TEXT)
@ludwig_dataclass
class LongformerConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a Longformer encoder."""
@staticmethod
def module_name():
return "Longformer"
type: str = schema_utils.ProtectedString(
"longformer",
description=ENCODER_METADATA["Longformer"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["Longformer"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["Longformer"]["use_pretrained"],
)
attention_window: list[int] | int = schema_utils.OneOfOptionsField(
default=512,
allow_none=False,
description="Size of an attention window around each token. If an int, use the same size for all layers. To "
"specify a different window size for each layer, use a List[int] where len(attention_window) == "
"num_hidden_layers.",
field_options=[
schema_utils.PositiveInteger(allow_none=False, description="", default=512),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Longformer"]["attention_window"],
)
sep_token_id: int = schema_utils.Integer(
default=2,
description="ID of the separator token, which is used when building a sequence from multiple sequences",
parameter_metadata=ENCODER_METADATA["Longformer"]["sep_token_id"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="allenai/longformer-base-4096",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["Longformer"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ParameterMetadata(internal_only=True),
)
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["Longformer"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["Longformer"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["Longformer"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=50265,
description="Vocabulary size of the Longformer model.",
parameter_metadata=ENCODER_METADATA["Longformer"]["vocab_size"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=4098,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["Longformer"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=1,
description="The vocabulary size of the token_type_ids passed when calling LongformerEncoder",
parameter_metadata=ENCODER_METADATA["Longformer"]["type_vocab_size"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["Longformer"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("auto_transformer", TEXT)
@ludwig_dataclass
class AutoTransformerConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an AutoTransformer encoder."""
def __post_init__(self):
if self.pretrained_model_name_or_path is None:
raise ConfigValidationError(
"`pretrained_model_name_or_path` must be specified for encoder: `auto_transformer`."
)
@staticmethod
def module_name():
return "AutoTransformer"
@property
def use_pretrained(self) -> bool:
# Always set this to True since we always want to use the pretrained weights
# We don't currently support training from scratch for AutoTransformers
return True
type: str = schema_utils.ProtectedString(
"auto_transformer",
description=ENCODER_METADATA["AutoTransformer"]["type"].long_description,
)
pretrained_model_name_or_path: str = schema_utils.String(
default=None,
allow_none=True,
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["pretrained_model_name_or_path"],
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["max_sequence_length"],
)
reduce_output: str = schema_utils.ReductionOptions(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=(
"Vocabulary size of the AutoTransformer model. If None, the vocab size will be inferred "
"from the given pretrained model"
),
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["vocab_size"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("tf_idf", TEXT, model_types=[MODEL_ECD])
@ludwig_dataclass
class TfIdfEncoderConfig(SequenceEncoderConfig):
type: str = schema_utils.ProtectedString("tf_idf")
max_sequence_length: int = schema_utils.Integer(default=None, allow_none=True, parameter_metadata=INTERNAL_ONLY)
str2idf: dict[str, int] = schema_utils.Dict(parameter_metadata=INTERNAL_ONLY)
vocab: list = schema_utils.List(default=None, parameter_metadata=INTERNAL_ONLY)
vocab_size: int = schema_utils.Integer(default=None, allow_none=True, parameter_metadata=INTERNAL_ONLY)
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "TextPreprocessingConfig"):
preprocessing.compute_idf = True
def can_cache_embeddings(self) -> bool:
return True
@DeveloperAPI
@register_encoder_config("llm", TEXT, model_types=[MODEL_ECD])
@ludwig_dataclass
class LLMEncoderConfig(SequenceEncoderConfig):
type: str = schema_utils.ProtectedString("llm")
base_model: str = BaseModelDataclassField()
max_sequence_length: int = schema_utils.Integer(default=None, allow_none=True, parameter_metadata=INTERNAL_ONLY)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
quantization: QuantizationConfig | None = QuantizationConfigField().get_default_field()
model_parameters: ModelParametersConfig | None = ModelParametersConfigField().get_default_field()
================================================
FILE: ludwig/schema/encoders/text/hf_model_params.py
================================================
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import ludwig_dataclass
"""
NOTE TO DEVELOPERS: the implementation of the schema classes below must match the parameters of the HF PretrainedConfig
class exactly. This is because we convert this object into the matching HF PretrainedConfig object before passing it to
the model. Additionally, for loading and saving pretrained models, we take the config from the existing model and load
it into this config before saving. As such, if any params needed by the pretrained model are missing, we will not be
able to load checkpoints correctly.
A common mistake is to look at the PretrainedConfig __init__ method params and ignore any additional **kwargs. In some
cases, these kwargs are used to set additional params on the config object. For example, the DebertaConfig class has
`position_buckets` as a kwarg param, but it nonetheless requires this to construct the model architecture.
To debug issues with missing parameters, try printing out the `model.config` of the pretrained transformer and check
for any params it includes that are not present in your schema config.
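
As a minimal sketch of that check (not part of Ludwig itself), the comparison can be expressed as a set difference.
The helper name `find_missing_params` and the sample values are hypothetical; in practice the first argument might
come from `model.config.to_dict()` and the second from `get_hf_config_param_names()`:

```python
def find_missing_params(hf_config: dict, schema_param_names: set) -> list:
    # Return the HF config keys that have no matching schema field, sorted for stable output.
    return sorted(set(hf_config) - set(schema_param_names))


# Stand-in values for illustration only; a real check would use the pretrained model's config.
example_hf_config = {"hidden_size": 1536, "position_buckets": 256, "share_att_key": True}
example_schema_names = {"hidden_size", "share_att_key"}

print(find_missing_params(example_hf_config, example_schema_names))
```

Any name this reports is a candidate to add as a field on the matching schema class.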
"""
@ludwig_dataclass
class DebertaModelParams(schema_utils.BaseMarshmallowConfig):
@classmethod
def get_hf_config_param_names(cls) -> set[str]:
return DebertaModelParams.get_valid_field_names()
# Model architecture params for training from scratch
# TODO(travis): conditionally disable setting these when `use_pretrained=True`.
vocab_size: int = schema_utils.PositiveInteger(
default=None,
description="",
parameter_metadata=INTERNAL_ONLY,
)
hidden_size: int = schema_utils.PositiveInteger(
default=1536,
description="Dimensionality of the encoder layers and the pooler layer.",
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=24,
description="Number of hidden layers in the Transformer encoder.",
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=24,
description="Number of attention heads for each attention layer in the Transformer encoder.",
)
intermediate_size: int = schema_utils.PositiveInteger(
default=6144,
description="Dimensionality of the 'intermediate' (often named feed-forward) layer in the Transformer encoder.",
)
hidden_act: str = schema_utils.StringOptions(
options=["gelu", "relu", "silu", "tanh", "gelu_fast", "mish", "linear", "sigmoid", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
)
hidden_dropout_prob: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
)
attention_probs_dropout_prob: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the attention probabilities.",
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description=(
"The maximum sequence length that this model might ever be used with. Typically set this to something "
"large just in case (e.g., 512 or 1024 or 2048)."
),
)
type_vocab_size: int = schema_utils.NonNegativeInteger(
default=0,
description=("The vocabulary size of the `token_type_ids`."),
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description=(
"The standard deviation of the truncated_normal_initializer for initializing all weight matrices."
),
)
relative_attention: bool = schema_utils.Boolean(
default=True,
description="Whether to use relative position encoding.",
)
max_relative_positions: int = schema_utils.Integer(
default=-1,
description=(
"The range of relative positions `[-max_relative_positions, max_relative_positions]`. Values below 1 "
"(including the default of -1) fall back to the value of `max_position_embeddings`."
),
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The value used to pad input_ids.",
)
position_biased_input: bool = schema_utils.Boolean(
default=False,
description="Whether to add absolute position embeddings to the content embeddings.",
)
pos_att_type: list[str] = schema_utils.List(
default=["p2c", "c2p"],
description=(
"The type of relative position attention. It can be any combination of `['p2c', 'c2p']`, e.g. `['p2c']`, "
"`['c2p']`, or `['p2c', 'c2p']`."
),
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
)
pooler_hidden_size: int = schema_utils.PositiveInteger(
default=1536,
description="The hidden size of the pooler layers.",
)
pooler_dropout: float = schema_utils.NonNegativeFloat(
default=0,
description="The dropout ratio for the pooler layers.",
)
pooler_hidden_act: str = schema_utils.StringOptions(
options=["gelu", "relu", "silu", "tanh", "gelu_fast", "mish", "linear", "sigmoid", "gelu_new"],
default="gelu",
description="The activation function (function or string) in the pooler.",
)
position_buckets: int = schema_utils.PositiveInteger(
default=256,
description="The number of buckets to use for each attention layer.",
)
share_att_key: bool = schema_utils.Boolean(
default=True,
description="Whether the relative position (c2p and p2c) attention shares the content key and query projections.",
)
norm_rel_ebd: str = schema_utils.StringOptions(
options=["layer_norm", "none"],
default="layer_norm",
description="The normalization method for relative embeddings.",
)
================================================
FILE: ludwig/schema/encoders/text_encoders.py
================================================
from collections.abc import Callable
from typing import TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, TEXT
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.sequence_encoders import SequenceEncoderConfig
from ludwig.schema.encoders.text.hf_model_params import DebertaModelParams
from ludwig.schema.encoders.utils import register_encoder_config
from ludwig.schema.llms.base_model import BaseModelDataclassField
from ludwig.schema.llms.model_parameters import ModelParametersConfig, ModelParametersConfigField
from ludwig.schema.llms.peft import AdapterDataclassField, BaseAdapterConfig
from ludwig.schema.llms.quantization import QuantizationConfig, QuantizationConfigField
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY, ParameterMetadata
from ludwig.schema.utils import ludwig_dataclass
if TYPE_CHECKING:
from ludwig.schema.features.preprocessing.text import TextPreprocessingConfig
class HFEncoderConfig(SequenceEncoderConfig):
trainable: bool
use_pretrained: bool
pretrained_model_name_or_path: str
reduce_output: str
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "TextPreprocessingConfig"):
model_name = self.pretrained_model_name_or_path
if model_name is None and self.use_pretrained:
# no default model name, so model name is required by the subclass
raise ValueError(
f"Missing required parameter for `{self.type}` encoder: `pretrained_model_name_or_path` when "
"`use_pretrained` is True."
)
preprocessing.tokenizer = "hf_tokenizer"
preprocessing.pretrained_model_name_or_path = model_name
if not self.can_cache_embeddings():
preprocessing.cache_encoder_embeddings = False
def is_pretrained(self) -> bool:
return self.use_pretrained
def can_cache_embeddings(self) -> bool:
"""Returns true if the encoder's output embeddings will not change during training."""
return not self.trainable and self.reduce_output != "attention"
@DeveloperAPI
@ludwig_dataclass
class HFEncoderImplConfig(HFEncoderConfig):
"""This dataclass configures the base HF encoder implementation."""
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["HFEncoder"]["use_pretrained"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["HFEncoder"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
)
# Internal params set based on preprocessing metadata
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="",
parameter_metadata=INTERNAL_ONLY,
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
description="",
parameter_metadata=INTERNAL_ONLY,
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description=(
"Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub."
),
parameter_metadata=INTERNAL_ONLY,
)
@DeveloperAPI
@register_encoder_config("albert", TEXT)
@ludwig_dataclass
class ALBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an ALBERT encoder."""
@staticmethod
def module_name():
return "ALBERT"
type: str = schema_utils.ProtectedString(
"albert",
description=ENCODER_METADATA["ALBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="albert-base-v2",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["ALBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30000,
description="Vocabulary size of the ALBERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["vocab_size"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=128,
description="Dimensionality of vocabulary embeddings.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["embedding_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["num_hidden_layers"],
)
num_hidden_groups: int = schema_utils.PositiveInteger(
default=1,
description="Number of groups for the hidden layers, parameters in the same group are shared.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["num_hidden_groups"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer "
"encoder.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["intermediate_size"],
)
inner_group_num: int = schema_utils.PositiveInteger(
default=1,
description="The number of inner repetitions of the attention and FFN layers.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["inner_group_num"],
)
hidden_act: str = schema_utils.StringOptions(
["gelu", "relu", "silu", "gelu_new"],
default="gelu_new",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["ALBERT"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=2,
description="The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["layer_norm_eps"],
)
classifier_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for attached classifiers.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["classifier_dropout_prob"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["position_embedding_type"],
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=2,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=3,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["ALBERT"]["pretrained_kwargs"],
)
# TODO: uncomment when sentencepiece doesn't cause segfaults: https://github.com/ludwig-ai/ludwig/issues/2983
@DeveloperAPI
# @register_encoder_config("mt5", TEXT)
@ludwig_dataclass
class MT5Config(HFEncoderConfig):
"""This dataclass configures the schema used for an MT5 encoder."""
@staticmethod
def module_name():
return "MT5"
type: str = schema_utils.ProtectedString(
"mt5",
description=ENCODER_METADATA["MT5"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["MT5"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["MT5"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="google/mt5-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["MT5"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["MT5"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["MT5"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["MT5"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["MT5"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=250112,
description="Vocabulary size of the MT5 model. Defines the number of different tokens that can be represented "
"by the input_ids passed when calling T5Model or TFT5Model.",
parameter_metadata=ENCODER_METADATA["MT5"]["vocab_size"],
)
d_model: int = schema_utils.PositiveInteger(
default=512,
description="Size of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["MT5"]["d_model"],
)
d_kv: int = schema_utils.PositiveInteger(
default=64,
description="Size of the key, query, and value projections per attention head. Conventionally d_kv equals "
"d_model // num_heads, but T5-style models do not require it (the defaults here do not satisfy it).",
parameter_metadata=ENCODER_METADATA["MT5"]["d_kv"],
)
d_ff: int = schema_utils.PositiveInteger(
default=1024,
description="Size of the intermediate feed forward layer in each T5Block.",
parameter_metadata=ENCODER_METADATA["MT5"]["d_ff"],
)
num_layers: int = schema_utils.PositiveInteger(
default=8,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["MT5"]["num_layers"],
)
num_decoder_layers: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not "
"set.",
parameter_metadata=ENCODER_METADATA["MT5"]["num_decoder_layers"],
)
num_heads: int = schema_utils.PositiveInteger(
default=6,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["MT5"]["num_heads"],
)
relative_attention_num_buckets: int = schema_utils.PositiveInteger(
default=32,
description="The number of buckets to use for each attention layer.",
parameter_metadata=ENCODER_METADATA["MT5"]["relative_attention_num_buckets"],
)
dropout_rate: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The ratio for all dropout layers.",
parameter_metadata=ENCODER_METADATA["MT5"]["dropout_rate"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-06,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["MT5"]["layer_norm_epsilon"],
)
initializer_factor: float = schema_utils.NonNegativeFloat(
default=1.0,
description="A factor for initializing all weight matrices (should be kept to 1, used internally for "
"initialization testing)",
parameter_metadata=ENCODER_METADATA["MT5"]["initializer_factor"],
)
feed_forward_proj: str = schema_utils.StringOptions(
["relu", "gated-gelu"],
default="gated-gelu",
description="Type of feed forward layer to be used.",
parameter_metadata=ENCODER_METADATA["MT5"]["feed_forward_proj"],
)
is_encoder_decoder: bool = schema_utils.Boolean(
default=True,
description="",
parameter_metadata=ENCODER_METADATA["MT5"]["is_encoder_decoder"],
)
use_cache: bool = schema_utils.Boolean(
default=True,
description="",
parameter_metadata=ENCODER_METADATA["MT5"]["use_cache"],
)
tokenizer_class: str = schema_utils.String(
default="T5Tokenizer",
description="",
parameter_metadata=ENCODER_METADATA["MT5"]["tokenizer_class"],
)
tie_word_embeddings: bool = schema_utils.Boolean(
default=False,
description="Whether the model's input and output word embeddings should be tied.",
parameter_metadata=ENCODER_METADATA["MT5"]["tie_word_embeddings"],
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["MT5"]["pad_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=1,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["MT5"]["eos_token_id"],
)
decoder_start_token_id: int = schema_utils.Integer(
default=0,
description="If an encoder-decoder model starts decoding with a different token than _bos_, the id of that "
"token.",
parameter_metadata=ENCODER_METADATA["MT5"]["decoder_start_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["MT5"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("xlmroberta", TEXT)
@ludwig_dataclass
class XLMRoBERTaConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an XLMRoBERTa encoder."""
@staticmethod
def module_name():
return "XLMRoBERTa"
type: str = schema_utils.ProtectedString(
"xlmroberta",
description=ENCODER_METADATA["XLMRoBERTa"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="xlm-roberta-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Vocabulary size of the XLMRoBERTa model.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["vocab_size"],
)
pad_token_id: int = schema_utils.Integer(
default=1,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=0,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=2,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["eos_token_id"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=514,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=1,
description="The vocabulary size of the token_type_ids passed in.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["type_vocab_size"],
)
add_pooling_layer: bool = schema_utils.Boolean(
default=True,
description="Whether to add a pooling layer to the encoder.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["add_pooling_layer"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLMRoBERTa"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("bert", TEXT)
@ludwig_dataclass
class BERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a BERT encoder."""
@staticmethod
def module_name():
return "BERT"
type: str = schema_utils.ProtectedString(
"bert",
description=ENCODER_METADATA["BERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["BERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["BERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="bert-base-uncased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["BERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["BERT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["BERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["BERT"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["BERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30522,
description="Vocabulary size of the BERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling BertModel or TFBertModel.",
parameter_metadata=ENCODER_METADATA["BERT"]["vocab_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["BERT"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["BERT"]["num_hidden_layers"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["BERT"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["BERT"]["intermediate_size"],
)
hidden_act: str | Callable = schema_utils.StringOptions( # TODO: add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["BERT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["BERT"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["BERT"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["BERT"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=2,
description="The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.",
parameter_metadata=ENCODER_METADATA["BERT"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["BERT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["BERT"]["layer_norm_eps"],
)
pad_token_id: int = schema_utils.Integer(
default=0,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["BERT"]["pad_token_id"],
)
gradient_checkpointing: bool = schema_utils.Boolean(
default=False,
description="Whether to use gradient checkpointing.",
parameter_metadata=ENCODER_METADATA["BERT"]["gradient_checkpointing"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["BERT"]["position_embedding_type"],
)
classifier_dropout: float = schema_utils.FloatRange(
default=None,
allow_none=True,
min=0,
max=1,
description="The dropout ratio for the classification head.",
parameter_metadata=ENCODER_METADATA["BERT"]["classifier_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["BERT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("deberta", TEXT)
@ludwig_dataclass
class DebertaV2Config(HFEncoderImplConfig, DebertaModelParams):
"""This dataclass configures the schema used for a DeBERTa-v2 / v3 encoder."""
@staticmethod
def module_name():
return "DeBERTa"
type: str = schema_utils.ProtectedString(
"deberta",
description=ENCODER_METADATA["DeBERTa"]["type"].long_description,
)
pretrained_model_name_or_path: str = schema_utils.String(
default="tasksource/deberta-base-long-nli",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["DeBERTa"]["pretrained_model_name_or_path"],
)
reduce_output: str = schema_utils.StringOptions(
["cls_pooled", "last", "sum", "mean", "max", "concat", "attention"],
default="sum",
allow_none=True,
description="The method used to reduce a sequence of tensors down to a single tensor.",
)
# TODO: uncomment once we figure out host memory issue: https://github.com/ludwig-ai/ludwig/issues/3107
@DeveloperAPI
# @register_encoder_config("xlm", TEXT)
@ludwig_dataclass
class XLMConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an XLM encoder."""
@staticmethod
def module_name():
return "XLM"
type: str = schema_utils.ProtectedString(
"xlm",
description=ENCODER_METADATA["XLM"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["XLM"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["XLM"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="xlm-mlm-en-2048",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLM"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["XLM"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["XLM"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["XLM"]["reduce_output"],
)
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["XLM"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30145,
description="Vocabulary size of the XLM model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling XLMModel or TFXLMModel.",
parameter_metadata=ENCODER_METADATA["XLM"]["vocab_size"],
)
emb_dim: int = schema_utils.PositiveInteger(
default=2048,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["XLM"]["emb_dim"],
)
n_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLM"]["n_layers"],
)
n_heads: int = schema_utils.PositiveInteger(
default=16,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLM"]["n_heads"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["XLM"]["dropout"],
)
attention_dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for the attention mechanism.",
parameter_metadata=ENCODER_METADATA["XLM"]["attention_dropout"],
)
gelu_activation: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use gelu for the activations instead of relu.",
parameter_metadata=ENCODER_METADATA["XLM"]["gelu_activation"],
)
sinusoidal_embeddings: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.",
parameter_metadata=ENCODER_METADATA["XLM"]["sinusoidal_embeddings"],
)
causal: bool = schema_utils.Boolean(
default=False,
description="Whether or not the model should behave in a causal manner. Causal models use a triangular "
"attention mask in order to only attend to the left-side context instead of a bidirectional "
"context.",
parameter_metadata=ENCODER_METADATA["XLM"]["causal"],
)
asm: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the "
"prediction layer.",
parameter_metadata=ENCODER_METADATA["XLM"]["asm"],
)
n_langs: int = schema_utils.PositiveInteger(
default=1,
description="The number of languages the model handles. Set to 1 for monolingual models.",
parameter_metadata=ENCODER_METADATA["XLM"]["n_langs"],
)
use_lang_emb: bool = schema_utils.Boolean(
default=True,
description="Whether to use language embeddings. Some models use additional language embeddings, "
"see the multilingual models page for information on how to use them.",
parameter_metadata=ENCODER_METADATA["XLM"]["use_lang_emb"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["XLM"]["max_position_embeddings"],
)
embed_init_std: float = schema_utils.NonNegativeFloat(
default=2048**-0.5,
description="The standard deviation of the truncated_normal_initializer for initializing the embedding "
"matrices.",
parameter_metadata=ENCODER_METADATA["XLM"]["embed_init_std"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["XLM"]["layer_norm_eps"],
)
init_std: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices "
"except the embedding matrices.",
parameter_metadata=ENCODER_METADATA["XLM"]["init_std"],
)
bos_index: int = schema_utils.NonNegativeInteger(
default=0,
description="The index of the beginning of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["bos_index"],
)
eos_index: int = schema_utils.NonNegativeInteger(
default=1,
description="The index of the end of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["eos_index"],
)
pad_index: int = schema_utils.NonNegativeInteger(
default=2,
description="The index of the padding token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["pad_index"],
)
unk_index: int = schema_utils.NonNegativeInteger(
default=3,
description="The index of the unknown token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["unk_index"],
)
mask_index: int = schema_utils.NonNegativeInteger(
default=5,
description="The index of the masking token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["XLM"]["mask_index"],
)
is_encoder: bool = schema_utils.Boolean(
default=True,
description="Whether or not the initialized model should be a transformer encoder or decoder as seen in "
"Vaswani et al.",
parameter_metadata=ENCODER_METADATA["XLM"]["is_encoder"],
)
start_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLM"]["start_n_top"],
)
end_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLM"]["end_n_top"],
)
mask_token_id: int = schema_utils.Integer(
default=0,
description="Model agnostic parameter to identify masked tokens when generating text in an MLM context.",
parameter_metadata=ENCODER_METADATA["XLM"]["mask_token_id"],
)
lang_id: int = schema_utils.Integer(
default=0,
description="The ID of the language used by the model. This parameter is used when generating text in a given "
"language.",
parameter_metadata=ENCODER_METADATA["XLM"]["lang_id"],
)
pad_token_id: int = schema_utils.Integer(
default=2,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["XLM"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=0,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLM"]["bos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLM"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("gpt", TEXT)
@ludwig_dataclass
class GPTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a GPT encoder."""
@staticmethod
def module_name():
return "GPT"
type: str = schema_utils.ProtectedString(
"gpt",
description=ENCODER_METADATA["GPT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["GPT"]["max_sequence_length"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["GPT"]["reduce_output"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["GPT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="openai-gpt",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["GPT"]["saved_weights_in_checkpoint"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["GPT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["GPT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=40478,  # Matches the Hugging Face OpenAIGPTConfig default vocabulary size.
description="Vocabulary size of the GPT model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling OpenAIGPTModel or TFOpenAIGPTModel.",
parameter_metadata=ENCODER_METADATA["GPT"]["vocab_size"],
)
n_positions: int = schema_utils.PositiveInteger(
default=512,  # Matches the Hugging Face OpenAIGPTConfig default and n_ctx below.
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["GPT"]["n_positions"],
)
n_ctx: int = schema_utils.PositiveInteger(
default=512,
description="Dimensionality of the causal mask (usually same as n_positions)",
parameter_metadata=ENCODER_METADATA["GPT"]["n_ctx"],
)
n_embd: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the embeddings and hidden states.",
parameter_metadata=ENCODER_METADATA["GPT"]["n_embd"],
)
n_layer: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT"]["n_head"],
)
afn: str = schema_utils.StringOptions(
["gelu", "relu", "silu"], # gelu_new results in a KeyError.
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["GPT"]["afn"],
)
resid_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["GPT"]["resid_pdrop"],
)
embd_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the embeddings.",
parameter_metadata=ENCODER_METADATA["GPT"]["embd_pdrop"],
)
attn_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention.",
parameter_metadata=ENCODER_METADATA["GPT"]["attn_pdrop"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-5,
description="The epsilon to use in the layer normalization layers",
parameter_metadata=ENCODER_METADATA["GPT"]["layer_norm_epsilon"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["GPT"]["initializer_range"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("gpt2", TEXT)
@ludwig_dataclass
class GPT2Config(HFEncoderConfig):
"""This dataclass configures the schema used for a GPT2 encoder."""
@staticmethod
def module_name():
return "GPT2"
type: str = schema_utils.ProtectedString(
"gpt2",
description=ENCODER_METADATA["GPT2"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["GPT2"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["GPT2"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="gpt2",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT2"]["pretrained_model_name_or_path"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["GPT2"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["GPT2"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["GPT2"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=50257,
description="Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling GPT2Model or TFGPT2Model.",
parameter_metadata=ENCODER_METADATA["GPT2"]["vocab_size"],
)
n_positions: int = schema_utils.PositiveInteger(
default=1024,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_positions"],
)
n_ctx: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the causal mask (usually same as n_positions)",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_ctx"],
)
n_embd: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the embeddings and hidden states.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_embd"],
)
n_layer: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_head"],
)
n_inner: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd",
parameter_metadata=ENCODER_METADATA["GPT2"]["n_inner"],
)
activation_function: str = schema_utils.StringOptions(
["relu", "silu", "gelu", "tanh", "gelu_new"],
default="gelu_new",
description="Activation function, to be selected in the list ['relu', 'silu', 'gelu', 'tanh', 'gelu_new'].",
parameter_metadata=ENCODER_METADATA["GPT2"]["activation_function"],
)
resid_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["GPT2"]["resid_pdrop"],
)
embd_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the embeddings.",
parameter_metadata=ENCODER_METADATA["GPT2"]["embd_pdrop"],
)
attn_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention.",
parameter_metadata=ENCODER_METADATA["GPT2"]["attn_pdrop"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-5,
description="The epsilon to use in the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["GPT2"]["layer_norm_epsilon"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["GPT2"]["initializer_range"],
)
scale_attn_weights: bool = schema_utils.Boolean(
default=True,
description="Scale attention weights by dividing by sqrt(hidden_size).",
parameter_metadata=ENCODER_METADATA["GPT2"]["scale_attn_weights"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["GPT2"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("roberta", TEXT)
@ludwig_dataclass
class RoBERTaConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a RoBERTa encoder."""
@staticmethod
def module_name():
return "RoBERTa"
type: str = schema_utils.ProtectedString(
"roberta",
description=ENCODER_METADATA["RoBERTa"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="roberta-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Vocabulary size of the RoBERTa model.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["vocab_size"],
)
pad_token_id: int = schema_utils.Integer(
default=1,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=0,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=2,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["RoBERTa"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("transformer_xl", TEXT)
@ludwig_dataclass
class TransformerXLConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a TransformerXL encoder."""
@staticmethod
def module_name():
return "TransformerXL"
type: str = schema_utils.ProtectedString(
"transformer_xl",
description=ENCODER_METADATA["TransformerXL"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="transfo-xl-wt103",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=267735,
description="Vocabulary size of the TransfoXL model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling TransfoXLModel or TFTransfoXLModel.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["vocab_size"],
)
cutoffs: list[int] = schema_utils.List(
int,
default=[20000, 40000, 200000],
description="Cutoffs for the adaptive softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["cutoffs"],
)
d_model: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the model’s hidden states.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_model"],
)
d_embed: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the embeddings",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_embed"],
)
n_head: int = schema_utils.PositiveInteger(
default=16,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["n_head"],
)
d_head: int = schema_utils.PositiveInteger(
default=64,
description="Dimensionality of the model’s heads.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_head"],
)
d_inner: int = schema_utils.PositiveInteger(
default=4096,
description="Inner dimension of the feed-forward layers.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["d_inner"],
)
div_val: int = schema_utils.PositiveInteger(
default=4,
description="Divisor value for adaptive input and softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["div_val"],
)
pre_lnorm: bool = schema_utils.Boolean(
default=False,
description="Whether or not to apply LayerNorm to the input instead of the output in the blocks.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["pre_lnorm"],
)
n_layer: int = schema_utils.PositiveInteger(
default=18,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["n_layer"],
)
mem_len: int = schema_utils.PositiveInteger(
default=1600,
description="Length of the retained previous heads.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["mem_len"],
)
clamp_len: int = schema_utils.PositiveInteger(
default=1000,
description="Use the same pos embeddings after clamp_len.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["clamp_len"],
)
same_length: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use the same attn length for all tokens",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["same_length"],
)
proj_share_all_but_first: bool = schema_utils.Boolean(
default=True,
description="True to share all but first projs, False not to share.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["proj_share_all_but_first"],
)
attn_type: int = schema_utils.IntegerRange(
default=0,
min=0,
max=3,
description="Attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["attn_type"],
)
sample_softmax: int = schema_utils.Integer(
default=-1,
description="Number of samples in the sampled softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["sample_softmax"],
)
adaptive: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use adaptive softmax.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["adaptive"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["dropout"],
)
dropatt: float = schema_utils.NonNegativeFloat(
default=0.0,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["dropatt"],
)
untie_r: bool = schema_utils.Boolean(
default=True,
description="Whether or not to untie relative position biases.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["untie_r"],
)
init: str = schema_utils.String(
default="normal",
description="Parameter initializer to use.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["init"],
)
init_range: float = schema_utils.NonNegativeFloat(
default=0.01,
description="Parameters initialized by U(-init_range, init_range).",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["init_range"],
)
proj_init_std: float = schema_utils.NonNegativeFloat(
default=0.01,
description="Projection parameters initialized by N(0, proj_init_std).",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["proj_init_std"],
)
init_std: float = schema_utils.NonNegativeFloat(
default=0.02,
description="Parameters initialized by N(0, init_std)",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["init_std"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-5,
description="The epsilon to use in the layer normalization layers",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["layer_norm_epsilon"],
)
eos_token_id: int = schema_utils.Integer(
default=0,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["TransformerXL"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("xlnet", TEXT)
@ludwig_dataclass
class XLNetConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an XLNet encoder."""
@staticmethod
def module_name():
return "XLNet"
type: str = schema_utils.ProtectedString(
"xlnet",
description=ENCODER_METADATA["XLNet"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["XLNet"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["XLNet"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="xlnet-base-cased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLNet"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["XLNet"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["XLNet"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["XLNet"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["XLNet"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=32000,
description="Vocabulary size of the XLNet model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling XLNetModel or TFXLNetModel.",
parameter_metadata=ENCODER_METADATA["XLNet"]["vocab_size"],
)
d_model: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["XLNet"]["d_model"],
)
n_layer: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLNet"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLNet"]["n_head"],
)
d_inner: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["XLNet"]["d_inner"],
)
ff_activation: str = schema_utils.StringOptions(
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler. If string, "
"'gelu', 'relu', 'silu' and 'gelu_new' are supported.",
parameter_metadata=ENCODER_METADATA["XLNet"]["ff_activation"],
)
untie_r: bool = schema_utils.Boolean(
default=True,
description="Whether or not to untie relative position biases",
parameter_metadata=ENCODER_METADATA["XLNet"]["untie_r"],
)
attn_type: str = schema_utils.StringOptions(
["bi"],
default="bi",
description="The attention type used by the model. Currently only 'bi' is supported.",
parameter_metadata=ENCODER_METADATA["XLNet"]["attn_type"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["XLNet"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["XLNet"]["layer_norm_eps"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["XLNet"]["dropout"],
)
mem_len: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of tokens to cache. The key/value pairs that have already been pre-computed in a "
"previous forward pass won’t be re-computed.",
parameter_metadata=ENCODER_METADATA["XLNet"]["mem_len"],
)
reuse_len: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of tokens in the current batch to be cached and reused in the future.",
parameter_metadata=ENCODER_METADATA["XLNet"]["reuse_len"],
)
use_mems_eval: bool = schema_utils.Boolean(
default=True,
description="Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.",
parameter_metadata=ENCODER_METADATA["XLNet"]["use_mems_eval"],
)
use_mems_train: bool = schema_utils.Boolean(
default=False,
description="Whether or not the model should make use of the recurrent memory mechanism in train mode.",
parameter_metadata=ENCODER_METADATA["XLNet"]["use_mems_train"],
)
bi_data: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use bidirectional input pipeline. Usually set to True during pretraining and "
"False during finetuning.",
parameter_metadata=ENCODER_METADATA["XLNet"]["bi_data"],
)
clamp_len: int = schema_utils.Integer(
default=-1,
description="Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no "
"clamping.",
parameter_metadata=ENCODER_METADATA["XLNet"]["clamp_len"],
)
same_length: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use the same attention length for each token.",
parameter_metadata=ENCODER_METADATA["XLNet"]["same_length"],
)
summary_type: str = schema_utils.StringOptions(
["last", "first", "mean", "cls_index", "attn"],
default="last",
description="Argument used when doing sequence summary. Used in the sequence classification and multiple "
"choice models.",
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_type"],
)
summary_use_proj: bool = schema_utils.Boolean(
default=True,
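description="Whether or not to add a projection after the vector extraction. Used in the sequence "
"classification and multiple choice models.",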
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_use_proj"],
)
summary_activation: str = schema_utils.String(
default="tanh",
description="Argument used when doing sequence summary. Used in the sequence classification and multiple "
"choice models.",
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_activation"],
)
summary_last_dropout: float = schema_utils.FloatRange(
default=0.1,
description="Used in the sequence classification and multiple choice models.",
parameter_metadata=ENCODER_METADATA["XLNet"]["summary_last_dropout"],
)
start_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLNet"]["start_n_top"],
)
end_n_top: int = schema_utils.PositiveInteger(
default=5,
description="Used in the SQuAD evaluation script.",
parameter_metadata=ENCODER_METADATA["XLNet"]["end_n_top"],
)
pad_token_id: int = schema_utils.Integer(
default=5,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["XLNet"]["pad_token_id"],
)
bos_token_id: int = schema_utils.Integer(
default=1,
description="The beginning of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLNet"]["bos_token_id"],
)
eos_token_id: int = schema_utils.Integer(
default=2,
description="The end of sequence token ID.",
parameter_metadata=ENCODER_METADATA["XLNet"]["eos_token_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["XLNet"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("distilbert", TEXT)
@ludwig_dataclass
class DistilBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a DistilBERT encoder."""
@staticmethod
def module_name():
return "DistilBERT"
type: str = schema_utils.ProtectedString(
"distilbert",
description=ENCODER_METADATA["DistilBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="distilbert-base-uncased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30522,
description="Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling DistilBertModel or TFDistilBertModel.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["vocab_size"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["dropout"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["max_position_embeddings"],
)
sinusoidal_pos_embds: bool = schema_utils.Boolean(
default=False,
description="Whether to use sinusoidal positional embeddings.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["sinusoidal_pos_embds"],
)
n_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["n_layers"],
)
n_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["n_heads"],
)
dim: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["dim"],
)
hidden_dim: int = schema_utils.PositiveInteger(
default=3072,
description="The size of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["hidden_dim"],
)
attention_dropout: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["attention_dropout"],
)
activation: str | Callable = schema_utils.StringOptions( # TODO: Add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler. If string, "
"'gelu', 'relu', 'silu' and 'gelu_new' are supported.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["activation"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["initializer_range"],
)
qa_dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probabilities used in the question answering model DistilBertForQuestionAnswering.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["qa_dropout"],
)
seq_classif_dropout: float = schema_utils.FloatRange(
default=0.2,
min=0,
max=1,
description="The dropout probabilities used in the sequence classification and the multiple choice model "
"DistilBertForSequenceClassification.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["seq_classif_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["DistilBERT"]["pretrained_kwargs"],
)
# TODO: uncomment when CTRL bug (https://github.com/ludwig-ai/ludwig/issues/2977) has been fixed to add back in
@DeveloperAPI
# @register_encoder_config("ctrl", TEXT)
@ludwig_dataclass
class CTRLConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a CTRL encoder."""
@staticmethod
def module_name():
return "CTRL"
type: str = schema_utils.ProtectedString(
"ctrl",
description=ENCODER_METADATA["CTRL"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["CTRL"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["CTRL"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="ctrl",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["CTRL"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["CTRL"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["CTRL"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["CTRL"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["CTRL"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=246534,
description="Vocabulary size of the CTRL model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling CTRLModel or TFCTRLModel.",
parameter_metadata=ENCODER_METADATA["CTRL"]["vocab_size"],
)
n_positions: int = schema_utils.PositiveInteger(
default=256,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_positions"],
)
n_ctx: int = schema_utils.PositiveInteger(
default=256,
description="Dimensionality of the causal mask (usually the same as n_positions).",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_ctx"],
)
n_embd: int = schema_utils.PositiveInteger(
default=1280,
description="Dimensionality of the embeddings and hidden states.",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_embd"],
)
dff: int = schema_utils.PositiveInteger(
default=8192,
description="Dimensionality of the inner dimension of the feed forward networks (FFN).",
parameter_metadata=ENCODER_METADATA["CTRL"]["dff"],
)
n_layer: int = schema_utils.PositiveInteger(
default=48,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_layer"],
)
n_head: int = schema_utils.PositiveInteger(
default=16,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CTRL"]["n_head"],
)
resid_pdrop: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["CTRL"]["resid_pdrop"],
)
embd_pdrop: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the embeddings.",
parameter_metadata=ENCODER_METADATA["CTRL"]["embd_pdrop"],
)
attn_pdrop: float = schema_utils.NonNegativeFloat(
default=0.1,
description="The dropout ratio for the attention.",
parameter_metadata=ENCODER_METADATA["CTRL"]["attn_pdrop"],
)
layer_norm_epsilon: float = schema_utils.NonNegativeFloat(
default=1e-6,
description="The epsilon to use in the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["CTRL"]["layer_norm_epsilon"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["CTRL"]["initializer_range"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["CTRL"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("camembert", TEXT)
@ludwig_dataclass
class CamemBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a CamemBERT encoder."""
@staticmethod
def module_name():
return "CamemBERT"
type: str = schema_utils.ProtectedString(
"camembert",
description=ENCODER_METADATA["CamemBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["use_pretrained"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["saved_weights_in_checkpoint"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="camembert-base",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["pretrained_model_name_or_path"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=32005,
description="Vocabulary size of the CamemBERT model.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["vocab_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=768,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["num_hidden_layers"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=12,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=3072,
description="Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["intermediate_size"],
)
hidden_act: str | Callable = schema_utils.StringOptions( # TODO: add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=514,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=1,
description="The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-05,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["layer_norm_eps"],
)
pad_token_id: int = schema_utils.Integer(
default=1,
description="The ID of the token to use as padding.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["pad_token_id"],
)
gradient_checkpointing: bool = schema_utils.Boolean(
default=False,
description="Whether to use gradient checkpointing.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["gradient_checkpointing"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["position_embedding_type"],
)
classifier_dropout: float = schema_utils.FloatRange(
default=None,
allow_none=True,
min=0,
max=1,
description="The dropout ratio for the classification head.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["classifier_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["CamemBERT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("t5", TEXT)
@ludwig_dataclass
class T5Config(HFEncoderConfig):
"""This dataclass configures the schema used for a T5 encoder."""
@staticmethod
def module_name():
return "T5"
type: str = schema_utils.ProtectedString(
"t5",
description=ENCODER_METADATA["T5"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["T5"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["T5"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="t5-small",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["T5"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["T5"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["T5"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["T5"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["T5"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=32128,
description="Vocabulary size of the T5 model. Defines the number of different tokens that can be represented "
"by the input_ids passed when calling T5Model or TFT5Model.",
parameter_metadata=ENCODER_METADATA["T5"]["vocab_size"],
)
d_model: int = schema_utils.PositiveInteger(
default=512,
description="Size of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["T5"]["d_model"],
)
d_kv: int = schema_utils.PositiveInteger(
default=64,
description="Size of the key, query, value projections per attention head. d_kv has to be equal to d_model // "
"num_heads.",
parameter_metadata=ENCODER_METADATA["T5"]["d_kv"],
)
d_ff: int = schema_utils.PositiveInteger(
default=2048,
description="Size of the intermediate feed forward layer in each T5Block.",
parameter_metadata=ENCODER_METADATA["T5"]["d_ff"],
)
num_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["T5"]["num_layers"],
)
num_decoder_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not "
"set.",
parameter_metadata=ENCODER_METADATA["T5"]["num_decoder_layers"],
)
num_heads: int = schema_utils.PositiveInteger(
default=8,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["T5"]["num_heads"],
)
relative_attention_num_buckets: int = schema_utils.PositiveInteger(
default=32,
description="The number of buckets to use for each attention layer.",
parameter_metadata=ENCODER_METADATA["T5"]["relative_attention_num_buckets"],
)
dropout_rate: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The ratio for all dropout layers.",
parameter_metadata=ENCODER_METADATA["T5"]["dropout_rate"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-6,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["T5"]["layer_norm_eps"],
)
initializer_factor: float = schema_utils.NonNegativeFloat(
default=1,
description="A factor for initializing all weight matrices (should be kept to 1, used internally for "
"initialization testing).",
parameter_metadata=ENCODER_METADATA["T5"]["initializer_factor"],
)
feed_forward_proj: str = schema_utils.StringOptions(
["relu", "gated-gelu"],
default="relu",
description="Type of feed forward layer to be used. Should be one of 'relu' or 'gated-gelu'. T5v1.1 uses the "
"'gated-gelu' feed forward projection. Original T5 uses 'relu'.",
parameter_metadata=ENCODER_METADATA["T5"]["feed_forward_proj"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["T5"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("flaubert", TEXT)
@ludwig_dataclass
class FlauBERTConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a FlauBERT encoder."""
@staticmethod
def module_name():
return "FlauBERT"
type: str = schema_utils.ProtectedString(
"flaubert",
description=ENCODER_METADATA["FlauBERT"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="flaubert/flaubert_small_cased",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30145,
description="Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling FlaubertModel or TFFlaubertModel.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["vocab_size"],
)
pre_norm: bool = schema_utils.Boolean(
default=True,
description="Whether to apply the layer normalization before or after the feed forward layer following the "
"attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018).",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pre_norm"],
)
layerdrop: float = schema_utils.FloatRange(
default=0.2,
min=0,
max=1,
description="Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand "
"with Structured Dropout. ICLR 2020).",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["layerdrop"],
)
emb_dim: int = schema_utils.PositiveInteger(
default=512,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["emb_dim"],
)
n_layers: int = schema_utils.PositiveInteger(
default=6,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["n_layers"],
)
n_heads: int = schema_utils.PositiveInteger(
default=8,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["n_heads"],
)
dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["dropout"],
)
attention_dropout: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for the attention mechanism.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["attention_dropout"],
)
gelu_activation: bool = schema_utils.Boolean(
default=True,
description="Whether or not to use a gelu activation instead of relu.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["gelu_activation"],
)
sinusoidal_embeddings: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["sinusoidal_embeddings"],
)
causal: bool = schema_utils.Boolean(
default=False,
description="Whether or not the model should behave in a causal manner. Causal models use a triangular "
"attention mask in order to only attend to the left-side context instead of a bidirectional "
"context.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["causal"],
)
asm: bool = schema_utils.Boolean(
default=False,
description="Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the "
"prediction layer.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["asm"],
)
n_langs: int = schema_utils.PositiveInteger(
default=1,
description="The number of languages the model handles. Set to 1 for monolingual models.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["n_langs"],
)
use_lang_emb: bool = schema_utils.Boolean(
default=True,
description="Whether to use language embeddings. Some models use additional language embeddings, "
"see the multilingual models page for information on how to use them.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["use_lang_emb"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["max_position_embeddings"],
)
embed_init_std: float = schema_utils.NonNegativeFloat(
default=2048**-0.5,
description="The standard deviation of the truncated_normal_initializer for initializing the embedding "
"matrices.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["embed_init_std"],
)
init_std: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices "
"except the embedding matrices.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["init_std"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-06,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["layer_norm_eps"],
)
bos_index: int = schema_utils.NonNegativeInteger(
default=0,
description="The index of the beginning of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["bos_index"],
)
eos_index: int = schema_utils.NonNegativeInteger(
default=1,
description="The index of the end of sentence token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["eos_index"],
)
pad_index: int = schema_utils.NonNegativeInteger(
default=2,
description="The index of the padding token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pad_index"],
)
unk_index: int = schema_utils.NonNegativeInteger(
default=3,
description="The index of the unknown token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["unk_index"],
)
mask_index: int = schema_utils.NonNegativeInteger(
default=5,
description="The index of the masking token in the vocabulary.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["mask_index"],
)
is_encoder: bool = schema_utils.Boolean(
default=True,
description="Whether or not the initialized model should be a transformer encoder or decoder as seen in "
"Vaswani et al.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["is_encoder"],
)
mask_token_id: int = schema_utils.Integer(
default=0,
description="Model agnostic parameter to identify masked tokens when generating text in an MLM context.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["mask_token_id"],
)
lang_id: int = schema_utils.Integer(
default=0,
description="The ID of the language used by the model. This parameter is used when generating text in a given "
"language.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["lang_id"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["FlauBERT"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("electra", TEXT)
@ludwig_dataclass
class ELECTRAConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an ELECTRA encoder."""
@staticmethod
def module_name():
return "ELECTRA"
type: str = schema_utils.ProtectedString(
"electra",
description=ENCODER_METADATA["ELECTRA"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["use_pretrained"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="google/electra-small-discriminator",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["saved_weights_in_checkpoint"],
)
reduce_output: str = schema_utils.String(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=30522,
description="Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be "
"represented by the input_ids passed when calling ElectraModel or TFElectraModel.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["vocab_size"],
)
embedding_size: int = schema_utils.PositiveInteger(
default=128,
description="Dimensionality of the token embedding layer.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["embedding_size"],
)
hidden_size: int = schema_utils.PositiveInteger(
default=256,
description="Dimensionality of the encoder layers and the pooler layer.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["hidden_size"],
)
num_hidden_layers: int = schema_utils.PositiveInteger(
default=12,
description="Number of hidden layers in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["num_hidden_layers"],
)
num_attention_heads: int = schema_utils.PositiveInteger(
default=4,
description="Number of attention heads for each attention layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["num_attention_heads"],
)
intermediate_size: int = schema_utils.PositiveInteger(
default=1024,
description="Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["intermediate_size"],
)
hidden_act: str | Callable = schema_utils.StringOptions( # TODO: add support for callable
["gelu", "relu", "silu", "gelu_new"],
default="gelu",
description="The non-linear activation function (function or string) in the encoder and pooler.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["hidden_act"],
)
hidden_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["hidden_dropout_prob"],
)
attention_probs_dropout_prob: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="The dropout ratio for the attention probabilities.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["attention_probs_dropout_prob"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=512,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=2,
description="The vocabulary size of the token_type_ids passed when calling ElectraModel or TFElectraModel.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["type_vocab_size"],
)
initializer_range: float = schema_utils.NonNegativeFloat(
default=0.02,
description="The standard deviation of the truncated_normal_initializer for initializing all weight matrices.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["initializer_range"],
)
layer_norm_eps: float = schema_utils.NonNegativeFloat(
default=1e-12,
description="The epsilon used by the layer normalization layers.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["layer_norm_eps"],
)
position_embedding_type: str = schema_utils.StringOptions(
["absolute", "relative_key", "relative_key_query"],
default="absolute",
description="Type of position embedding.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["position_embedding_type"],
)
classifier_dropout: float = schema_utils.FloatRange(
default=None,
allow_none=True,
min=0,
max=1,
description="The dropout ratio for the classification head.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["classifier_dropout"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["ELECTRA"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("longformer", TEXT)
@ludwig_dataclass
class LongformerConfig(HFEncoderConfig):
"""This dataclass configures the schema used for a Longformer encoder."""
@staticmethod
def module_name():
return "Longformer"
type: str = schema_utils.ProtectedString(
"longformer",
description=ENCODER_METADATA["Longformer"]["type"].long_description,
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["Longformer"]["max_sequence_length"],
)
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. If false, the model will train from "
"scratch which is very computationally expensive.",
parameter_metadata=ENCODER_METADATA["Longformer"]["use_pretrained"],
)
attention_window: list[int] | int = schema_utils.OneOfOptionsField(
default=512,
allow_none=False,
description="Size of an attention window around each token. If an int, use the same size for all layers. To "
"specify a different window size for each layer, use a List[int] where len(attention_window) == "
"num_hidden_layers.",
field_options=[
schema_utils.PositiveInteger(allow_none=False, description="", default=512),
schema_utils.List(list_type=int, allow_none=False),
],
parameter_metadata=ENCODER_METADATA["Longformer"]["attention_window"],
)
sep_token_id: int = schema_utils.Integer(
default=2,
description="ID of the separator token, which is used when building a sequence from multiple sequences.",
parameter_metadata=ENCODER_METADATA["Longformer"]["sep_token_id"],
)
pretrained_model_name_or_path: str = schema_utils.String(
default="allenai/longformer-base-4096",
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["Longformer"]["pretrained_model_name_or_path"],
)
saved_weights_in_checkpoint: bool = schema_utils.Boolean(
default=False,
description="Are the pretrained encoder weights saved in this model's checkpoint? Automatically set to "
"True for trained models to prevent loading pretrained encoder weights from model hub.",
parameter_metadata=ParameterMetadata(internal_only=True),
)
reduce_output: str = schema_utils.String(
default="cls_pooled",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["Longformer"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["Longformer"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["Longformer"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=50265,
description="Vocabulary size of the Longformer model.",
parameter_metadata=ENCODER_METADATA["Longformer"]["vocab_size"],
)
max_position_embeddings: int = schema_utils.PositiveInteger(
default=4098,
description="The maximum sequence length that this model might ever be used with. Typically set this to "
"something large just in case (e.g., 512 or 1024 or 2048).",
parameter_metadata=ENCODER_METADATA["Longformer"]["max_position_embeddings"],
)
type_vocab_size: int = schema_utils.PositiveInteger(
default=1,
description="The vocabulary size of the token_type_ids passed when calling LongformerEncoder.",
parameter_metadata=ENCODER_METADATA["Longformer"]["type_vocab_size"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["Longformer"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("auto_transformer", TEXT)
@ludwig_dataclass
class AutoTransformerConfig(HFEncoderConfig):
"""This dataclass configures the schema used for an AutoTransformer encoder."""
def __post_init__(self):
# Always force use_pretrained=True — we don't support training from scratch for AutoTransformers
self.use_pretrained = True
if self.pretrained_model_name_or_path is None:
raise ConfigValidationError(
"`pretrained_model_name_or_path` must be specified for encoder: `auto_transformer`."
)
@staticmethod
def module_name():
return "AutoTransformer"
# Always True — we don't support training from scratch for AutoTransformers
use_pretrained: bool = schema_utils.Boolean(
default=True,
description="Whether to use the pretrained weights for the model. Always True for AutoTransformers.",
)
type: str = schema_utils.ProtectedString(
"auto_transformer",
description=ENCODER_METADATA["AutoTransformer"]["type"].long_description,
)
pretrained_model_name_or_path: str = schema_utils.String(
default=None,
allow_none=True,
description="Name or path of the pretrained model.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["pretrained_model_name_or_path"],
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Maximum length of the input sequence.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["max_sequence_length"],
)
reduce_output: str = schema_utils.ReductionOptions(
default="sum",
description="The method used to reduce a sequence of tensors down to a single tensor.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["reduce_output"],
)
trainable: bool = schema_utils.Boolean(
default=False,
description="Whether to finetune the model on your dataset.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["trainable"],
)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
vocab: list = schema_utils.List(
default=None,
description="Vocabulary for the encoder",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["vocab"],
)
vocab_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=(
"Vocabulary size of the AutoTransformer model. If None, the vocab size will be inferred "
"from the given pretrained model"
),
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["vocab_size"],
)
pretrained_kwargs: dict = schema_utils.Dict(
default=None,
description="Additional kwargs to pass to the pretrained model.",
parameter_metadata=ENCODER_METADATA["AutoTransformer"]["pretrained_kwargs"],
)
@DeveloperAPI
@register_encoder_config("tf_idf", TEXT, model_types=[MODEL_ECD])
@ludwig_dataclass
class TfIdfEncoderConfig(SequenceEncoderConfig):
type: str = schema_utils.ProtectedString("tf_idf")
max_sequence_length: int = schema_utils.Integer(default=None, allow_none=True, parameter_metadata=INTERNAL_ONLY)
str2idf: dict[str, float] = schema_utils.Dict(parameter_metadata=INTERNAL_ONLY)
vocab: list = schema_utils.List(default=None, parameter_metadata=INTERNAL_ONLY)
vocab_size: int = schema_utils.Integer(default=None, allow_none=True, parameter_metadata=INTERNAL_ONLY)
def set_fixed_preprocessing_params(self, model_type: str, preprocessing: "TextPreprocessingConfig"):
preprocessing.compute_idf = True
def can_cache_embeddings(self) -> bool:
return True
@DeveloperAPI
@register_encoder_config("llm", TEXT, model_types=[MODEL_ECD])
@ludwig_dataclass
class LLMEncoderConfig(SequenceEncoderConfig):
type: str = schema_utils.ProtectedString("llm")
base_model: str = BaseModelDataclassField()
max_sequence_length: int = schema_utils.Integer(default=None, allow_none=True, parameter_metadata=INTERNAL_ONLY)
adapter: BaseAdapterConfig | None = AdapterDataclassField()
quantization: QuantizationConfig | None = QuantizationConfigField().get_default_field()
model_parameters: ModelParametersConfig | None = ModelParametersConfigField().get_default_field()
================================================
FILE: ludwig/schema/encoders/utils.py
================================================
from dataclasses import Field
from typing import Any, TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, TYPE
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import ENCODER_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json
from ludwig.utils.registry import Registry
if TYPE_CHECKING:
from ludwig.schema.encoders.base import BaseEncoderConfig
encoder_config_registry = Registry()
@DeveloperAPI
def register_encoder_config(name: str, features: str | list[str], model_types: list[str] | None = None):
if model_types is None:
model_types = [MODEL_ECD]
if isinstance(features, str):
features = [features]
def wrap(cls):
for model_type in model_types:
for feature in features:
key = (model_type, feature)
feature_registry = encoder_config_registry.get(key, {})
feature_registry[name] = cls
encoder_config_registry[key] = feature_registry
return cls
return wrap
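# A minimal, standalone sketch of the registration pattern used by register_encoder_config above. It is not
# Ludwig's implementation: a plain dict stands in for Registry, the strings "ecd", "text" and "sequence" stand in
# for the ludwig.constants values, and register_sketch / DemoEncoderConfig are hypothetical names.

```python
# Hypothetical stand-in for encoder_config_registry: {(model_type, feature): {name: cls}}.
sketch_registry = {}


def register_sketch(name, features, model_types=None):
    """Register a class under every (model_type, feature) pair."""
    if model_types is None:
        model_types = ["ecd"]  # assumed value of MODEL_ECD
    if isinstance(features, str):
        features = [features]

    def wrap(cls):
        for model_type in model_types:
            for feature in features:
                # setdefault creates the per-key mapping the first time a key is seen.
                sketch_registry.setdefault((model_type, feature), {})[name] = cls
        return cls

    return wrap


@register_sketch("demo", ["text", "sequence"])
class DemoEncoderConfig:
    pass


registered_keys = sorted(sketch_registry)
looked_up = sketch_registry[("ecd", "text")]["demo"]
```

# Registering one class for two features yields one registry entry per (model_type, feature) pair, and a lookup by
# key and name returns the decorated class unchanged.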
@DeveloperAPI
def get_encoder_cls(model_type: str, feature: str, name: str):
return encoder_config_registry[(model_type, feature)][name]
@DeveloperAPI
def get_encoder_classes(model_type: str, feature: str) -> dict[str, type["BaseEncoderConfig"]]:
return encoder_config_registry[(model_type, feature)]
@DeveloperAPI
def get_encoder_descriptions(model_type: str, feature_type: str) -> dict[str, Any]:
"""Returns a dictionary of encoder descriptions available at the type selection.
The process works as follows:
1) Get a dictionary of valid encoders from the encoder config registry, inverting the key/value pairs so that
`valid_encoders` maps each config class's `module_name()` to its registered name.
2) Loop through the encoder metadata entries; if an entry's key matches a valid encoder, add its description
metadata to the output dictionary under the registered name.
Args:
model_type (str): The model type to get encoder descriptions for
feature_type (str): The feature type to get encoder descriptions for
Returns:
dict: A dictionary mapping encoder registered names to their respective description metadata.
"""
output = {}
valid_encoders = {
cls.module_name() if hasattr(cls, "module_name") else None: registered_name
for registered_name, cls in get_encoder_classes(model_type, feature_type).items()
}
for k, v in ENCODER_METADATA.items():
if k in valid_encoders:
output[valid_encoders[k]] = convert_metadata_to_json(v[TYPE])
return output
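# The two steps described in the get_encoder_descriptions docstring can be sketched in isolation as follows. All
# names and data here (ElectraSketch, NoNameSketch, the registered and metadata dicts) are hypothetical, and plain
# strings replace the converted metadata objects.

```python
class ElectraSketch:
    @staticmethod
    def module_name():
        return "ELECTRA"


class NoNameSketch:
    """Has no module_name(), so it is stored under the key None."""


# Hypothetical registry contents: registered name -> config class.
registered = {"electra": ElectraSketch, "plain": NoNameSketch}
# Hypothetical metadata table keyed by module name.
metadata = {"ELECTRA": "ELECTRA description", "Longformer": "Longformer description"}

# Step 1: invert to {module_name: registered_name}.
valid = {
    cls.module_name() if hasattr(cls, "module_name") else None: name
    for name, cls in registered.items()
}

# Step 2: keep only metadata entries whose key matches a module name.
descriptions = {valid[key]: value for key, value in metadata.items() if key in valid}
```

# Classes without module_name() never match a metadata key, so they are silently left out of the result.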
@DeveloperAPI
def get_encoder_conds(encoder_classes: dict[str, type["BaseEncoderConfig"]]) -> list[dict[str, Any]]:
"""Returns a JSON schema of conditionals to validate against encoder types for specific feature types."""
conds = []
for encoder_type, encoder_cls in encoder_classes.items():
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(encoder_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
encoder_cond = schema_utils.create_cond(
{"type": encoder_type},
other_props,
)
conds.append(encoder_cond)
return conds
@DeveloperAPI
def EncoderDataclassField(
model_type: str, feature_type: str, default: str, description: str = "", blocklist: list[str] = []
) -> Field:
"""Custom dataclass field that when used inside a dataclass will allow the user to specify an encoder config.
Returns: Initialized dataclass field that converts an untyped dict with params to an encoder config.
"""
encoder_registry = get_encoder_classes(model_type, feature_type)
class EncoderSelection(schema_utils.TypeSelection):
def __init__(self):
super().__init__(
registry=encoder_registry, default_value=default, description=description, allow_str_value=True
)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return encoder_registry[key]
def _jsonschema_type_mapping(self):
# NOTE: Edit carefully if necessary! We want these enums to remain in a consistent order, so do not use sets
# or other unordered data structures to chaperone the registry keys around.
#
# Also, note the placement inside this function - since this is a list, it will not update with any late
# additions to the registry (e.g. in our tests)!
enum = [e for e in encoder_registry.keys() if e not in blocklist]
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": enum,
"enumDescriptions": get_encoder_descriptions(model_type, feature_type),
"default": default,
},
},
"title": "encoder_options",
"allOf": get_encoder_conds(encoder_registry),
}
return EncoderSelection().get_default_field()
================================================
FILE: ludwig/schema/export_schema.py
================================================
"""Export Ludwig config JSON schema.
Usage:
python -m ludwig.schema.export_schema [--model-type ecd|llm|combined] [--output FILE] [--full]
ludwig export_schema [--model-type ecd|llm|combined] [--output FILE] [--full]
Generates a JSON Schema (Draft 7) for Ludwig config validation.
"""
import argparse
import json
from ludwig.config_validation.validation import get_schema
from ludwig.constants import MODEL_ECD, MODEL_LLM
from ludwig.globals import LUDWIG_VERSION
SCHEMA_BASE_URL = "https://ludwig-ai.github.io/schema"
def _strip_parameter_metadata(obj):
"""Recursively remove ``parameter_metadata`` keys from a schema dict.
The Ludwig schema generator attaches ``parameter_metadata`` objects to
every field (UI display hints, suggested values, etc.). These are useful
internally but add significant bloat to the published JSON Schema and are
not relevant for validation or IDE auto-complete.
"""
if isinstance(obj, dict):
return {k: _strip_parameter_metadata(v) for k, v in obj.items() if k != "parameter_metadata"}
if isinstance(obj, list):
return [_strip_parameter_metadata(item) for item in obj]
return obj
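# A small usage sketch for the recursive stripping above. The function body is repeated under a hypothetical name so
# the example runs on its own, and the sample schema and its metadata values are made up for illustration.

```python
def strip_parameter_metadata(obj):
    """Return a copy of obj with every "parameter_metadata" key removed at any depth."""
    if isinstance(obj, dict):
        return {k: strip_parameter_metadata(v) for k, v in obj.items() if k != "parameter_metadata"}
    if isinstance(obj, list):
        return [strip_parameter_metadata(item) for item in obj]
    return obj


# Made-up schema fragment with metadata at the top level, inside a property, and inside a list item.
sample = {
    "type": "object",
    "parameter_metadata": {"ui_display_name": "Top"},
    "properties": {
        "trainable": {"type": "boolean", "parameter_metadata": {"expected_impact": 3}},
    },
    "allOf": [{"if": {}, "parameter_metadata": None}],
}

stripped = strip_parameter_metadata(sample)
```

# The function builds new dicts and lists rather than mutating its input, so the original schema keeps its metadata.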
def export_schema(model_type: str = MODEL_ECD, *, strip_metadata: bool = True) -> dict:
"""Export the full Ludwig config JSON schema for a given model type."""
schema = get_schema(model_type)
schema["$schema"] = "http://json-schema.org/draft-07/schema#"
schema["$id"] = f"{SCHEMA_BASE_URL}/ludwig-config-{model_type}.json"
schema["title"] = f"Ludwig {model_type.upper()} Configuration"
schema["description"] = f"Configuration schema for Ludwig {model_type.upper()} models (v{LUDWIG_VERSION})"
if strip_metadata:
schema = _strip_parameter_metadata(schema)
return schema
def export_combined_schema(*, strip_metadata: bool = True) -> dict:
"""Export a combined schema that covers both ECD and LLM model types."""
ecd_schema = get_schema(MODEL_ECD)
llm_schema = get_schema(MODEL_LLM)
# Merge properties from both schemas; for keys present in both, the LLM definition overwrites the ECD one
all_properties = {}
all_properties.update(ecd_schema.get("properties", {}))
all_properties.update(llm_schema.get("properties", {}))
combined = {
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": f"{SCHEMA_BASE_URL}/ludwig-config.json",
"title": "Ludwig Configuration",
"description": f"Configuration schema for Ludwig models (v{LUDWIG_VERSION})",
"type": "object",
"properties": all_properties,
"required": ["input_features", "output_features"],
"additionalProperties": True,
}
if strip_metadata:
combined = _strip_parameter_metadata(combined)
return combined
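# A short sketch of how the two successive dict.update calls in export_combined_schema resolve shared keys. The
# property dicts below are made-up placeholders, not real Ludwig schema fragments.

```python
# Placeholder property tables; the "source" values only mark where each entry came from.
ecd_properties = {"input_features": {"source": "ecd"}, "combiner": {"source": "ecd"}}
llm_properties = {"input_features": {"source": "llm"}, "base_model": {"source": "llm"}}

all_properties = {}
all_properties.update(ecd_properties)
all_properties.update(llm_properties)  # the later update wins on shared keys

merged_keys = list(all_properties)
```

# Keys unique to either schema are kept, while a key present in both ends up with the LLM definition.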
def main(sys_argv=None):
parser = argparse.ArgumentParser(description="Export Ludwig config JSON schema")
parser.add_argument(
"--model-type",
choices=[MODEL_ECD, MODEL_LLM, "combined"],
default="combined",
help="Model type to export schema for (default: combined)",
)
parser.add_argument("--output", "-o", type=str, default=None, help="Output file (default: stdout)")
parser.add_argument(
"--full",
action="store_true",
help="Include parameter_metadata in the output (default: stripped)",
)
args = parser.parse_args(sys_argv)
strip_metadata = not args.full
if args.model_type == "combined":
schema = export_combined_schema(strip_metadata=strip_metadata)
else:
schema = export_schema(args.model_type, strip_metadata=strip_metadata)
output = json.dumps(schema, indent=2, sort_keys=False)
if args.output:
with open(args.output, "w") as f:
f.write(output)
f.write("\n")
else:
print(output)
if __name__ == "__main__":
main()
================================================
FILE: ludwig/schema/features/__init__.py
================================================
import ludwig.schema.features.audio_feature
import ludwig.schema.features.bag_feature
import ludwig.schema.features.binary_feature
import ludwig.schema.features.category_feature
import ludwig.schema.features.date_feature
import ludwig.schema.features.h3_feature
import ludwig.schema.features.image_feature
import ludwig.schema.features.number_feature
import ludwig.schema.features.sequence_feature
import ludwig.schema.features.set_feature
import ludwig.schema.features.text_feature
import ludwig.schema.features.timeseries_feature
import ludwig.schema.features.vector_feature # noqa
================================================
FILE: ludwig/schema/features/audio_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUDIO, MODEL_ECD
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import ecd_defaults_config_registry, ecd_input_config_registry, input_mixin_registry
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@ecd_defaults_config_registry.register(AUDIO)
@input_mixin_registry.register(AUDIO)
@ludwig_dataclass
class AudioInputFeatureConfigMixin(BaseMarshmallowConfig):
"""AudioInputFeatureConfigMixin is a dataclass that configures the parameters used in both the audio input
feature and the audio global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=AUDIO)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=AUDIO,
default="parallel_cnn",
)
@DeveloperAPI
@ecd_input_config_registry.register(AUDIO)
@ludwig_dataclass
class AudioInputFeatureConfig(AudioInputFeatureConfigMixin, BaseInputFeatureConfig):
"""AudioInputFeatureConfig is a dataclass that configures the parameters used for an audio input feature."""
type: str = schema_utils.ProtectedString(AUDIO)
================================================
FILE: ludwig/schema/features/augmentation/__init__.py
================================================
# Register all augmentation schemas
import ludwig.schema.features.augmentation.image # noqa: F401
================================================
FILE: ludwig/schema/features/augmentation/base.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class BaseAugmentationConfig(schema_utils.BaseMarshmallowConfig):
"""Base class for augmentation."""
type: str
================================================
FILE: ludwig/schema/features/augmentation/image.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUGMENTATION, IMAGE, TYPE
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.augmentation.base import BaseAugmentationConfig
from ludwig.schema.features.augmentation.utils import register_augmentation_config
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_augmentation_config(name="auto_augmentation", features=IMAGE)
@ludwig_dataclass
class AutoAugmentationConfig(BaseAugmentationConfig):
"""Automatic augmentation operation."""
type: str = schema_utils.ProtectedString(
"auto_augmentation",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
method: str = schema_utils.String(
default="trivial_augment",
description="Specifies the method for applying automatic data augmentation",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["auto_augmentation_method"],
)
@DeveloperAPI
@register_augmentation_config(name="random_horizontal_flip", features=IMAGE)
@ludwig_dataclass
class RandomHorizontalFlipConfig(BaseAugmentationConfig):
"""Random horizontal flip augmentation operation."""
type: str = schema_utils.ProtectedString(
"random_horizontal_flip",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
@DeveloperAPI
@register_augmentation_config(name="random_vertical_flip", features=IMAGE)
@ludwig_dataclass
class RandomVerticalFlipConfig(BaseAugmentationConfig):
"""Random vertical flip augmentation operation."""
type: str = schema_utils.ProtectedString(
"random_vertical_flip",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
@DeveloperAPI
@register_augmentation_config(name="random_rotate", features=IMAGE)
@ludwig_dataclass
class RandomRotateConfig(BaseAugmentationConfig):
"""Random rotation augmentation operation."""
type: str = schema_utils.ProtectedString(
"random_rotate",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
degree: int = schema_utils.Integer(
default=15,
description="Range of angle for random rotation, i.e., [-degree, +degree].",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["rotation_degree"],
)
@DeveloperAPI
@register_augmentation_config(name="random_blur", features=IMAGE)
@ludwig_dataclass
class RandomBlurConfig(BaseAugmentationConfig):
"""Random blur augmentation operation."""
type: str = schema_utils.ProtectedString(
"random_blur",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
kernel_size: int = schema_utils.Integer(
default=3,
description="Kernel size for random blur.",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["kernel_size"],
)
@DeveloperAPI
@register_augmentation_config(name="random_brightness", features=IMAGE)
@ludwig_dataclass
class RandomBrightnessConfig(BaseAugmentationConfig):
"""Random brightness augmentation operation."""
type: str = schema_utils.ProtectedString(
"random_brightness",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
min: float = schema_utils.FloatRange(
default=0.5,
description="Minimum factor for random brightness.",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["min_brightness"],
)
max: float = schema_utils.FloatRange(
default=2.0,
description="Maximum factor for random brightness.",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["max_brightness"],
)
@DeveloperAPI
@register_augmentation_config(name="random_contrast", features=IMAGE)
@ludwig_dataclass
class RandomContrastConfig(BaseAugmentationConfig):
"""Random contrast augmentation operation."""
type: str = schema_utils.ProtectedString(
"random_contrast",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION][TYPE],
)
min: float = schema_utils.FloatRange(
default=0.5,
description="Minimum factor for random contrast.",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["min_contrast"],
)
max: float = schema_utils.FloatRange(
default=2.0,
description="Maximum factor for random contrast.",
parameter_metadata=FEATURE_METADATA[IMAGE][AUGMENTATION]["max_contrast"],
)
================================================
FILE: ludwig/schema/features/augmentation/utils.py
================================================
import copy
from dataclasses import field
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import TYPE
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.augmentation.base import BaseAugmentationConfig
from ludwig.utils.registry import Registry
_augmentation_config_registry = Registry()
@DeveloperAPI
def get_augmentation_config_registry() -> Registry:
return _augmentation_config_registry
@DeveloperAPI
def register_augmentation_config(name: str, features: str | list[str]):
if isinstance(features, str):
features = [features]
def wrap(cls):
for feature in features:
augmentation_registry = get_augmentation_config_registry().get(feature, {})
augmentation_registry[name] = cls
get_augmentation_config_registry()[feature] = augmentation_registry
return cls
return wrap
@DeveloperAPI
def get_augmentation_cls(feature: str, name: str):
return get_augmentation_config_registry()[feature][name]
@DeveloperAPI
def get_augmentation_classes(feature: str):
return get_augmentation_config_registry()[feature]
@DeveloperAPI
def AugmentationDataclassField(
feature_type: str,
default: bool | list[dict[str, Any]] = False,
default_augmentations: list[BaseAugmentationConfig] | None = None,
description: str = "",
):
"""Custom dataclass field that when used inside a dataclass will allow the user to specify an augmentation
config.
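Args:
feature_type: The feature type whose registered augmentations are valid.
default: The default augmentation config to use: a list of augmentation dicts, or a bool (`True` selects
`default_augmentations`, `False` selects an empty list).
default_augmentations: The default list of augmentations to use when param value is set to `True`.
description: The description of the augmentation config.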
Returns: Initialized dataclass field that converts a list with params to an augmentation config.
"""
default_augmentations = default_augmentations or []
default_augmentations = [a.to_dict() for a in default_augmentations]
if isinstance(default, bool):
default = default_augmentations if default else []
class AugmentationContainerMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field that deserializes a list for a valid augmentation config from the augmentation_registry and
creates a corresponding JSON schema for external usage."""
def _deserialize(self, value, attr, data, **kwargs):
if isinstance(value, bool):
value = default_augmentations if value else []
if not isinstance(value, list):
raise ConfigValidationError(f"Augmentation config must be a list, found: {type(value)}")
augmentation_classes = get_augmentation_classes(feature_type)
augmentation_list = []
for augmentation in value:
augmentation_op = augmentation[TYPE]
if augmentation_op in augmentation_classes:
augmentation_cls = augmentation_classes[augmentation_op]
pre = augmentation_cls()
try:
augmentation_list.append(pre.Schema().load(augmentation))
except (TypeError, ConfigValidationError) as error:
raise ConfigValidationError(
f"Invalid augmentation params: {value}, see `{pre}` definition. Error: {error}"
)
else:
raise ConfigValidationError(
f"Invalid augmentation type: '{augmentation_op}', "
f"expected one of: {list(augmentation_classes.keys())}"
)
return augmentation_list
def _jsonschema_type_mapping(self):
return get_augmentation_list_jsonschema(feature_type, default)
try:
assert isinstance(default, list), "Augmentation config must be a list."
load_augmentation_list = []
dump_augmentation_list = []
for augmentation in default:
augmentation_op = augmentation[TYPE]
augmentation_cls = get_augmentation_cls(feature_type, augmentation_op)
pre = augmentation_cls()
try:
load_augmentation_list.append(pre.Schema().load(augmentation))
dump_augmentation_list.append(pre.Schema().dump(augmentation))
except (TypeError, ConfigValidationError) as error:
raise ConfigValidationError(
f"Invalid augmentation params: {default}, see `{pre}` definition. Error: {error}"
) from error
load_default = lambda: copy.deepcopy(load_augmentation_list)
dump_default = dump_augmentation_list
return field(
metadata={
"marshmallow_field": AugmentationContainerMarshmallowField(
allow_none=False,
dump_default=dump_default,
load_default=load_default,
)
},
default_factory=load_default,
)
except Exception as e:
raise ConfigValidationError(f"Unsupported augmentation type. See augmentation_registry. Details: {e}") from e
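# A minimal, self-contained sketch of the boolean handling in `_deserialize` above: `True` expands to the default
# augmentation list, `False` to an empty list, and any other non-list value is rejected. The helper name
# `normalize_augmentation`, the use of `ValueError` in place of `ConfigValidationError`, and the example type and
# parameter names are assumptions made for illustration only; they are not part of this module.

```python
from typing import Any


def normalize_augmentation(value: Any, default_augmentations: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Expand a boolean augmentation setting into a list of augmentation dicts."""
    if isinstance(value, bool):
        # True selects the default pipeline, False disables augmentation.
        value = list(default_augmentations) if value else []
    if not isinstance(value, list):
        raise ValueError(f"Augmentation config must be a list, found: {type(value)}")
    return value


# Hypothetical defaults; the real ones are passed in via `default_augmentations`.
example_defaults = [{"type": "random_horizontal_flip"}, {"type": "random_rotate"}]

enabled = normalize_augmentation(True, example_defaults)
disabled = normalize_augmentation(False, example_defaults)
explicit = normalize_augmentation([{"type": "random_rotate", "degree": 15}], example_defaults)
```

# The sketch stops before the per-item validation step, which in the code above loads each dict through the
# schema of the augmentation class registered for its `type`.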
@DeveloperAPI
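def get_augmentation_list_jsonschema(feature_type: str, default: list[dict[str, Any]]):
"""Returns the JSON schema for an augmentation config, which is either a list of augmentation dicts (each with a
required `type`) or a boolean that toggles the standard augmentation pipeline.
Returns: JSON Schema
"""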
augmentation_types = sorted(get_augmentation_config_registry()[feature_type].keys())
schema = {
"oneOf": [
{
"type": "array",
"items": {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": augmentation_types,
"title": "type",
"description": "Type of augmentation to apply.",
},
},
"additionalProperties": True,
"allOf": get_augmentation_list_conds(feature_type),
"required": ["type"],
},
"title": "array_option",
},
{"type": "boolean", "description": "Apply standard augmentation pipeline.", "title": "boolean_option"},
],
"title": "augmentation",
}
return schema
@DeveloperAPI
def get_augmentation_list_conds(feature_type: str):
"""This function returns a list of if-then JSON clauses for each augmentation type along with their properties
and constraints.
Returns: List of JSON clauses
"""
conds = []
for augmentation_op in get_augmentation_classes(feature_type):
schema_cls = get_augmentation_cls(feature_type, augmentation_op)
augmentation_schema = schema_utils.unload_jsonschema_from_marshmallow_class(schema_cls)
augmentation_props = augmentation_schema["properties"]
schema_utils.remove_duplicate_fields(augmentation_props)
augmentation_cond = schema_utils.create_cond({"type": augmentation_op}, augmentation_props)
conds.append(augmentation_cond)
return conds
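# A hedged sketch of the if-then clauses that `get_augmentation_list_conds` collects, one per augmentation type.
# The exact output of `schema_utils.create_cond` is not shown in this section, so `create_cond_sketch` below is an
# assumed stand-in, and the type names and property schemas are hypothetical placeholders for the registry.

```python
from typing import Any


def create_cond_sketch(if_pred: dict[str, Any], then_props: dict[str, Any]) -> dict[str, Any]:
    """Build one JSON Schema if-then clause keyed on constant property values."""
    return {
        "if": {"properties": {key: {"const": value} for key, value in if_pred.items()}},
        "then": {"properties": dict(then_props)},
    }


# Hypothetical per-type property schemas standing in for the augmentation registry.
example_props = {
    "random_rotate": {"degree": {"type": "integer"}},
    "random_blur": {"kernel_size": {"type": "integer"}},
}

# One clause per augmentation type, suitable for an "allOf" list in the item schema.
example_conds = [create_cond_sketch({"type": op}, props) for op, props in example_props.items()]
```

# Placed under `allOf`, each clause only constrains items whose `type` matches its `if` branch, which is how a
# single array schema can validate different parameters per augmentation type.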
================================================
FILE: ludwig/schema/features/bag_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BAG, MODEL_ECD
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import ecd_defaults_config_registry, ecd_input_config_registry, input_mixin_registry
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@ecd_defaults_config_registry.register(BAG)
@input_mixin_registry.register(BAG)
@ludwig_dataclass
class BagInputFeatureConfigMixin(BaseMarshmallowConfig):
"""BagInputFeatureConfigMixin is a dataclass that configures the parameters used in both the bag input feature
and the bag global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=BAG)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=BAG,
default="embed",
)
@DeveloperAPI
@ecd_input_config_registry.register(BAG)
@ludwig_dataclass
class BagInputFeatureConfig(BagInputFeatureConfigMixin, BaseInputFeatureConfig):
"""BagInputFeatureConfig is a dataclass that configures the parameters used for a bag input feature."""
type: str = schema_utils.ProtectedString(BAG)
================================================
FILE: ludwig/schema/features/base.py
================================================
import logging
from collections.abc import Iterable
from dataclasses import field
from typing import Any, Generic, TypeVar
from rich.console import Console
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUDIO,
BAG,
BINARY,
CATEGORY,
DATE,
H3,
IMAGE,
MODEL_ECD,
MODEL_LLM,
NUMBER,
SEQUENCE,
SET,
TEXT,
TIMESERIES,
VECTOR,
)
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.utils import (
ecd_input_config_registry,
ecd_output_config_registry,
get_input_feature_jsonschema,
get_output_feature_jsonschema,
llm_input_config_registry,
llm_output_config_registry,
)
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY, ParameterMetadata
from ludwig.schema.utils import ludwig_dataclass
logger = logging.getLogger(__name__)
_error_console = Console(stderr=True, style="bold red")
_info_console = Console(stderr=True, style="bold green")
@DeveloperAPI
@ludwig_dataclass
class BaseFeatureConfig(schema_utils.BaseMarshmallowConfig):
"""Base class for feature configs."""
def __post_init__(self):
# TODO(travis): this should be done through marshmallow dataclass' `required` field param,
# but requires a refactor.
if self.name is None:
raise ConfigValidationError("All features must have a name.")
if self.type is None:
raise ConfigValidationError(f"Feature {self.name} must have a type.")
active: bool = True
name: str = schema_utils.String(
default=None,
allow_none=True,
description="Name of the feature.",
)
type: str = schema_utils.StringOptions(
default=None,
allow_none=True,
options=[AUDIO, BAG, BINARY, CATEGORY, DATE, H3, IMAGE, NUMBER, SEQUENCE, SET, TEXT, TIMESERIES, VECTOR],
description="Type of the feature.",
)
column: str = schema_utils.String(
allow_none=True,
default=None,
description="The column name of this feature. Defaults to name if not specified.",
)
proc_column: str = schema_utils.String(
allow_none=True,
default=None,
description="The name of the preprocessed column of this feature. Internal only.",
parameter_metadata=ParameterMetadata(internal_only=True),
)
def enable(self):
"""Marks this feature as active so that it is included during model training. This is equivalent to toggling
on a feature in the model creation UI.
Returns:
None
"""
if self.active:
_error_console.print("This feature is already enabled!")
else:
self.active = True
_info_console.print(f"{self.name} feature enabled!\n")
logger.info(self.__repr__())
def disable(self):
"""Marks this feature as inactive so that it is not included during model training. This is equivalent to
toggling off a feature in the model creation UI.
Returns:
None
"""
if not self.active:
_error_console.print("This feature is already disabled!")
else:
self.active = False
_info_console.print(f"{self.name} feature disabled!\n")
logger.info(self.__repr__())
@DeveloperAPI
@ludwig_dataclass
class BaseInputFeatureConfig(BaseFeatureConfig):
"""Base input feature config class."""
tied: str = schema_utils.String(
default=None,
allow_none=True,
description="Name of input feature to tie the weights of the encoder with. It needs to be the name of a "
"feature of the same type and with the same encoder parameters. If text or sequence features are tied, "
"consider setting the `sequence_length` parameter in `preprocessing` to ensure that the tied features have "
"equal sized outputs. This is necessary when using the `sequence` combiner.",
)
def has_augmentation(self) -> bool:
return False
@DeveloperAPI
@ludwig_dataclass
class ECDInputFeatureConfig(BaseFeatureConfig):
pass
@DeveloperAPI
@ludwig_dataclass
class BaseOutputFeatureConfig(BaseFeatureConfig):
"""Base output feature config class."""
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
)
default_validation_metric: str = schema_utils.String(
default=None,
allow_none=True,
description="Internal only use parameter: default validation metric for output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list[str] = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
)
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
)
input_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Size of the input to the decoder.",
parameter_metadata=ParameterMetadata(internal_only=True),
)
num_classes: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of classes of the output feature.",
parameter_metadata=ParameterMetadata(internal_only=True),
)
T = TypeVar("T", bound=BaseFeatureConfig)
class FeatureCollection(Generic[T], schema_utils.ListSerializable):
def __init__(self, features: list[T]):
self._features = features
self._name_to_feature = {f.name: f for f in features}
for k, v in self._name_to_feature.items():
setattr(self, k, v)
def to_list(self) -> list[dict[str, Any]]:
out_list = []
for feature in self._features:
out_list.append(feature.to_dict())
return out_list
def items(self) -> Iterable[tuple[str, T]]:
return self._name_to_feature.items()
def __iter__(self):
return iter(self._features)
def __len__(self):
return len(self._features)
def __getitem__(self, i) -> T:
if isinstance(i, str):
return self._name_to_feature[i]
else:
return self._features[i]
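# A small standalone sketch of the three lookup styles that `FeatureCollection` supports: by position, by name,
# and by attribute. `ExampleFeature` and `ExampleCollection` are hypothetical stand-ins that copy only the
# relevant logic; they do not depend on `schema_utils.ListSerializable` or any feature config class.

```python
from dataclasses import dataclass


@dataclass
class ExampleFeature:
    name: str
    type: str


class ExampleCollection:
    def __init__(self, features: list[ExampleFeature]):
        self._features = features
        self._name_to_feature = {f.name: f for f in features}
        # Expose each feature as an attribute named after the feature.
        for name, feature in self._name_to_feature.items():
            setattr(self, name, feature)

    def __len__(self) -> int:
        return len(self._features)

    def __getitem__(self, key):
        # String keys look up by feature name; anything else indexes the list.
        if isinstance(key, str):
            return self._name_to_feature[key]
        return self._features[key]


features = ExampleCollection([ExampleFeature("age", "number"), ExampleFeature("title", "text")])
by_position = features[0]
by_name = features["age"]
by_attribute = features.age
```

# All three lookups return the same object. Because the attributes are set on the instance, a feature whose name
# matches a regular method (for example `items`) would shadow that method on the instance.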
class FeatureList(schema_utils.LudwigSchemaField):
"""A schema field that deserializes a list of dicts into a FeatureCollection.
Each item is resolved via the inner TypeSelection's resolve() method.
"""
def __init__(
self,
inner: schema_utils.TypeSelection,
min_length: int | None = None,
max_length: int | None = None,
equal: int | None = None,
metadata: dict | None = None,
):
self.inner = inner
self.min_length = min_length
self.max_length = max_length
self.equal = equal
self.metadata = metadata or {}
def _deserialize(self, value, attr, data, **kwargs) -> FeatureCollection:
if not isinstance(value, list):
raise ConfigValidationError(f"Expected a list of features for '{attr}', got {type(value).__name__}")
# Validate length constraints
n = len(value)
if self.equal is not None and n != self.equal:
raise ConfigValidationError(f"Expected exactly {self.equal} feature(s) for '{attr}', got {n}")
if self.min_length is not None and n < self.min_length:
raise ConfigValidationError(f"Expected at least {self.min_length} feature(s) for '{attr}', got {n}")
if self.max_length is not None and n > self.max_length:
raise ConfigValidationError(f"Expected at most {self.max_length} feature(s) for '{attr}', got {n}")
feature_list = [self.inner.resolve(item) for item in value]
return FeatureCollection(feature_list)
def _jsonschema_type_mapping(self):
inner_schema = self.inner._jsonschema_type_mapping() or {}
result = {"type": "array", "items": inner_schema}
if self.min_length is not None:
result["minItems"] = self.min_length
if self.max_length is not None:
result["maxItems"] = self.max_length
if self.equal is not None:
result["minItems"] = self.equal
result["maxItems"] = self.equal
return result
class FeaturesTypeSelection(schema_utils.TypeSelection):
def __init__(
self,
*args,
min_length: int | None = 1,
max_length: int | None = None,
supplementary_metadata=None,
**kwargs,
):
super().__init__(*args, **kwargs)
self.min_length = min_length
self.max_length = max_length
self.supplementary_metadata = {} if supplementary_metadata is None else supplementary_metadata
def get_list_field(self):
min_length = self.min_length
max_length = self.max_length
equal = None
if min_length == max_length:
min_length = None
max_length = None
equal = self.max_length
return field(
metadata={
"marshmallow_field": FeatureList(
self,
min_length=min_length,
max_length=max_length,
equal=equal,
metadata=self.supplementary_metadata,
)
},
)
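# A hedged sketch of how the length constraints flow from `FeaturesTypeSelection.get_list_field` into the checks in
# `FeatureList._deserialize`. The function names are hypothetical and `ValueError` stands in for
# `ConfigValidationError`; only the branching logic is taken from the code above.

```python
def resolve_length_constraints(min_length, max_length):
    """Collapse identical bounds into a single `equal` constraint."""
    if min_length == max_length:
        return None, None, max_length
    return min_length, max_length, None


def check_length(n, min_length=None, max_length=None, equal=None):
    """Raise if a list of `n` features violates the given constraints."""
    if equal is not None and n != equal:
        raise ValueError(f"Expected exactly {equal} feature(s), got {n}")
    if min_length is not None and n < min_length:
        raise ValueError(f"Expected at least {min_length} feature(s), got {n}")
    if max_length is not None and n > max_length:
        raise ValueError(f"Expected at most {max_length} feature(s), got {n}")


# Defaults of FeaturesTypeSelection: at least one feature, no upper bound.
default_constraints = resolve_length_constraints(1, None)
# The LLM output selection passes max_length=1, so both bounds coincide.
single_constraints = resolve_length_constraints(1, 1)
```

# With the defaults the result is `(1, None, None)`; with `max_length=1` it becomes `(None, None, 1)`, so exactly
# one feature is required.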
class ECDInputFeatureSelection(FeaturesTypeSelection):
def __init__(self):
super().__init__(
registry=ecd_input_config_registry,
description="Type of the input feature",
supplementary_metadata={"uniqueItemProperties": ["name"]},
)
def _jsonschema_type_mapping(self):
return get_input_feature_jsonschema(MODEL_ECD)
class LLMInputFeatureSelection(FeaturesTypeSelection):
def __init__(self):
super().__init__(registry=llm_input_config_registry, description="Type of the input feature")
def _jsonschema_type_mapping(self):
return get_input_feature_jsonschema(MODEL_LLM)
class ECDOutputFeatureSelection(FeaturesTypeSelection):
def __init__(self):
super().__init__(registry=ecd_output_config_registry, description="Type of the output feature")
def _jsonschema_type_mapping(self):
return get_output_feature_jsonschema(MODEL_ECD)
class LLMOutputFeatureSelection(FeaturesTypeSelection):
def __init__(self):
# TODO(Arnav): Remove the hard check on max_length once we support multiple output features.
super().__init__(max_length=1, registry=llm_output_config_registry, description="Type of the output feature")
def _jsonschema_type_mapping(self):
return get_output_feature_jsonschema(MODEL_LLM)
================================================
FILE: ludwig/schema/features/binary_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, BINARY_WEIGHTED_CROSS_ENTROPY, MODEL_ECD, ROC_AUC
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(BINARY)
@ludwig_dataclass
class BinaryInputFeatureConfigMixin(BaseMarshmallowConfig):
"""BinaryInputFeatureConfigMixin is a dataclass that configures the parameters used in both the binary input
feature and the binary global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=BINARY)
@DeveloperAPI
@ludwig_dataclass
class BinaryInputFeatureConfig(BinaryInputFeatureConfigMixin, BaseInputFeatureConfig):
"""BinaryInputFeatureConfig is a dataclass that configures the parameters used for a binary input feature."""
type: str = schema_utils.ProtectedString(BINARY)
encoder: BaseEncoderConfig = None
@DeveloperAPI
@ecd_input_config_registry.register(BINARY)
@ludwig_dataclass
class ECDBinaryInputFeatureConfig(BinaryInputFeatureConfig):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=BINARY,
default="passthrough",
)
@DeveloperAPI
@output_mixin_registry.register(BINARY)
@ludwig_dataclass
class BinaryOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""BinaryOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the binary output
feature and the binary global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = None
loss: BaseLossConfig = LossDataclassField(
feature_type=BINARY,
default=BINARY_WEIGHTED_CROSS_ENTROPY,
)
@DeveloperAPI
@ludwig_dataclass
class BinaryOutputFeatureConfig(BinaryOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""BinaryOutputFeatureConfig is a dataclass that configures the parameters used for a binary output feature."""
type: str = schema_utils.ProtectedString(BINARY)
calibration: bool = schema_utils.Boolean(
default=False,
description="Calibrate the model's output probabilities using temperature scaling.",
parameter_metadata=FEATURE_METADATA[BINARY]["calibration"],
)
default_validation_metric: str = schema_utils.StringOptions(
[ROC_AUC],
default=ROC_AUC,
description="Internal only use parameter: default validation metric for binary output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[BINARY]["dependencies"],
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="binary_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[BINARY]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[BINARY]["reduce_input"],
)
threshold: float = schema_utils.FloatRange(
default=0.5,
min=0,
max=1,
description="The threshold used to convert output probabilities to predictions. Predicted probabilities greater "
"than or equal to threshold are mapped to True.",
parameter_metadata=FEATURE_METADATA[BINARY]["threshold"],
)
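# A minimal sketch of the `threshold` semantics described above, assuming plain Python floats rather than tensors.
# The function name `to_predictions` is hypothetical; the only behavior taken from the description is that
# probabilities greater than or equal to the threshold map to True.

```python
def to_predictions(probabilities: list[float], threshold: float = 0.5) -> list[bool]:
    """Map each probability to True when it is at or above the threshold."""
    return [p >= threshold for p in probabilities]


default_predictions = to_predictions([0.2, 0.5, 0.9])
strict_predictions = to_predictions([0.2, 0.5, 0.9], threshold=0.75)
```

# Note that a probability exactly equal to the threshold counts as a positive prediction.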
@DeveloperAPI
@ecd_output_config_registry.register(BINARY)
@ludwig_dataclass
class ECDBinaryOutputFeatureConfig(BinaryOutputFeatureConfig):
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=BINARY,
default="regressor",
)
@DeveloperAPI
@ecd_defaults_config_registry.register(BINARY)
@ludwig_dataclass
class BinaryDefaultsConfig(BinaryInputFeatureConfigMixin, BinaryOutputFeatureConfigMixin):
# NOTE(travis): defaults use ECD input feature as it contains all the encoders
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=BINARY,
default="passthrough",
)
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=BINARY,
default="regressor",
)
================================================
FILE: ludwig/schema/features/category_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ACCURACY, CATEGORY, CATEGORY_DISTRIBUTION, MODEL_ECD, MODEL_LLM, SOFTMAX_CROSS_ENTROPY
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
llm_output_config_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(CATEGORY)
@ludwig_dataclass
class CategoryInputFeatureConfigMixin(BaseMarshmallowConfig):
"""CategoryInputFeatureConfigMixin is a dataclass that configures the parameters used in both the category
input feature and the category global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=CATEGORY)
@DeveloperAPI
@ludwig_dataclass
class CategoryInputFeatureConfig(CategoryInputFeatureConfigMixin, BaseInputFeatureConfig):
"""CategoryInputFeatureConfig is a dataclass that configures the parameters used for a category input
feature."""
type: str = schema_utils.ProtectedString(CATEGORY)
encoder: BaseEncoderConfig = None
@DeveloperAPI
@ecd_input_config_registry.register(CATEGORY)
@ludwig_dataclass
class ECDCategoryInputFeatureConfig(CategoryInputFeatureConfig):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=CATEGORY,
default="dense",
)
@DeveloperAPI
@output_mixin_registry.register(CATEGORY)
@ludwig_dataclass
class CategoryOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""CategoryOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the category
output feature and the category global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = None
loss: BaseLossConfig = LossDataclassField(
feature_type=CATEGORY,
default=SOFTMAX_CROSS_ENTROPY,
)
@DeveloperAPI
@ludwig_dataclass
class CategoryOutputFeatureConfig(CategoryOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""CategoryOutputFeatureConfig is a dataclass that configures the parameters used for a category output
feature."""
type: str = schema_utils.ProtectedString(CATEGORY)
calibration: bool = schema_utils.Boolean(
default=False,
description="Calibrate the model's output probabilities using temperature scaling.",
parameter_metadata=FEATURE_METADATA[CATEGORY]["calibration"],
)
default_validation_metric: str = schema_utils.StringOptions(
[ACCURACY],
default=ACCURACY,
description="Internal only use parameter: default validation metric for category output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[CATEGORY]["dependencies"],
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="category_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[CATEGORY]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[CATEGORY]["reduce_input"],
)
top_k: int = schema_utils.PositiveInteger(
default=3,
description="Determines the parameter k, the number of categories to consider when computing the top_k "
"measure. It computes accuracy, counting a prediction as a match if the true category appears in the "
"first k predicted categories ranked by the decoder's confidence.",
parameter_metadata=FEATURE_METADATA[CATEGORY]["top_k"],
)
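# A hedged sketch of the top_k measure described above, using plain dictionaries of per-category confidences.
# `top_k_accuracy` and the example data are hypothetical and are not the metric implementation used for training.

```python
def top_k_accuracy(predictions: list[dict[str, float]], targets: list[str], k: int = 3) -> float:
    """Fraction of examples whose true category is among the k most confident predictions."""
    hits = 0
    for confidences, target in zip(predictions, targets):
        # Rank categories from most to least confident.
        ranked = sorted(confidences, key=confidences.get, reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)


example_predictions = [
    {"cat": 0.6, "dog": 0.3, "bird": 0.1},
    {"cat": 0.5, "dog": 0.1, "bird": 0.4},
]
example_targets = ["dog", "dog"]

top_1 = top_k_accuracy(example_predictions, example_targets, k=1)
top_2 = top_k_accuracy(example_predictions, example_targets, k=2)
top_3 = top_k_accuracy(example_predictions, example_targets, k=3)
```

# In this example "dog" is never the top prediction, is within the top two for the first example only, and is
# always within the top three.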
@DeveloperAPI
@ecd_output_config_registry.register(CATEGORY)
@ludwig_dataclass
class ECDCategoryOutputFeatureConfig(CategoryOutputFeatureConfig):
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=CATEGORY,
default="classifier",
)
@DeveloperAPI
@ecd_output_config_registry.register(CATEGORY_DISTRIBUTION)
@ludwig_dataclass
class CategoryDistributionOutputFeatureConfig(CategoryOutputFeatureConfig):
"""CategoryDistributionOutputFeatureConfig is a dataclass that configures the parameters used for a
category_distribution output feature."""
type: str = schema_utils.ProtectedString(CATEGORY_DISTRIBUTION)
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=CATEGORY,
default="classifier",
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="category_distribution_output")
@DeveloperAPI
@ecd_defaults_config_registry.register(CATEGORY)
@ludwig_dataclass
class CategoryDefaultsConfig(CategoryInputFeatureConfigMixin, CategoryOutputFeatureConfigMixin):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=CATEGORY,
default="dense",
)
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=CATEGORY,
default="classifier",
)
@DeveloperAPI
@ecd_defaults_config_registry.register(CATEGORY_DISTRIBUTION)
@ludwig_dataclass
class CategoryDistributionDefaultsConfig(CategoryOutputFeatureConfigMixin):
pass
@DeveloperAPI
@llm_output_config_registry.register(CATEGORY)
@ludwig_dataclass
class LLMCategoryOutputFeatureConfig(CategoryOutputFeatureConfig):
"""LLMCategoryOutputFeatureConfig is a dataclass that configures the parameters used for a category output
feature when using the LLM model type."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="category_llm")
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_LLM,
feature_type=CATEGORY,
default="category_extractor",
)
================================================
FILE: ludwig/schema/features/date_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DATE, MODEL_ECD
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import ecd_defaults_config_registry, ecd_input_config_registry, input_mixin_registry
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@ecd_defaults_config_registry.register(DATE)
@input_mixin_registry.register(DATE)
@ludwig_dataclass
class DateInputFeatureConfigMixin(BaseMarshmallowConfig):
"""DateInputFeatureConfigMixin is a dataclass that configures the parameters used in both the date input
feature and the date global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=DATE)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=DATE,
default="embed",
)
@DeveloperAPI
@ecd_input_config_registry.register(DATE)
@ludwig_dataclass
class DateInputFeatureConfig(DateInputFeatureConfigMixin, BaseInputFeatureConfig):
"""DateInputFeatureConfig is a dataclass that configures the parameters used for a date input feature."""
type: str = schema_utils.ProtectedString(DATE)
================================================
FILE: ludwig/schema/features/h3_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import H3, MODEL_ECD
from ludwig.schema import utils as schema_utils
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import ecd_defaults_config_registry, ecd_input_config_registry, input_mixin_registry
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@ecd_defaults_config_registry.register(H3)
@input_mixin_registry.register(H3)
@ludwig_dataclass
class H3InputFeatureConfigMixin(BaseMarshmallowConfig):
"""H3InputFeatureConfigMixin is a dataclass that configures the parameters used in both the h3 input feature
and the h3 global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=H3)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=H3,
default="embed",
)
@DeveloperAPI
@ecd_input_config_registry.register(H3)
@ludwig_dataclass
class H3InputFeatureConfig(H3InputFeatureConfigMixin, BaseInputFeatureConfig):
"""H3InputFeatureConfig is a dataclass that configures the parameters used for an h3 input feature."""
type: str = schema_utils.ProtectedString(H3)
================================================
FILE: ludwig/schema/features/image_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import IMAGE, LOSS, MODEL_ECD, SOFTMAX_CROSS_ENTROPY
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.augmentation.base import BaseAugmentationConfig
from ludwig.schema.features.augmentation.image import RandomHorizontalFlipConfig, RandomRotateConfig
from ludwig.schema.features.augmentation.utils import AugmentationDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
# Augmentation operations when augmentation is set to True
AUGMENTATION_DEFAULT_OPERATIONS = [
RandomHorizontalFlipConfig(),
RandomRotateConfig(),
]
@DeveloperAPI
@ecd_defaults_config_registry.register(IMAGE)
@input_mixin_registry.register(IMAGE)
@ludwig_dataclass
class ImageInputFeatureConfigMixin(BaseMarshmallowConfig):
"""ImageInputFeatureConfigMixin is a dataclass that configures the parameters used in both the image input
feature and the image global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=IMAGE)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=IMAGE,
default="stacked_cnn",
)
augmentation: list[BaseAugmentationConfig] = AugmentationDataclassField(
feature_type=IMAGE,
default=False,
default_augmentations=AUGMENTATION_DEFAULT_OPERATIONS,
description="Augmentation operation configuration.",
)
def has_augmentation(self) -> bool:
# Check for None, False, and []
return bool(self.augmentation)
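# A short illustration of why `bool(self.augmentation)` covers the three "off" cases named in the comment above.
# The standalone function and the example augmentation dict are hypothetical; only the truthiness check is taken
# from `has_augmentation`.

```python
def has_augmentation(augmentation) -> bool:
    """Return True only for a non-empty augmentation list."""
    return bool(augmentation)


off_cases = [has_augmentation(value) for value in (None, False, [])]
on_case = has_augmentation([{"type": "random_rotate"}])
```

# `None`, `False`, and an empty list are all falsy, so augmentation is reported as enabled only when at least one
# operation is configured.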
@DeveloperAPI
@ecd_input_config_registry.register(IMAGE)
@ludwig_dataclass
class ImageInputFeatureConfig(ImageInputFeatureConfigMixin, BaseInputFeatureConfig):
"""ImageInputFeatureConfig is a dataclass that configures the parameters used for an image input feature."""
type: str = schema_utils.ProtectedString(IMAGE)
@DeveloperAPI
@output_mixin_registry.register(IMAGE)
@ludwig_dataclass
class ImageOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""ImageOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the image output
feature and the image global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=IMAGE,
default="unet",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=IMAGE,
default=SOFTMAX_CROSS_ENTROPY,
)
@DeveloperAPI
@ecd_output_config_registry.register(IMAGE)
@ludwig_dataclass
class ImageOutputFeatureConfig(ImageOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""ImageOutputFeatureConfig is a dataclass that configures the parameters used for an image output feature."""
type: str = schema_utils.ProtectedString(IMAGE)
dependencies: list = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[IMAGE]["dependencies"],
)
default_validation_metric: str = schema_utils.StringOptions(
[LOSS],
default=LOSS,
description="Internal only use parameter: default validation metric for image output feature.",
parameter_metadata=INTERNAL_ONLY,
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="image_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default=None,
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[IMAGE]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default=None,
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[IMAGE]["reduce_input"],
)
@DeveloperAPI
@ecd_defaults_config_registry.register(IMAGE)
@ludwig_dataclass
class ImageDefaultsConfig(ImageInputFeatureConfigMixin, ImageOutputFeatureConfigMixin):
pass
================================================
FILE: ludwig/schema/features/loss/__init__.py
================================================
from ludwig.schema.features.loss.loss import get_loss_classes, get_loss_cls, get_loss_schema_registry # noqa
================================================
FILE: ludwig/schema/features/loss/loss.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
BINARY,
BINARY_WEIGHTED_CROSS_ENTROPY,
CATEGORY,
CORN,
HUBER,
IMAGE,
MEAN_ABSOLUTE_ERROR,
MEAN_ABSOLUTE_PERCENTAGE_ERROR,
MEAN_SQUARED_ERROR,
NEXT_TOKEN_SOFTMAX_CROSS_ENTROPY,
NUMBER,
ROOT_MEAN_SQUARED_ERROR,
ROOT_MEAN_SQUARED_PERCENTAGE_ERROR,
SEQUENCE,
SEQUENCE_SOFTMAX_CROSS_ENTROPY,
SET,
SIGMOID_CROSS_ENTROPY,
SOFTMAX_CROSS_ENTROPY,
TEXT,
TIMESERIES,
VECTOR,
)
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import LOSS_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.registry import Registry
ROBUST_LAMBDA_DESCRIPTION = (
"Replaces the loss with `(1 - robust_lambda) * loss + robust_lambda / c` where `c` is the number of "
"classes. Useful in case of noisy labels."
)
CONFIDENCE_PENALTY_DESCRIPTION = (
"Penalizes overconfident (low-entropy) predictions by adding an "
"`a * (max_entropy - entropy) / max_entropy` term to the loss, where `a` is the value of this "
"parameter. Useful in case of noisy labels."
)
CLASS_WEIGHTS_DESCRIPTION = (
"Weights to apply to each class in the loss. If not specified, all classes are weighted equally. "
"The value can be a vector of weights, one for each class, that is multiplied to the "
"loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in "
"case of unbalanced class distribution. The ordering of the vector follows the category to integer ID "
"mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value "
"can be a dictionary with class strings as keys and weights as values, like "
"`{class_a: 0.5, class_b: 0.7, ...}`."
)
CLASS_SIMILARITIES_DESCRIPTION = (
"If not `null` it is a `c x c` matrix in the form of a list of lists that contains the mutual similarity of "
"classes. It is used if `class_similarities_temperature` is greater than 0. The ordering of the vector follows "
"the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too)."
)
CLASS_SIMILARITIES_TEMPERATURE_DESCRIPTION = (
"The temperature parameter of the softmax that is performed on each row of `class_similarities`. The output of "
"that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be "
"provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more "
"tolerable than errors between really different classes."
)
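# A minimal sketch of the three formulas described in the strings above, in plain Python.
# It is not Ludwig's implementation. The function names are hypothetical, and dividing the
# similarity row by the temperature before the softmax is an assumed (conventional) reading
# of "temperature". The robust formula is evaluated with a fractional lambda purely to show
# the interpolation.

```python
import math


def robust_loss(loss: float, robust_lambda: float, num_classes: int) -> float:
    # (1 - robust_lambda) * loss + robust_lambda / c, with c the number of classes.
    return (1 - robust_lambda) * loss + robust_lambda / num_classes


def confidence_penalty(probs: list[float], a: float) -> float:
    # a * (max_entropy - entropy) / max_entropy; zero for a uniform prediction.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    return a * (max_entropy - entropy) / max_entropy


def soft_targets(similarity_row: list[float], temperature: float) -> list[float]:
    # Softmax over one row of class_similarities (assumed: row / temperature).
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = [s / temperature for s in similarity_row]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

# A lower temperature concentrates the supervision vector on the most similar class, while a
# higher temperature spreads it across classes.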
@DeveloperAPI
@ludwig_dataclass
class BaseLossConfig(schema_utils.BaseMarshmallowConfig):
"""Base class for loss configs."""
type: str
weight: float = 1.0
@classmethod
def name(cls) -> str:
return "[undefined]"
_loss_registry = Registry[type[BaseLossConfig]]()
_loss_feature_registry = Registry[dict[str, type[BaseLossConfig]]]()
@DeveloperAPI
def get_loss_schema_registry() -> Registry[type[BaseLossConfig]]:
return _loss_registry
@DeveloperAPI
def get_loss_cls(feature: str, name: str) -> type[BaseLossConfig]:
return _loss_feature_registry[feature][name]
@DeveloperAPI
def get_loss_classes(feature: str) -> dict[str, type[BaseLossConfig]]:
return _loss_feature_registry[feature]
def register_loss(features: str | list[str]):
if isinstance(features, str):
features = [features]
def wrap(cls: type[BaseLossConfig]):
_loss_registry[cls.type] = cls
for feature in features:
feature_registry = _loss_feature_registry.get(feature, {})
feature_registry[cls.type] = cls
_loss_feature_registry[feature] = feature_registry
return cls
return wrap
@DeveloperAPI
@register_loss([NUMBER, TIMESERIES, VECTOR])
@ludwig_dataclass
class MSELossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
MEAN_SQUARED_ERROR,
description="Type of loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["MSELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Mean Squared Error (MSE)"
@DeveloperAPI
@register_loss([NUMBER, TIMESERIES, VECTOR])
@ludwig_dataclass
class MAELossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
MEAN_ABSOLUTE_ERROR,
description="Type of loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["MAELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Mean Absolute Error (MAE)"
@DeveloperAPI
@register_loss([NUMBER, TIMESERIES, VECTOR])
@ludwig_dataclass
class MAPELossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
MEAN_ABSOLUTE_PERCENTAGE_ERROR,
description="Type of loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["MAELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Mean Absolute Percentage Error (MAPE)"
@DeveloperAPI
@register_loss([NUMBER])
@ludwig_dataclass
class RMSELossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
ROOT_MEAN_SQUARED_ERROR,
description="Type of loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["RMSELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Root Mean Squared Error (RMSE)"
@DeveloperAPI
@register_loss([NUMBER])
@ludwig_dataclass
class RMSPELossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
ROOT_MEAN_SQUARED_PERCENTAGE_ERROR,
description="Type of loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["RMSPELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Root Mean Squared Percentage Error (RMSPE)"
@DeveloperAPI
@register_loss([BINARY])
@ludwig_dataclass
class BWCEWLossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
BINARY_WEIGHTED_CROSS_ENTROPY,
description="Type of loss.",
)
positive_class_weight: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="Weight of the positive class.",
parameter_metadata=LOSS_METADATA["BWCEWLoss"]["positive_class_weight"],
)
robust_lambda: int = schema_utils.NonNegativeInteger(
default=0,
description=ROBUST_LAMBDA_DESCRIPTION,
parameter_metadata=LOSS_METADATA["BWCEWLoss"]["robust_lambda"],
)
confidence_penalty: float = schema_utils.NonNegativeFloat(
default=0,
description=CONFIDENCE_PENALTY_DESCRIPTION,
parameter_metadata=LOSS_METADATA["BWCEWLoss"]["confidence_penalty"],
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["BWCEWLoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Binary Weighted Cross Entropy (BWCE)"
@DeveloperAPI
@register_loss([CATEGORY, VECTOR, IMAGE])
@ludwig_dataclass
class SoftmaxCrossEntropyLossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
SOFTMAX_CROSS_ENTROPY,
description="Type of loss.",
)
class_weights: list[float] | dict | None = schema_utils.OneOfOptionsField(
default=None,
description=CLASS_WEIGHTS_DESCRIPTION,
field_options=[
schema_utils.Dict(default=None, allow_none=True),
schema_utils.List(list_type=float, allow_none=False),
],
parameter_metadata=LOSS_METADATA["SoftmaxCrossEntropyLoss"]["class_weights"],
)
robust_lambda: int = schema_utils.NonNegativeInteger(
default=0,
description=ROBUST_LAMBDA_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SoftmaxCrossEntropyLoss"]["robust_lambda"],
)
confidence_penalty: float = schema_utils.NonNegativeFloat(
default=0,
description=CONFIDENCE_PENALTY_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SoftmaxCrossEntropyLoss"]["confidence_penalty"],
)
class_similarities: list = schema_utils.List(
list,
default=None,
description=CLASS_SIMILARITIES_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SoftmaxCrossEntropyLoss"]["class_similarities"],
)
class_similarities_temperature: int = schema_utils.NonNegativeInteger(
default=0,
description=CLASS_SIMILARITIES_TEMPERATURE_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SoftmaxCrossEntropyLoss"]["class_similarities_temperature"],
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["SoftmaxCrossEntropyLoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Softmax Cross Entropy"
@DeveloperAPI
@register_loss([SEQUENCE, TEXT])
@ludwig_dataclass
class SequenceSoftmaxCrossEntropyLossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
SEQUENCE_SOFTMAX_CROSS_ENTROPY,
description="Type of loss.",
)
class_weights: list[float] | dict | None = schema_utils.OneOfOptionsField(
default=None,
description=CLASS_WEIGHTS_DESCRIPTION,
field_options=[
schema_utils.Dict(default=None, allow_none=True),
schema_utils.List(list_type=float, allow_none=False),
],
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["class_weights"],
)
robust_lambda: int = schema_utils.NonNegativeInteger(
default=0,
description=ROBUST_LAMBDA_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["robust_lambda"],
)
confidence_penalty: float = schema_utils.NonNegativeFloat(
default=0,
description=CONFIDENCE_PENALTY_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["confidence_penalty"],
)
class_similarities: list = schema_utils.List(
list,
default=None,
description=CLASS_SIMILARITIES_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["class_similarities"],
)
class_similarities_temperature: int = schema_utils.NonNegativeInteger(
default=0,
description=CLASS_SIMILARITIES_TEMPERATURE_DESCRIPTION,
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["class_similarities_temperature"],
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["weight"],
)
unique: bool = schema_utils.Boolean(
default=False,
description="If true, the loss is only computed for unique elements in the sequence.",
parameter_metadata=LOSS_METADATA["SequenceSoftmaxCrossEntropyLoss"]["unique"],
)
@classmethod
def name(cls) -> str:
return "Sequence Softmax Cross Entropy"
@DeveloperAPI
@register_loss([SEQUENCE, TEXT])
@ludwig_dataclass
class NextTokenSoftmaxCrossEntropyLossConfig(SequenceSoftmaxCrossEntropyLossConfig):
type: str = schema_utils.ProtectedString(
NEXT_TOKEN_SOFTMAX_CROSS_ENTROPY,
description="Type of loss.",
)
@classmethod
def name(cls) -> str:
return "Next Token Softmax Cross Entropy"
@DeveloperAPI
@register_loss([SET])
@ludwig_dataclass
class SigmoidCrossEntropyLossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
SIGMOID_CROSS_ENTROPY,
description="Type of loss.",
)
class_weights: list[float] | dict | None = schema_utils.OneOfOptionsField(
default=None,
description=CLASS_WEIGHTS_DESCRIPTION,
field_options=[
schema_utils.Dict(default=None, allow_none=True),
schema_utils.List(list_type=float, allow_none=False),
],
parameter_metadata=LOSS_METADATA["SigmoidCrossEntropyLoss"]["class_weights"],
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["SigmoidCrossEntropyLoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Sigmoid Cross Entropy"
@DeveloperAPI
@register_loss([NUMBER, TIMESERIES, VECTOR])
@ludwig_dataclass
class HuberLossConfig(BaseLossConfig):
type: str = schema_utils.ProtectedString(
HUBER,
description=(
"Loss that combines advantages of both `mean_absolute_error` (MAE) and `mean_squared_error` (MSE). The "
"delta-scaled L1 region makes the loss less sensitive to outliers than MSE, while the L2 region provides "
"smoothness over MAE near 0. See [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) for more details."
),
)
delta: float = schema_utils.FloatRange(
default=1.0,
min=0,
min_inclusive=False,
description="Threshold at which to change between delta-scaled L1 and L2 loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["MSELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Huber Loss"
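# A minimal sketch of the per-element Huber loss that the `type` and `delta` descriptions
# above refer to, using the standard textbook definition. The function name is hypothetical
# and this is not Ludwig's implementation, which may delegate to a tensor library.

```python
def huber(error: float, delta: float = 1.0) -> float:
    # L2 region near zero, delta-scaled L1 region beyond the threshold.
    if delta <= 0:
        raise ValueError("delta must be strictly positive")
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error
    return delta * (abs_error - 0.5 * delta)
```

# The two branches agree at |error| == delta, so the loss is continuous there; beyond the
# threshold it grows linearly, which is what makes it less sensitive to outliers than MSE.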
@DeveloperAPI
@register_loss([CATEGORY])
@ludwig_dataclass
class CORNLossConfig(BaseLossConfig):
"""Conditional Ordinal Regression for Neural networks, used for ordered category values.
Source:
Xintong Shi, Wenzhi Cao, and Sebastian Raschka (2021).
Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities.
Arxiv preprint; https://arxiv.org/abs/2111.08851
"""
type: str = schema_utils.ProtectedString(
CORN,
description="Type of loss.",
)
weight: float = schema_utils.NonNegativeFloat(
default=1.0,
description="Weight of the loss.",
parameter_metadata=LOSS_METADATA["MSELoss"]["weight"],
)
@classmethod
def name(cls) -> str:
return "Conditional Ordinal Regression (CORN)"
@property
def class_weights(self) -> float:
return 1.0
@property
def class_similarities_temperature(self) -> int:
return 0
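# A minimal, self-contained sketch of the registration pattern used by `register_loss` in
# this module. Plain dicts stand in for Ludwig's `Registry`, and the `Toy*` classes and
# feature-name strings are hypothetical, for illustration only.

```python
from dataclasses import dataclass

# Global registry (loss type -> class) and per-feature registry (feature -> {type -> class}).
loss_registry: dict[str, type] = {}
loss_feature_registry: dict[str, dict[str, type]] = {}


def register_loss(features):
    # Accept a single feature name or a list of them.
    if isinstance(features, str):
        features = [features]

    def wrap(cls):
        loss_registry[cls.type] = cls
        for feature in features:
            loss_feature_registry.setdefault(feature, {})[cls.type] = cls
        return cls

    return wrap


@register_loss(["number", "vector"])
@dataclass
class ToyMSELoss:
    type: str = "mean_squared_error"
    weight: float = 1.0


@register_loss("number")
@dataclass
class ToyHuberLoss:
    type: str = "huber"
    delta: float = 1.0
    weight: float = 1.0


def get_loss_cls(feature: str, name: str) -> type:
    return loss_feature_registry[feature][name]
```

# Because the decorator reads `cls.type`, each config class needs a class-level default for
# `type`; here the dataclass default supplies it.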
================================================
FILE: ludwig/schema/features/loss/utils.py
================================================
from dataclasses import Field
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.loss import get_loss_classes, get_loss_cls
@DeveloperAPI
def get_loss_conds(feature_type: str):
"""Returns a JSON schema of conditionals to validate against loss types for specific feature types."""
conds = []
for loss in get_loss_classes(feature_type):
loss_cls = get_loss_cls(feature_type, loss)
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(loss_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
loss_cond = schema_utils.create_cond(
{"type": loss},
other_props,
)
conds.append(loss_cond)
return conds
@DeveloperAPI
def LossDataclassField(feature_type: str, default: str) -> Field:
loss_registry = get_loss_classes(feature_type)
class LossSelection(schema_utils.TypeSelection):
def __init__(self):
super().__init__(registry=loss_registry, default_value=default)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return get_loss_cls(feature_type, key)
def _jsonschema_type_mapping(self):
return {
"type": "object",
"properties": {
"type": {"type": "string", "enum": list(loss_registry.keys()), "default": default},
},
"title": "loss_options",
"allOf": get_loss_conds(feature_type),
}
return LossSelection().get_default_field()
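# A rough sketch of the shape of JSON schema that `_jsonschema_type_mapping` and
# `get_loss_conds` assemble: one `if`/`then` conditional per loss type, combined under
# `allOf`. The helper `make_cond` and the property dicts are hypothetical; the exact output
# of `schema_utils.create_cond` is an assumption here, not taken from this file.

```python
def make_cond(if_pred: dict, then_props: dict) -> dict:
    # "If these properties have these constant values, then validate these properties."
    return {
        "if": {"properties": {key: {"const": value} for key, value in if_pred.items()}},
        "then": {"properties": dict(then_props)},
    }


# Hypothetical per-loss property schemas, keyed by loss type.
loss_props = {
    "mean_squared_error": {"weight": {"type": "number", "minimum": 0}},
    "huber": {
        "weight": {"type": "number", "minimum": 0},
        "delta": {"type": "number", "exclusiveMinimum": 0},
    },
}

loss_schema = {
    "type": "object",
    "properties": {
        "type": {"type": "string", "enum": list(loss_props), "default": "mean_squared_error"},
    },
    "title": "loss_options",
    "allOf": [make_cond({"type": name}, props) for name, props in loss_props.items()],
}
```

# With this layout a validator only applies the `huber` property rules when `type` is
# `"huber"`, so each loss type can expose a different set of parameters.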
================================================
FILE: ludwig/schema/features/number_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MEAN_SQUARED_ERROR, MODEL_ECD, NUMBER
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(NUMBER)
@ludwig_dataclass
class NumberInputFeatureConfigMixin(BaseMarshmallowConfig):
"""NumberInputFeatureConfigMixin is a dataclass that configures the parameters used in both the number input
feature and the number global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=NUMBER)
@DeveloperAPI
@ludwig_dataclass
class NumberInputFeatureConfig(NumberInputFeatureConfigMixin, BaseInputFeatureConfig):
"""NumberInputFeatureConfig is a dataclass that configures the parameters used for a number input feature."""
type: str = schema_utils.ProtectedString(NUMBER)
encoder: BaseEncoderConfig = None
@DeveloperAPI
@ecd_input_config_registry.register(NUMBER)
@ludwig_dataclass
class ECDNumberInputFeatureConfig(NumberInputFeatureConfig):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=NUMBER,
default="passthrough",
)
@DeveloperAPI
@output_mixin_registry.register(NUMBER)
@ludwig_dataclass
class NumberOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""NumberOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the number output
feature and the number global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = None
loss: BaseLossConfig = LossDataclassField(
feature_type=NUMBER,
default=MEAN_SQUARED_ERROR,
)
@DeveloperAPI
@ludwig_dataclass
class NumberOutputFeatureConfig(NumberOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""NumberOutputFeatureConfig is a dataclass that configures the parameters used for a number output
feature."""
type: str = schema_utils.ProtectedString(NUMBER)
clip: list[int] | tuple[int] = schema_utils.FloatRangeTupleDataclassField(
n=2,
default=None,
allow_none=True,
min=0,
max=999999999,
description="Clip the predicted output to the specified range.",
parameter_metadata=FEATURE_METADATA[NUMBER]["clip"],
)
default_validation_metric: str = schema_utils.StringOptions(
[MEAN_SQUARED_ERROR],
default=MEAN_SQUARED_ERROR,
description="Internal only use parameter: default validation metric for number output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[NUMBER]["dependencies"],
)
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[NUMBER]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[NUMBER]["reduce_input"],
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="number_output")
@DeveloperAPI
@ecd_output_config_registry.register(NUMBER)
@ludwig_dataclass
class ECDNumberOutputFeatureConfig(NumberOutputFeatureConfig):
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=NUMBER,
default="regressor",
)
@DeveloperAPI
@ecd_defaults_config_registry.register(NUMBER)
@ludwig_dataclass
class NumberDefaultsConfig(NumberInputFeatureConfigMixin, NumberOutputFeatureConfigMixin):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=NUMBER,
default="passthrough",
)
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=NUMBER,
default="regressor",
)
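# A minimal sketch of what the `clip` parameter of the number output feature describes:
# restricting a prediction to a `[low, high]` range, with `None` meaning no clipping. The
# function name is hypothetical and this is not Ludwig's implementation.

```python
def clip_prediction(value: float, clip=None) -> float:
    # `clip` is either None or a pair (low, high).
    if clip is None:
        return value
    low, high = clip
    if low > high:
        raise ValueError("clip must be given as (low, high) with low <= high")
    return max(low, min(high, value))
```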
================================================
FILE: ludwig/schema/features/preprocessing/__init__.py
================================================
# Register all preprocessors
from ludwig.schema.features.preprocessing import ( # noqa
audio,
bag,
binary,
category,
date,
h3,
image,
number,
sequence,
set,
text,
timeseries,
vector,
)
================================================
FILE: ludwig/schema/features/preprocessing/audio.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUDIO, BFILL, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_preprocessor(AUDIO)
@ludwig_dataclass
class AudioPreprocessingConfig(BasePreprocessingConfig):
audio_file_length_limit_in_s: float = schema_utils.NonNegativeFloat(
default=7.5,
allow_none=False,
description="Float value that defines the maximum limit of the audio file in seconds. All files longer than "
"this limit are cut off. All files shorter than this limit are padded with padding_value",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["audio_file_length_limit_in_s"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=BFILL,
allow_none=False,
description="What strategy to follow when there's a missing value in an audio column",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["missing_value_strategy"],
)
fill_value: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["fill_value"],
)
computed_fill_value: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["computed_fill_value"],
)
in_memory: bool = schema_utils.Boolean(
default=True,
description="Defines whether the audio dataset will reside in memory during the training process or will be "
"dynamically fetched from disk (useful for large datasets). In the latter case a training batch "
"of input audio will be fetched from disk each training iteration.",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["in_memory"],
)
padding_value: float = schema_utils.NonNegativeFloat(
default=0.0,
allow_none=False,
description="Float value that is used for padding.",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["padding_value"],
)
norm: str = schema_utils.StringOptions(
["per_file"],
default=None,
allow_none=True,
description="Normalization strategy for the audio files. If None, no normalization is performed. If "
"per_file, z-norm is applied on a 'per file' level",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["norm"],
)
type: str = schema_utils.StringOptions(
["fbank", "group_delay", "raw", "stft", "stft_phase"],
default="fbank",
description="Defines the type of audio feature to be used.",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["type"],
)
window_length_in_s: float = schema_utils.NonNegativeFloat(
default=0.04,
description="Defines the window length used for the short time Fourier transformation. This is only needed if "
"the audio_feature_type is 'raw'.",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["window_length_in_s"],
)
window_shift_in_s: float = schema_utils.NonNegativeFloat(
default=0.02,
description="Defines the window shift used for the short time Fourier transformation (also called "
"hop_length). This is only needed if the audio_feature_type is 'raw'. ",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["window_shift_in_s"],
)
num_fft_points: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="Defines the number of fft points used for the short time Fourier transformation",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["num_fft_points"],
)
window_type: str = schema_utils.StringOptions(
["bartlett", "blackman", "hamming", "hann"],
default="hamming",
description="Defines the type of window the signal is weighted with before the short time Fourier transformation.",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["window_type"],
)
num_filter_bands: int = schema_utils.PositiveInteger(
default=80,
description="Defines the number of filters used in the filterbank. Only needed if audio_feature_type "
"is 'fbank'",
parameter_metadata=FEATURE_METADATA[AUDIO][PREPROCESSING]["num_filter_bands"],
)
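# A minimal sketch of how several of the audio preprocessing parameters above translate
# into sample-level operations: converting second-based lengths to sample counts, cutting
# or padding to `audio_file_length_limit_in_s`, and per-file z-normalization. The function
# names, the sampling rates used in the examples, and the choice of population standard
# deviation are assumptions; this is not Ludwig's implementation.

```python
import statistics


def seconds_to_samples(seconds: float, sampling_rate: int) -> int:
    # Number of samples covered by a duration given in seconds.
    return int(round(seconds * sampling_rate))


def pad_or_truncate(signal, limit_in_s: float, sampling_rate: int, padding_value: float = 0.0):
    # Cut off signals longer than the limit and pad shorter ones with padding_value.
    limit = seconds_to_samples(limit_in_s, sampling_rate)
    clipped = list(signal[:limit])
    return clipped + [padding_value] * (limit - len(clipped))


def z_norm(signal):
    # Per-file normalization: zero mean and unit variance across one file.
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    if std == 0:
        return [0.0 for _ in signal]
    return [(x - mean) / std for x in signal]
```

# With the defaults above and an assumed 16 kHz sampling rate, a 0.04 s window spans 640
# samples and a 0.02 s shift spans 320 samples.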
================================================
FILE: ludwig/schema/features/preprocessing/bag.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BAG, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils import strings_utils
from ludwig.utils.tokenizers import tokenizer_registry
@DeveloperAPI
@register_preprocessor(BAG)
@ludwig_dataclass
class BagPreprocessingConfig(BasePreprocessingConfig):
tokenizer: str = schema_utils.StringOptions(
tokenizer_registry.keys(),
default="space",
allow_none=False,
description="Defines how to transform the raw text content of the dataset column to a set of elements. The "
"default value space splits the string on spaces. Common options include: underscore (splits on "
"underscore), comma (splits on comma), json (decodes the string into a set or a list through a "
"JSON parser).",
parameter_metadata=FEATURE_METADATA[BAG][PREPROCESSING]["tokenizer"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a bag column",
parameter_metadata=FEATURE_METADATA[BAG][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[BAG][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[BAG][PREPROCESSING]["computed_fill_value"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[BAG][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=10000,
allow_none=True,
description="The maximum number of most common tokens to be considered. If the data contains more than this "
"amount, the most infrequent tokens will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[BAG][PREPROCESSING]["most_common"],
)
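# A minimal sketch of how the `tokenizer`, `lowercase` and `most_common` parameters above
# interact for a bag feature: split on spaces, keep only the most frequent tokens, and map
# everything else to an unknown symbol. The function names and the "<UNK>" literal are
# assumptions for illustration; this is not Ludwig's implementation.

```python
from collections import Counter

UNKNOWN = "<UNK>"  # assumed value of the unknown symbol


def build_vocab(column, most_common=10000, lowercase=False):
    # Count space-separated tokens over the whole column and keep the top `most_common`.
    counts = Counter()
    for row in column:
        if lowercase:
            row = row.lower()
        counts.update(row.split())
    return [token for token, _ in counts.most_common(most_common)]


def to_bag(row, vocab, lowercase=False):
    # A bag keeps token counts; tokens outside the vocabulary count as unknown.
    if lowercase:
        row = row.lower()
    known = set(vocab)
    return dict(Counter(tok if tok in known else UNKNOWN for tok in row.split()))
```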
================================================
FILE: ludwig/schema/features/preprocessing/base.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
@DeveloperAPI
class BasePreprocessingConfig(schema_utils.BaseMarshmallowConfig):
"""Base class for input feature preprocessing. Not meant to be used directly.
The dataclass format prevents arbitrary properties from being set. Consequently, in child classes, all properties
from the corresponding input feature class are copied over: check each class to see which attributes are different
from the preprocessing of each feature.
"""
================================================
FILE: ludwig/schema/features/preprocessing/binary.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
BFILL,
BINARY,
DROP_ROW,
FFILL,
FILL_WITH_FALSE,
FILL_WITH_MODE,
FILL_WITH_TRUE,
PREPROCESSING,
)
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils import strings_utils
@DeveloperAPI
@register_preprocessor(BINARY)
@ludwig_dataclass
class BinaryPreprocessingConfig(BasePreprocessingConfig):
"""BinaryPreprocessingConfig is a dataclass that configures the parameters used for a binary input feature."""
missing_value_strategy: str = schema_utils.StringOptions(
[FILL_WITH_MODE, BFILL, FFILL, DROP_ROW, FILL_WITH_FALSE, FILL_WITH_TRUE],
default=FILL_WITH_FALSE,
allow_none=False,
description="What strategy to follow when there's a missing value in a binary column",
parameter_metadata=FEATURE_METADATA[BINARY][PREPROCESSING]["missing_value_strategy"],
)
fallback_true_label: str = schema_utils.String(
default=None,
allow_none=True,
description="The label to interpret as 1 (True) when the binary feature doesn't have a "
"conventional boolean value",
parameter_metadata=FEATURE_METADATA[BINARY][PREPROCESSING]["fallback_true_label"],
)
fill_value: int | float | str = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
field_options=[
schema_utils.FloatRange(default=None, allow_none=True, min=0, max=1, description=""),
schema_utils.StringOptions(options=strings_utils.all_bool_strs(), default="Y", allow_none=False),
schema_utils.Boolean(default=True, description=""),
],
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[BINARY][PREPROCESSING]["fill_value"],
)
computed_fill_value: int | float | str = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
field_options=[
schema_utils.FloatRange(default=1.0, allow_none=False, min=0, max=1, description=""),
schema_utils.StringOptions(options=strings_utils.all_bool_strs(), default="Y", allow_none=False),
schema_utils.Boolean(default=True, description=""),
],
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[BINARY][PREPROCESSING]["computed_fill_value"],
)
@DeveloperAPI
@register_preprocessor("binary_output")
@ludwig_dataclass
class BinaryOutputPreprocessingConfig(BinaryPreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
[FILL_WITH_MODE, BFILL, FFILL, DROP_ROW, FILL_WITH_FALSE, FILL_WITH_TRUE],
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a binary output feature",
parameter_metadata=FEATURE_METADATA[BINARY][PREPROCESSING]["missing_value_strategy"],
)
fallback_true_label: str = schema_utils.String(
default=None,
allow_none=True,
description="The label to interpret as 1 (True) when the binary feature doesn't have a "
"conventional boolean value",
parameter_metadata=FEATURE_METADATA[BINARY][PREPROCESSING]["fallback_true_label"],
)
================================================
FILE: ludwig/schema/features/preprocessing/category.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import CATEGORY, DROP_ROW, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA, PREPROCESSING_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils import strings_utils
@DeveloperAPI
@register_preprocessor(CATEGORY)
@ludwig_dataclass
class CategoryPreprocessingConfig(BasePreprocessingConfig):
"""CategoryPreprocessingConfig is a dataclass that configures the parameters used for a category input
feature."""
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a category column",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description=(
"The value to replace missing values with in case the `missing_value_strategy` is `fill_with_const`"
),
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["computed_fill_value"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="Whether the string has to be lowercased before being handled by the tokenizer.",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=10000,
allow_none=True,
description="The maximum number of most common tokens to be considered. If the data contains more than this "
"amount, the most infrequent tokens will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["most_common"],
)
cache_encoder_embeddings: bool = schema_utils.Boolean(
default=False,
description=(
"For fixed encoders, compute encoder embeddings in preprocessing to avoid this step at train time. "
"Can speed up the time taken per step during training, but will invalidate the preprocessed data "
"if the encoder type is changed."
),
parameter_metadata=PREPROCESSING_METADATA["cache_encoder_embeddings"],
)
@DeveloperAPI
@register_preprocessor("category_output")
@ludwig_dataclass
class CategoryOutputPreprocessingConfig(CategoryPreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a category output feature",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["missing_value_strategy"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="Whether the string has to be lowercased before being handled by the tokenizer.",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=10000,
allow_none=True,
description="The maximum number of most common tokens to be considered. If the data contains more than this "
"amount, the most infrequent tokens will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["most_common"],
)
@DeveloperAPI
@register_preprocessor("category_distribution_output")
@ludwig_dataclass
class CategoryDistributionOutputPreprocessingConfig(BasePreprocessingConfig):
def __post_init__(self):
if self.vocab is None:
raise ConfigValidationError("`vocab` must be specified for `category_distribution` output feature.")
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a category output feature",
parameter_metadata=FEATURE_METADATA[CATEGORY][PREPROCESSING]["missing_value_strategy"],
)
vocab: list[str] = schema_utils.List(default=None)
@DeveloperAPI
@register_preprocessor("category_llm")
@ludwig_dataclass
class LLMCategoryOutputPreprocessingConfig(CategoryOutputPreprocessingConfig):
def __post_init__(self):
if self.vocab is None:
raise ConfigValidationError("`vocab` must be specified for `category_llm` output feature.")
if self.fallback_label is None:
raise ConfigValidationError("`fallback_label` must be specified for `category_llm` output feature.")
vocab: list[str] = schema_utils.List(
default=None,
allow_none=False,
description="The list of labels that the model can predict.",
)
fallback_label: str = schema_utils.String(
default="",
allow_none=False,
description="The label to use when the model's output doesn't match any of the labels in the `vocab` list.",
)
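# Illustrative sketch only, not part of Ludwig: the `most_common` descriptions above say that
# tokens beyond the N most frequent are treated as unknown. The helper names `build_vocab` and
# `encode`, and the "<UNK>" placeholder, are assumptions made for this example. Whether the
# unknown symbol counts toward the `most_common` limit is also an assumption (here it does not).

```python
from collections import Counter


def build_vocab(values, most_common=10000, unknown_symbol="<UNK>"):
    """Keep the `most_common` most frequent values; reserve a slot for unknowns."""
    counts = Counter(values)
    # Counter.most_common(None) returns every element, mirroring allow_none=True.
    kept = [token for token, _ in counts.most_common(most_common)]
    return [unknown_symbol] + kept


def encode(value, vocab, unknown_symbol="<UNK>"):
    """Map a value outside the vocabulary to the unknown symbol."""
    return value if value in vocab else unknown_symbol


values = ["a", "a", "a", "b", "b", "c"]
vocab = build_vocab(values, most_common=2)
encoded = [encode(v, vocab) for v in values]
```

# With `most_common=2`, the least frequent value ("c") falls outside the vocabulary and is
# encoded as the unknown symbol.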
================================================
FILE: ludwig/schema/features/preprocessing/date.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BFILL, DATE, DROP_ROW, FFILL, FILL_WITH_CONST, PREPROCESSING
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_preprocessor(DATE)
@ludwig_dataclass
class DatePreprocessingConfig(BasePreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
[FILL_WITH_CONST, BFILL, FFILL, DROP_ROW],
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a date column",
parameter_metadata=FEATURE_METADATA[DATE][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default="",
allow_none=False,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[DATE][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default="",
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[DATE][PREPROCESSING]["computed_fill_value"],
)
datetime_format: str = schema_utils.String(
default=None,
allow_none=True,
description="This parameter can either be a datetime format string, or null, in which case the datetime "
"format will be inferred automatically.",
parameter_metadata=FEATURE_METADATA[DATE][PREPROCESSING]["datetime_format"],
)
================================================
FILE: ludwig/schema/features/preprocessing/h3.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import FILL_WITH_CONST, H3, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_preprocessor(H3)
@ludwig_dataclass
class H3PreprocessingConfig(BasePreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in an h3 column",
parameter_metadata=FEATURE_METADATA[H3][PREPROCESSING]["missing_value_strategy"],
)
fill_value: int = schema_utils.PositiveInteger(
default=576495936675512319,
allow_none=False,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[H3][PREPROCESSING]["fill_value"],
)
computed_fill_value: int = schema_utils.PositiveInteger(
default=576495936675512319,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[H3][PREPROCESSING]["computed_fill_value"],
)
================================================
FILE: ludwig/schema/features/preprocessing/image.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BFILL, DROP_ROW, IMAGE, IMAGENET1K, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_preprocessor(IMAGE)
@ludwig_dataclass
class ImagePreprocessingConfig(BasePreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=BFILL,
allow_none=False,
description="What strategy to follow when there's a missing value in an image column",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["missing_value_strategy"],
)
fill_value: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["fill_value"],
)
computed_fill_value: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["computed_fill_value"],
)
height: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The image height in pixels. If this parameter is set, images will be resized to the specified "
"height using the resize_method parameter. If None, images will be resized to the size of the "
"first image in the dataset.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["height"],
)
width: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The image width in pixels. If this parameter is set, images will be resized to the specified "
"width using the resize_method parameter. If None, images will be resized to the size of the "
"first image in the dataset.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["width"],
)
num_channels: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of channels in the images. If specified, images will be read in the mode specified by the "
"number of channels. If not specified, the number of channels will be inferred from the image "
"format of the first valid image in the dataset.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["num_channels"],
)
resize_method: str = schema_utils.StringOptions(
["crop_or_pad", "interpolate"],
default="interpolate",
allow_none=False,
description="The method to use for resizing images.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["resize_method"],
)
infer_image_num_channels: bool = schema_utils.Boolean(
default=True,
description="If true, then the number of channels in the dataset is inferred from a sample of the first image "
"in the dataset.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["infer_image_num_channels"],
)
infer_image_dimensions: bool = schema_utils.Boolean(
default=True,
description="If true, then the height and width of images in the dataset will be inferred from a sample of "
"the first image in the dataset. Each image that doesn't conform to these dimensions will be "
"resized according to resize_method. If set to false, then the height and width of images in the "
"dataset will be specified by the user.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["infer_image_dimensions"],
)
infer_image_max_height: int = schema_utils.PositiveInteger(
default=256,
allow_none=False,
description="If infer_image_dimensions is set, this is used as the maximum height of the images in "
"the dataset.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["infer_image_max_height"],
)
infer_image_max_width: int = schema_utils.PositiveInteger(
default=256,
allow_none=False,
description="If infer_image_dimensions is set, this is used as the maximum width of the images in "
"the dataset.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["infer_image_max_width"],
)
infer_image_sample_size: int = schema_utils.PositiveInteger(
default=100,
allow_none=False,
description="The sample size used for inferring dimensions of images in infer_image_dimensions.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["infer_image_sample_size"],
)
standardize_image: str | None = schema_utils.StringOptions(
[IMAGENET1K],
default=None,
allow_none=True,
description="Standardize image by per channel mean centering and standard deviation scaling.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["standardize_image"],
)
in_memory: bool = schema_utils.Boolean(
default=True,
description="Defines whether image dataset will reside in memory during the training process or will be "
"dynamically fetched from disk (useful for large datasets). In the latter case a training batch "
"of input images will be fetched from disk each training iteration.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["in_memory"],
)
num_processes: int = schema_utils.PositiveInteger(
default=1,
allow_none=False,
description="Specifies the number of processes to run for preprocessing images.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["num_processes"],
)
requires_equal_dimensions: bool = schema_utils.Boolean(
default=False,
description="If true, then width and height must be equal.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["requires_equal_dimensions"],
)
num_classes: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of channel classes in the images. If specified, this value will be validated "
"against the inferred number of classes. Use 2 to convert grayscale images to binary images.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["num_classes"],
)
infer_image_num_classes: bool = schema_utils.Boolean(
default=False,
description="If true, then the number of channel classes in the dataset will be inferred from a sample of "
"the first image in the dataset. Each unique channel value will be mapped to a class and preprocessing will "
"create a masked image based on the channel classes.",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["infer_image_num_classes"],
)
@DeveloperAPI
@register_preprocessor("image_output")
@ludwig_dataclass
class ImageOutputPreprocessingConfig(ImagePreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in an image column",
parameter_metadata=FEATURE_METADATA[IMAGE][PREPROCESSING]["missing_value_strategy"],
)
================================================
FILE: ludwig/schema/features/preprocessing/number.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
DROP_ROW,
FILL_WITH_CONST,
FILL_WITH_MEAN,
MISSING_VALUE_STRATEGY_OPTIONS,
NUMBER,
PREPROCESSING,
)
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_preprocessor(NUMBER)
@ludwig_dataclass
class NumberPreprocessingConfig(BasePreprocessingConfig):
"""NumberPreprocessingConfig is a dataclass that configures the parameters used for a number input feature."""
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS + [FILL_WITH_MEAN],
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a number column",
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["missing_value_strategy"],
)
fill_value: float = schema_utils.FloatRange(
default=0.0,
allow_none=False,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["fill_value"],
)
computed_fill_value: float = schema_utils.FloatRange(
default=0.0,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["computed_fill_value"],
)
normalization: str = schema_utils.StringOptions(
["zscore", "minmax", "log1p", "iq"],
default="zscore",
allow_none=True,
description=(
"Normalization strategy to use for this number feature. If the value is `null` no normalization is "
"performed."
),
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["normalization"],
)
outlier_strategy: str | None = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS + [FILL_WITH_MEAN, None],
default=None,
allow_none=True,
description=(
"Determines how outliers will be handled in the dataset. In most cases, replacing outliers with the "
"column mean (`fill_with_mean`) will be sufficient, but in others the outliers may be damaging enough "
"to merit dropping the entire row of data (`drop_row`). In some cases, the best way to handle outliers "
"is to leave them in the data, which is the behavior when this parameter is left as `null`."
),
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["outlier_strategy"],
)
outlier_threshold: float | None = schema_utils.FloatRange(
default=3.0,
allow_none=False,
min=0.0,
description=(
"Standard deviations from the mean past which a value is considered an outlier. The 3-sigma "
"rule in statistics tells us that when data is normally distributed, 95% of the data will lie within 2 "
"standard deviations of the mean, and greater than 99% of the data will lie within 3 standard deviations "
"of the mean (see: [68–95–99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)). "
"As such anything farther away than that is highly likely to be an outlier, and may distort the learning "
"process by disproportionately affecting the model."
),
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["outlier_threshold"],
)
computed_outlier_fill_value: float = schema_utils.FloatRange(
default=0.0,
allow_none=False,
description="The internally computed fill value to replace outliers with in case the "
"outlier_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["computed_outlier_fill_value"],
)
@DeveloperAPI
@register_preprocessor("number_output")
@ludwig_dataclass
class NumberOutputPreprocessingConfig(NumberPreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS + [FILL_WITH_MEAN],
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a number output feature",
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["missing_value_strategy"],
)
normalization: str = schema_utils.StringOptions(
["zscore", "minmax", "log1p", "iq"],
default=None,
allow_none=True,
description="Normalization strategy to use for this number feature.",
parameter_metadata=FEATURE_METADATA[NUMBER][PREPROCESSING]["normalization"],
)
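# Illustrative sketch only, not Ludwig's implementation: the `outlier_threshold` description
# above defines an outlier as a value more than N standard deviations from the mean, and
# `outlier_strategy` mentions replacing outliers with the column mean. The helper name
# `replace_outliers_with_mean` is hypothetical, and computing the mean over all values
# (outliers included) with a population standard deviation is an assumption of this example.

```python
import statistics


def replace_outliers_with_mean(values, outlier_threshold=3.0):
    """Replace values whose z-score magnitude exceeds `outlier_threshold` with the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # No spread in the data, so nothing can be an outlier.
        return list(values)
    return [
        mean if abs(v - mean) / std > outlier_threshold else v
        for v in values
    ]


column = [10.0] * 20 + [1000.0]
cleaned = replace_outliers_with_mean(column, outlier_threshold=3.0)
```

# In this column the single large value sits about 4.5 standard deviations from the mean,
# so it is replaced, while the remaining values are left untouched.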
================================================
FILE: ludwig/schema/features/preprocessing/sequence.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DROP_ROW, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING, SEQUENCE
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA, PREPROCESSING_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils import strings_utils
@DeveloperAPI
@register_preprocessor(SEQUENCE)
@ludwig_dataclass
class SequencePreprocessingConfig(BasePreprocessingConfig):
tokenizer: str = schema_utils.String(
default="space",
allow_none=False,
description="Defines how to map from the raw string content of the dataset column to a sequence of elements.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["tokenizer"],
)
vocab_file: str = schema_utils.String(
default=None,
allow_none=True,
description="Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line the "
"first string until \\t or \\n is considered a word.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["vocab_file"],
)
sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The desired length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated and sequences shorter than this value will be padded. If None, sequence length will be "
"inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["sequence_length"],
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=256,
allow_none=True,
description="The maximum length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated. Useful as a stopgap measure if `sequence_length` is set to `None`. If `None`, max sequence "
"length will be inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["max_sequence_length"],
)
most_common: int = schema_utils.PositiveInteger(
default=20000,
allow_none=False,
description="The maximum number of most common tokens in the vocabulary. If the data contains more than this "
"amount, the most infrequent symbols will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["most_common"],
)
padding_symbol: str = schema_utils.String(
default=strings_utils.PADDING_SYMBOL,
allow_none=False,
description="The string used as a padding symbol. This special token is mapped to the integer ID 0 in the "
"vocabulary.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["padding_symbol"],
)
unknown_symbol: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The string used as an unknown placeholder. This special token is mapped to the integer ID 1 in "
"the vocabulary.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["unknown_symbol"],
)
padding: str = schema_utils.StringOptions(
["left", "right"],
default="right",
allow_none=False,
description="The direction of the padding.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["padding"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["lowercase"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a sequence column",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["computed_fill_value"],
)
ngram_size: int = schema_utils.PositiveInteger(
default=2,
allow_none=False,
description="The size of the ngram when using the `ngram` tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["ngram_size"],
)
cache_encoder_embeddings: bool = schema_utils.Boolean(
default=False,
description="Compute encoder embeddings in preprocessing, speeding up training time considerably.",
parameter_metadata=PREPROCESSING_METADATA["cache_encoder_embeddings"],
)
@DeveloperAPI
@register_preprocessor("sequence_output")
@ludwig_dataclass
class SequenceOutputPreprocessingConfig(SequencePreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a sequence output feature",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["missing_value_strategy"],
)
sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The desired length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated and sequences shorter than this value will be padded. If None, sequence length will be "
"inferred from the training dataset.",
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=256,
allow_none=True,
description="The maximum length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated. Useful as a stopgap measure if `sequence_length` is set to `None`. If `None`, max sequence "
"length will be inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["max_sequence_length"],
)
tokenizer: str = schema_utils.String(
default="space",
allow_none=False,
description="Defines how to map from the raw string content of the dataset column to a sequence of elements.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["tokenizer"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=20000,
allow_none=False,
description="The maximum number of most common tokens in the vocabulary. If the data contains more than this "
"amount, the most infrequent symbols will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["most_common"],
)
ngram_size: int = schema_utils.PositiveInteger(
default=2,
allow_none=False,
description="The size of the ngram when using the `ngram` tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).",
parameter_metadata=FEATURE_METADATA[SEQUENCE][PREPROCESSING]["ngram_size"],
)
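# Illustrative sketch only, not Ludwig's implementation: the `sequence_length`,
# `max_sequence_length`, and `padding` descriptions above talk about truncating long sequences
# and padding short ones. The helper name `fit_to_length` and the "<PAD>" placeholder are
# hypothetical. Two further assumptions: `sequence_length` takes precedence when both limits
# are set, and truncation keeps the leading tokens.

```python
def fit_to_length(
    tokens,
    sequence_length=None,
    max_sequence_length=256,
    padding="right",
    padding_symbol="<PAD>",
):
    """Truncate and pad a token list according to the length settings."""
    tokens = list(tokens)
    if sequence_length is None:
        # Only the upper bound applies; padding to an inferred length is not shown here.
        if max_sequence_length is not None:
            tokens = tokens[:max_sequence_length]
        return tokens
    tokens = tokens[:sequence_length]
    pad = [padding_symbol] * (sequence_length - len(tokens))
    return tokens + pad if padding == "right" else pad + tokens


right_padded = fit_to_length(["a", "b", "c"], sequence_length=5)
left_padded = fit_to_length(["a", "b", "c"], sequence_length=5, padding="left")
truncated = fit_to_length(["a", "b", "c"], max_sequence_length=2)
```

# `right_padded` appends padding symbols, `left_padded` prepends them, and `truncated` only
# applies the upper bound because no fixed `sequence_length` was given.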
================================================
FILE: ludwig/schema/features/preprocessing/set.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DROP_ROW, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING, SET
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils import strings_utils
@DeveloperAPI
@register_preprocessor(SET)
@ludwig_dataclass
class SetPreprocessingConfig(BasePreprocessingConfig):
tokenizer: str = schema_utils.String(
default="space",
allow_none=False,
description="Defines how to transform the raw text content of the dataset column to a set of elements. The "
"default value space splits the string on spaces. Common options include: underscore (splits on "
"underscore), comma (splits on comma), json (decodes the string into a set or a list through a "
"JSON parser).",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["tokenizer"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a set column",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["computed_fill_value"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=10000,
allow_none=True,
description="The maximum number of most common tokens to be considered. If the data contains more than this "
"amount, the most infrequent tokens will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["most_common"],
)
@DeveloperAPI
@register_preprocessor("set_output")
@ludwig_dataclass
class SetOutputPreprocessingConfig(SetPreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a set output feature",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["missing_value_strategy"],
)
tokenizer: str = schema_utils.String(
default="space",
allow_none=False,
description="Defines how to transform the raw text content of the dataset column to a set of elements. The "
"default value space splits the string on spaces. Common options include: underscore (splits on "
"underscore), comma (splits on comma), json (decodes the string into a set or a list through a "
"JSON parser).",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["tokenizer"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=10000,
allow_none=True,
description="The maximum number of most common tokens to be considered. If the data contains more than this "
"amount, the most infrequent tokens will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[SET][PREPROCESSING]["most_common"],
)
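# Illustrative sketch (not part of Ludwig): how the `tokenizer` and `lowercase` options described
# above could turn a raw cell into a set of elements. The function name `tokenize_set` and its exact
# splitting rules are assumptions made for this example; the real tokenizers live in
# `ludwig.utils.tokenizers` and may behave differently (e.g. in how they treat whitespace).

```python
import json


def tokenize_set(raw: str, tokenizer: str = "space", lowercase: bool = False) -> set:
    """Hypothetical helper mirroring the tokenizer names from the description."""
    if lowercase:
        # `lowercase` is applied before tokenizing, as the description states.
        raw = raw.lower()
    if tokenizer == "space":
        tokens = raw.split()
    elif tokenizer == "underscore":
        tokens = [t for t in raw.split("_") if t]
    elif tokenizer == "comma":
        # Stripping whitespace around commas is an assumption of this sketch.
        tokens = [t.strip() for t in raw.split(",") if t.strip()]
    elif tokenizer == "json":
        tokens = json.loads(raw)
    else:
        raise ValueError(f"unknown tokenizer: {tokenizer}")
    return set(tokens)


print(tokenize_set("red green red"))
print(tokenize_set("Red, Green", tokenizer="comma", lowercase=True))
```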
================================================
FILE: ludwig/schema/features/preprocessing/text.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DROP_ROW, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING, TEXT
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.llms.prompt import PromptConfig, PromptConfigField
from ludwig.schema.metadata import FEATURE_METADATA, PREPROCESSING_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils import strings_utils
from ludwig.utils.tokenizers import tokenizer_registry
@DeveloperAPI
@ludwig_dataclass
class BaseTextPreprocessingConfig(BasePreprocessingConfig):
"""BaseTextPreprocessingConfig is a dataclass that configures the preprocessing parameters shared by text input
and output features."""
pretrained_model_name_or_path: str = schema_utils.String(
default=None,
allow_none=True,
description="This can be either the name of a pretrained HuggingFace model or a path where it was downloaded.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["pretrained_model_name_or_path"],
)
tokenizer: str = schema_utils.StringOptions(
tokenizer_registry.keys(),
default="space_punct",
allow_none=False,
description="Defines how to map from the raw string content of the dataset column to a sequence of elements.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["tokenizer"],
)
vocab_file: str = schema_utils.String(
default=None,
allow_none=True,
description="Filepath string to a UTF-8 encoded file containing the sequence's vocabulary. On each line the "
"first string until `\\t` or `\\n` is considered a word.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["vocab_file"],
)
sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The desired length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated and sequences shorter than this value will be padded. If None, sequence length will be "
"inferred from the training dataset.",
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=256,
allow_none=True,
description="The maximum length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated. Useful as a stopgap measure if `sequence_length` is set to `None`. If `None`, max sequence "
"length will be inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["max_sequence_length"],
)
most_common: int = schema_utils.PositiveInteger(
default=20000,
allow_none=False,
description="The maximum number of most common tokens in the vocabulary. If the data contains more than this "
"amount, the most infrequent symbols will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["most_common"],
)
padding_symbol: str = schema_utils.String(
default=strings_utils.PADDING_SYMBOL,
allow_none=False,
description="The string used as the padding symbol for sequence features. Ignored for features using "
"huggingface encoders, which have their own vocabulary.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["padding_symbol"],
)
unknown_symbol: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The string used as the unknown symbol for sequence features. Ignored for features using "
"huggingface encoders, which have their own vocabulary.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["unknown_symbol"],
)
padding: str = schema_utils.StringOptions(
["left", "right"],
default="right",
allow_none=False,
description="The direction of the padding.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["padding"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["lowercase"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a text column.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description=(
"The value to replace missing values with in case the `missing_value_strategy` is `fill_with_const`."
),
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default=strings_utils.UNKNOWN_SYMBOL,
allow_none=False,
description="The internally computed fill value to replace missing values with in case the "
"`missing_value_strategy` is `fill_with_mode` or `fill_with_mean`.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["computed_fill_value"],
)
ngram_size: int = schema_utils.PositiveInteger(
default=2,
allow_none=False,
description="The size of the ngram when using the `ngram` tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["ngram_size"],
)
cache_encoder_embeddings: bool = schema_utils.Boolean(
default=False,
description=(
"For pretrained encoders, compute encoder embeddings in preprocessing, "
"speeding up training time considerably. Only supported when `encoder.trainable=false`."
),
parameter_metadata=PREPROCESSING_METADATA["cache_encoder_embeddings"],
)
compute_idf: bool = schema_utils.Boolean(
default=False,
parameter_metadata=INTERNAL_ONLY,
)
@DeveloperAPI
@register_preprocessor(TEXT)
@ludwig_dataclass
class TextPreprocessingConfig(BaseTextPreprocessingConfig):
"""TextPreprocessingConfig is a dataclass that configures the parameters used for a text input feature."""
prompt: PromptConfig = PromptConfigField().get_default_field()
@DeveloperAPI
@register_preprocessor("text_llm_input")
@ludwig_dataclass
class LLMTextInputPreprocessingConfig(BaseTextPreprocessingConfig):
"""LLMs require the prompt to be provided at the top-level, not preprocessing."""
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The maximum length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated. Useful as a stopgap measure if `sequence_length` is set to `None`. If `None`, max sequence "
"length will be inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["max_sequence_length"],
)
@DeveloperAPI
@register_preprocessor("text_output")
@ludwig_dataclass
class TextOutputPreprocessingConfig(BaseTextPreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a text output feature.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["missing_value_strategy"],
)
sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The desired length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated and sequences shorter than this value will be padded. If None, sequence length will be "
"inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["sequence_length"],
)
max_sequence_length: int = schema_utils.PositiveInteger(
default=256,
allow_none=True,
description="The maximum length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated. Useful as a stopgap measure if `sequence_length` is set to `None`. If `None`, max sequence "
"length will be inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["max_sequence_length"],
)
tokenizer: str = schema_utils.StringOptions(
tokenizer_registry.keys(),
default="space_punct",
allow_none=False,
description="Defines how to map from the raw string content of the dataset column to a sequence of elements.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["tokenizer"],
)
lowercase: bool = schema_utils.Boolean(
default=False,
description="If true, converts the string to lowercase before tokenizing.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["lowercase"],
)
most_common: int = schema_utils.PositiveInteger(
default=20000,
allow_none=False,
description="The maximum number of most common tokens in the vocabulary. If the data contains more than this "
"amount, the most infrequent symbols will be treated as unknown.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["most_common"],
)
ngram_size: int = schema_utils.PositiveInteger(
default=2,
allow_none=False,
description="The size of the ngram when using the `ngram` tokenizer (e.g., 2 = bigram, 3 = trigram, etc.).",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["ngram_size"],
)
@DeveloperAPI
@register_preprocessor("text_llm_output")
@ludwig_dataclass
class LLMTextOutputPreprocessingConfig(TextOutputPreprocessingConfig):
max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The maximum length (number of tokens) of the sequence. Sequences that are longer than this value "
"will be truncated. Useful as a stopgap measure if `sequence_length` is set to `None`. If `None`, max sequence "
"length will be inferred from the training dataset.",
parameter_metadata=FEATURE_METADATA[TEXT][PREPROCESSING]["max_sequence_length"],
)
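# Illustrative sketch (not part of Ludwig): what `most_common`, `sequence_length` and `padding`
# mean for a tokenized text column. The helper names, the literal symbols "<UNK>" and "<PAD>"
# (standing in for `strings_utils.UNKNOWN_SYMBOL` and `strings_utils.PADDING_SYMBOL`), and the
# choice to truncate from the end are all assumptions of this example.

```python
from collections import Counter

UNKNOWN_SYMBOL = "<UNK>"  # assumption: placeholder for strings_utils.UNKNOWN_SYMBOL
PADDING_SYMBOL = "<PAD>"  # assumption: placeholder for strings_utils.PADDING_SYMBOL


def build_vocab(tokenized_rows, most_common):
    """Keep only the `most_common` most frequent tokens."""
    counts = Counter(tok for row in tokenized_rows for tok in row)
    return {tok for tok, _ in counts.most_common(most_common)}


def apply_vocab(row, vocab):
    """Replace tokens outside the vocabulary with the unknown symbol."""
    return [tok if tok in vocab else UNKNOWN_SYMBOL for tok in row]


def pad_or_truncate(row, sequence_length, padding="right"):
    """Truncate long rows and pad short ones to exactly `sequence_length` tokens."""
    row = row[:sequence_length]
    pad = [PADDING_SYMBOL] * (sequence_length - len(row))
    return row + pad if padding == "right" else pad + row


rows = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
vocab = build_vocab(rows, most_common=2)
print([pad_or_truncate(apply_vocab(r, vocab), sequence_length=3) for r in rows])
```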
================================================
FILE: ludwig/schema/features/preprocessing/timeseries.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DROP_ROW, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING, TIMESERIES
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.tokenizers import tokenizer_registry
@ludwig_dataclass
class BaseTimeseriesPreprocessingConfig(BasePreprocessingConfig):
tokenizer: str = schema_utils.StringOptions(
tokenizer_registry.keys(),
default="space",
allow_none=False,
description="Defines how to map from the raw string content of the dataset column to a sequence of elements.",
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["tokenizer"],
)
timeseries_length_limit: int = schema_utils.PositiveInteger(
default=256,
allow_none=False,
description="Defines the maximum length of the timeseries. All timeseries longer than this limit are cut off.",
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["timeseries_length_limit"],
)
padding_value: float = schema_utils.NonNegativeFloat(
default=0.0,
allow_none=False,
description="Float value that is used for padding and replacing missing values within a row.",
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["padding_value"],
)
padding: str = schema_utils.StringOptions(
["left", "right"],
default="right",
allow_none=False,
description="The direction of the padding.",
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["padding"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description=(
"What strategy to follow when there's a missing value in a column. Currently applies only to a row missing "
"in its entirety, not individual elements within the row. For now, `NaN` values within a row are filled "
"using the `padding_value`."
),
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default="",
allow_none=False,
description=(
"The value to replace missing values with in case the `missing_value_strategy` is `fill_with_const`."
),
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default="",
allow_none=False,
description=(
"The internally computed fill value to replace missing values with in case the "
"`missing_value_strategy` is `fill_with_mode` or `fill_with_mean`."
),
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["computed_fill_value"],
)
@DeveloperAPI
@register_preprocessor(TIMESERIES)
@ludwig_dataclass
class TimeseriesPreprocessingConfig(BaseTimeseriesPreprocessingConfig):
window_size: int = schema_utils.NonNegativeInteger(
default=0,
allow_none=False,
description=(
"Optional lookback window size used to convert a column-major dataset (one observation per row) "
"into a row-major dataset (each row has a timeseries window of observations). Starting from a given "
"observation, a sliding window is taken going `window_size - 1` rows back to form the timeseries input "
"feature. If this value is left as 0, then it is assumed that the dataset has been provided in row-major "
"format (i.e., it has already been preprocessed such that each row is a timeseries window)."
),
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when a row of data is missing.",
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["missing_value_strategy"],
)
@DeveloperAPI
@register_preprocessor("timeseries_output")
@ludwig_dataclass
class TimeseriesOutputPreprocessingConfig(BaseTimeseriesPreprocessingConfig):
horizon: int = schema_utils.NonNegativeInteger(
default=0,
allow_none=False,
description=(
"Optional forecasting horizon used to convert a column-major dataset (one observation per row) "
"into a row-major dataset (each row has a timeseries window of observations). Starting from a given "
"observation, a sliding window is taken going `horizon` rows forward in time, excluding the observation "
"in the current row. If this value is left as 0, then it is assumed that the dataset has been provided in "
"row-major format (i.e., it has already been preprocessed such that each row is a timeseries window)."
),
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when a row of data is missing.",
parameter_metadata=FEATURE_METADATA[TIMESERIES][PREPROCESSING]["missing_value_strategy"],
)
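# Illustrative sketch (not part of Ludwig): the column-major to row-major conversion that the
# `window_size` and `horizon` descriptions above refer to. The function names are hypothetical,
# and filling incomplete windows with `padding_value` (left for lookback, right for horizon) is
# an assumption of this example rather than confirmed library behavior.

```python
def lookback_windows(values, window_size, padding_value=0.0):
    """For each row i, collect rows i - (window_size - 1) .. i (inclusive)."""
    windows = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        window = list(values[start : i + 1])
        # Early rows have too little history; left-pad them (assumption).
        windows.append([padding_value] * (window_size - len(window)) + window)
    return windows


def horizon_windows(values, horizon, padding_value=0.0):
    """For each row i, collect rows i + 1 .. i + horizon, excluding row i itself."""
    windows = []
    for i in range(len(values)):
        window = list(values[i + 1 : i + 1 + horizon])
        # Late rows have too little future; right-pad them (assumption).
        windows.append(window + [padding_value] * (horizon - len(window)))
    return windows


series = [1.0, 2.0, 3.0, 4.0]
print(lookback_windows(series, window_size=3))
print(horizon_windows(series, horizon=2))
```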
================================================
FILE: ludwig/schema/features/preprocessing/utils.py
================================================
from dataclasses import field
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.utils.registry import Registry
preprocessing_registry = Registry()
@DeveloperAPI
def register_preprocessor(name: str):
def wrap(preprocessing_config: BasePreprocessingConfig):
preprocessing_registry[name] = preprocessing_config
return preprocessing_config
return wrap
@DeveloperAPI
def PreprocessingDataclassField(feature_type: str):
"""Custom dataclass field that when used inside a dataclass will allow the user to specify a preprocessing
config.
Returns: Initialized dataclass field that converts an untyped dict with params to a preprocessing config.
"""
class PreprocessingMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field that deserializes a dict for a valid preprocessing config from the preprocessing_registry
and creates a corresponding JSON schema for external usage."""
def _deserialize(self, value, attr, data, **kwargs):
if value is None:
return None
if isinstance(value, dict):
if feature_type in preprocessing_registry:
pre = preprocessing_registry[feature_type]
try:
return pre.Schema().load(value)
except (TypeError, ConfigValidationError) as error:
raise ConfigValidationError(
f"Invalid preprocessing params: {value}, see `{pre}` definition. Error: {error}"
) from error
raise ConfigValidationError(
f"Invalid params for preprocessor: {value}, expect dict with at least a valid `type` attribute."
)
raise ConfigValidationError("Field should be None or dict")
def _jsonschema_type_mapping(self):
preprocessor_cls = preprocessing_registry[feature_type]
props = schema_utils.unload_jsonschema_from_marshmallow_class(preprocessor_cls)["properties"]
return {
"type": "object",
"properties": props,
"title": "preprocessing_options",
"additionalProperties": True,
}
try:
preprocessor = preprocessing_registry[feature_type]
load_default = lambda: preprocessor.Schema().load({})
dump_default = preprocessor.Schema().dump({})
return field(
metadata={
"marshmallow_field": PreprocessingMarshmallowField(
allow_none=False,
dump_default=dump_default,
load_default=load_default,
)
},
default_factory=load_default,
)
except Exception as e:
raise ConfigValidationError(
f"Unsupported preprocessing type: {feature_type}. See preprocessing_registry. Details: {e}"
) from e
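# Illustrative sketch (not part of Ludwig): the decorator-based registry pattern used by
# `register_preprocessor`, reduced to plain Python. A dict stands in for
# `ludwig.utils.registry.Registry`, and `ToyPreprocessingConfig` and `load_preprocessing` are
# hypothetical names; the real field deserializes through a marshmallow schema instead.

```python
preprocessing_registry = {}  # plain dict standing in for Registry()


def register_preprocessor(name):
    """Return a class decorator that stores the class under `name`."""

    def wrap(cls):
        preprocessing_registry[name] = cls
        return cls  # hand the class back unchanged so it stays usable

    return wrap


@register_preprocessor("toy")
class ToyPreprocessingConfig:
    def __init__(self, lowercase=False):
        self.lowercase = lowercase


def load_preprocessing(feature_type, params):
    """Look up the config class by feature type and build it from a dict."""
    if not isinstance(params, dict):
        raise ValueError("Field should be None or dict")
    if feature_type not in preprocessing_registry:
        raise ValueError(f"Unsupported preprocessing type: {feature_type}")
    return preprocessing_registry[feature_type](**params)


config = load_preprocessing("toy", {"lowercase": True})
print(type(config).__name__, config.lowercase)
```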
================================================
FILE: ludwig/schema/features/preprocessing/vector.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DROP_ROW, FILL_WITH_CONST, MISSING_VALUE_STRATEGY_OPTIONS, PREPROCESSING, VECTOR
from ludwig.schema import utils as schema_utils
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import register_preprocessor
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_preprocessor(VECTOR)
@ludwig_dataclass
class VectorPreprocessingConfig(BasePreprocessingConfig):
vector_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The size of the vector. If None, the vector size will be inferred from the data.",
parameter_metadata=FEATURE_METADATA[VECTOR][PREPROCESSING]["vector_size"],
)
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=FILL_WITH_CONST,
allow_none=False,
description="What strategy to follow when there's a missing value in a vector column",
parameter_metadata=FEATURE_METADATA[VECTOR][PREPROCESSING]["missing_value_strategy"],
)
fill_value: str = schema_utils.String(
default="",
allow_none=False,
pattern=r"^([0-9]+(\.[0-9]*)?\s*)*$",
description="The value to replace missing values with in case the missing_value_strategy is fill_with_const",
parameter_metadata=FEATURE_METADATA[VECTOR][PREPROCESSING]["fill_value"],
)
computed_fill_value: str = schema_utils.String(
default="",
allow_none=False,
pattern=r"^([0-9]+(\.[0-9]*)?\s*)*$",
description="The internally computed fill value to replace missing values with in case the "
"missing_value_strategy is fill_with_mode or fill_with_mean",
parameter_metadata=FEATURE_METADATA[VECTOR][PREPROCESSING]["computed_fill_value"],
)
@DeveloperAPI
@register_preprocessor("vector_output")
@ludwig_dataclass
class VectorOutputPreprocessingConfig(VectorPreprocessingConfig):
missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS,
default=DROP_ROW,
allow_none=False,
description="What strategy to follow when there's a missing value in a vector output feature",
parameter_metadata=FEATURE_METADATA[VECTOR][PREPROCESSING]["missing_value_strategy"],
)
vector_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The size of the vector. If None, the vector size will be inferred from the data.",
parameter_metadata=FEATURE_METADATA[VECTOR][PREPROCESSING]["vector_size"],
)
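# Illustrative sketch (not part of Ludwig): which strings the `fill_value` pattern above accepts.
# The pattern is copied verbatim from the schema; the helper name `is_valid_fill_value` is
# hypothetical. Note that the pattern admits only unsigned numbers with a leading digit, so
# negative values and forms such as ".5" are rejected.

```python
import re

# Same pattern as the `pattern=` argument of `fill_value` and `computed_fill_value`.
FILL_VALUE_PATTERN = re.compile(r"^([0-9]+(\.[0-9]*)?\s*)*$")


def is_valid_fill_value(candidate: str) -> bool:
    """Return True when the string is a whitespace-separated list of unsigned numbers."""
    return FILL_VALUE_PATTERN.match(candidate) is not None


for candidate in ["", "1.0 2.0 3.0", "1. 2", "-1.0", ".5", "1,2"]:
    print(repr(candidate), is_valid_fill_value(candidate))
```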
================================================
FILE: ludwig/schema/features/sequence_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import LOSS, MODEL_ECD, SEQUENCE, SEQUENCE_SOFTMAX_CROSS_ENTROPY
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(SEQUENCE)
@ludwig_dataclass
class SequenceInputFeatureConfigMixin(BaseMarshmallowConfig):
"""SequenceInputFeatureConfigMixin is a dataclass that configures the parameters used in both the sequence
input feature and the sequence global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=SEQUENCE)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=SEQUENCE,
default="embed",
)
@DeveloperAPI
@ecd_input_config_registry.register(SEQUENCE)
@ludwig_dataclass
class SequenceInputFeatureConfig(SequenceInputFeatureConfigMixin, BaseInputFeatureConfig):
"""SequenceInputFeatureConfig is a dataclass that configures the parameters used for a sequence input
feature."""
type: str = schema_utils.ProtectedString(SEQUENCE)
@DeveloperAPI
@output_mixin_registry.register(SEQUENCE)
@ludwig_dataclass
class SequenceOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""SequenceOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the sequence
output feature and the sequence global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=SEQUENCE,
default="generator",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=SEQUENCE,
default=SEQUENCE_SOFTMAX_CROSS_ENTROPY,
)
@DeveloperAPI
@ecd_output_config_registry.register(SEQUENCE)
@ludwig_dataclass
class SequenceOutputFeatureConfig(SequenceOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""SequenceOutputFeatureConfig is a dataclass that configures the parameters used for a sequence output
feature."""
type: str = schema_utils.ProtectedString(SEQUENCE)
default_validation_metric: str = schema_utils.StringOptions(
[LOSS],
default=LOSS,
description="Internal only use parameter: default validation metric for sequence output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[SEQUENCE]["dependencies"],
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="sequence_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[SEQUENCE]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[SEQUENCE]["reduce_input"],
)
@DeveloperAPI
@ecd_defaults_config_registry.register(SEQUENCE)
@ludwig_dataclass
class SequenceDefaultsConfig(SequenceInputFeatureConfigMixin, SequenceOutputFeatureConfigMixin):
pass
================================================
FILE: ludwig/schema/features/set_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import JACCARD, MODEL_ECD, SET, SIGMOID_CROSS_ENTROPY
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(SET)
@ludwig_dataclass
class SetInputFeatureConfigMixin(BaseMarshmallowConfig):
"""SetInputFeatureConfigMixin is a dataclass that configures the parameters used in both the set input feature
and the set global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=SET)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=SET,
default="embed",
)
@DeveloperAPI
@ecd_input_config_registry.register(SET)
@ludwig_dataclass
class SetInputFeatureConfig(SetInputFeatureConfigMixin, BaseInputFeatureConfig):
"""SetInputFeatureConfig is a dataclass that configures the parameters used for a set input feature."""
type: str = schema_utils.ProtectedString(SET)
@DeveloperAPI
@output_mixin_registry.register(SET)
@ludwig_dataclass
class SetOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""SetOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the set output
feature and the set global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=SET,
default="classifier",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=SET,
default=SIGMOID_CROSS_ENTROPY,
)
@DeveloperAPI
@ecd_output_config_registry.register(SET)
@ludwig_dataclass
class SetOutputFeatureConfig(SetOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""SetOutputFeatureConfig is a dataclass that configures the parameters used for a set output feature."""
type: str = schema_utils.ProtectedString(SET)
default_validation_metric: str = schema_utils.StringOptions(
[JACCARD],
default=JACCARD,
description="Internal only use parameter: default validation metric for set output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list = schema_utils.List(
default=[],
description="List of input features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[SET]["dependencies"],
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="set_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[SET]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[SET]["reduce_input"],
)
threshold: float = schema_utils.FloatRange(
default=0.5,
min=0,
max=1,
description="The threshold used to convert output probabilities to predictions. Tokens with predicted "
"probabilities greater than or equal to threshold are predicted to be in the output set (True).",
parameter_metadata=FEATURE_METADATA[SET]["threshold"],
)
@DeveloperAPI
@ecd_defaults_config_registry.register(SET)
@ludwig_dataclass
class SetDefaultsConfig(SetInputFeatureConfigMixin, SetOutputFeatureConfigMixin):
pass
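# Illustrative sketch (not part of Ludwig): how the `threshold` parameter of the set output
# feature maps per-token probabilities to a predicted set. The function name `predict_set` and
# the dict-based input are assumptions of this example; the comparison is inclusive, matching
# the "greater than or equal to" wording of the description.

```python
def predict_set(probabilities, threshold=0.5):
    """Keep every token whose predicted probability is at least `threshold`."""
    return {token for token, p in probabilities.items() if p >= threshold}


probs = {"red": 0.9, "green": 0.5, "blue": 0.1}
print(sorted(predict_set(probs)))
print(sorted(predict_set(probs, threshold=0.95)))
```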
================================================
FILE: ludwig/schema/features/text_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
LOSS,
MODEL_ECD,
MODEL_LLM,
NEXT_TOKEN_SOFTMAX_CROSS_ENTROPY,
SEQUENCE_SOFTMAX_CROSS_ENTROPY,
TEXT,
)
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
llm_defaults_config_registry,
llm_input_config_registry,
llm_output_config_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(TEXT)
@ludwig_dataclass
class TextInputFeatureConfigMixin(BaseMarshmallowConfig):
"""TextInputFeatureConfigMixin is a dataclass that configures the parameters used in both the text input
feature and the text global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=TEXT)
@DeveloperAPI
@ludwig_dataclass
class TextInputFeatureConfig(TextInputFeatureConfigMixin, BaseInputFeatureConfig):
"""TextInputFeatureConfig is a dataclass that configures the parameters used for a text input feature."""
type: str = schema_utils.ProtectedString(TEXT)
encoder: BaseEncoderConfig = None
@DeveloperAPI
@ecd_input_config_registry.register(TEXT)
@ludwig_dataclass
class ECDTextInputFeatureConfig(TextInputFeatureConfig):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=TEXT,
default="parallel_cnn",
)
@DeveloperAPI
@llm_input_config_registry.register(TEXT)
@ludwig_dataclass
class LLMTextInputFeatureConfig(TextInputFeatureConfig):
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="text_llm_input")
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_LLM,
feature_type=TEXT,
default="passthrough",
)
@DeveloperAPI
@output_mixin_registry.register(TEXT)
@ludwig_dataclass
class TextOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""TextOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the text output
feature and the text global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = None
loss: BaseLossConfig = LossDataclassField(
feature_type=TEXT,
default=SEQUENCE_SOFTMAX_CROSS_ENTROPY,
)
@DeveloperAPI
@ludwig_dataclass
class TextOutputFeatureConfig(TextOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""TextOutputFeatureConfig is a dataclass that configures the parameters used for a text output feature."""
type: str = schema_utils.ProtectedString(TEXT)
class_similarities: list = schema_utils.List(
list,
default=None,
description="If not null, this parameter is a c x c matrix in the form of a list of lists that contains the "
"mutual similarity of classes. It is used if `class_similarities_temperature` is greater than 0.",
parameter_metadata=FEATURE_METADATA[TEXT]["class_similarities"],
)
default_validation_metric: str = schema_utils.StringOptions(
[LOSS],
default=LOSS,
description="Internal only use parameter: default validation metric for text output feature.",
parameter_metadata=INTERNAL_ONLY,
)
dependencies: list = schema_utils.List(
default=[],
description="List of output features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[TEXT]["dependencies"],
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="text_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[TEXT]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default="sum",
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[TEXT]["reduce_input"],
)
@DeveloperAPI
@ecd_output_config_registry.register(TEXT)
@ludwig_dataclass
class ECDTextOutputFeatureConfig(TextOutputFeatureConfig):
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=TEXT,
default="generator",
)
@DeveloperAPI
@llm_output_config_registry.register(TEXT)
@ludwig_dataclass
class LLMTextOutputFeatureConfig(TextOutputFeatureConfig):
default_validation_metric: str = schema_utils.StringOptions(
[LOSS],
default=LOSS,
description="Internal only use parameter: default validation metric for text output feature for LLMs.",
parameter_metadata=INTERNAL_ONLY,
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="text_llm_output")
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_LLM,
feature_type=TEXT,
default="text_extractor",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=TEXT,
default=NEXT_TOKEN_SOFTMAX_CROSS_ENTROPY,
)
@DeveloperAPI
@ecd_defaults_config_registry.register(TEXT)
@ludwig_dataclass
class ECDTextDefaultsConfig(TextInputFeatureConfigMixin, TextOutputFeatureConfigMixin):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=TEXT,
default="parallel_cnn",
)
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=TEXT,
default="generator",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=TEXT,
default=SEQUENCE_SOFTMAX_CROSS_ENTROPY,
)
@DeveloperAPI
@llm_defaults_config_registry.register(TEXT)
@ludwig_dataclass
class LLMTextDefaultsConfig(TextInputFeatureConfigMixin, TextOutputFeatureConfigMixin):
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_LLM,
feature_type=TEXT,
default="passthrough",
)
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_LLM,
feature_type=TEXT,
default="text_extractor",
)
# TODO(Arnav): Refactor LossDataclassField to only accept loss types that are valid for the model
loss: BaseLossConfig = LossDataclassField(
feature_type=TEXT,
default=NEXT_TOKEN_SOFTMAX_CROSS_ENTROPY,
)
================================================
FILE: ludwig/schema/features/timeseries_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import HUBER, MEAN_SQUARED_ERROR, MODEL_ECD, TIMESERIES, VECTOR
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(TIMESERIES)
@ludwig_dataclass
class TimeseriesInputFeatureConfigMixin(BaseMarshmallowConfig):
"""TimeseriesInputFeatureConfigMixin is a dataclass that configures the parameters used in both the timeseries
input feature and the timeseries global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=TIMESERIES)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=TIMESERIES,
default="parallel_cnn",
)
@DeveloperAPI
@ecd_input_config_registry.register(TIMESERIES)
@ludwig_dataclass
class TimeseriesInputFeatureConfig(TimeseriesInputFeatureConfigMixin, BaseInputFeatureConfig):
"""TimeseriesInputFeatureConfig is a dataclass that configures the parameters used for a timeseries input
feature."""
type: str = schema_utils.ProtectedString(TIMESERIES)
@DeveloperAPI
@output_mixin_registry.register(TIMESERIES)
@ludwig_dataclass
class TimeseriesOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""TimeseriesOutputFeatureConfigMixin configures the parameters used in both the timeseries output feature and
the timeseries global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=TIMESERIES,
default="projector",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=TIMESERIES,
default=HUBER,
)
@DeveloperAPI
@ecd_output_config_registry.register(TIMESERIES)
@ludwig_dataclass
class TimeseriesOutputFeatureConfig(BaseOutputFeatureConfig, TimeseriesOutputFeatureConfigMixin):
"""TimeseriesOutputFeatureConfig configures the parameters used for a timeseries output feature."""
type: str = schema_utils.ProtectedString(TIMESERIES)
dependencies: list = schema_utils.List(
default=[],
description="List of output features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[VECTOR]["dependencies"],
)
default_validation_metric: str = schema_utils.StringOptions(
[MEAN_SQUARED_ERROR],
default=MEAN_SQUARED_ERROR,
description="Internal parameter.",
parameter_metadata=INTERNAL_ONLY,
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="timeseries_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default=None,
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[VECTOR]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default=None,
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[VECTOR]["reduce_input"],
)
horizon: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Internal parameter. Obtained from preprocessing.",
parameter_metadata=INTERNAL_ONLY,
)
@DeveloperAPI
@ecd_defaults_config_registry.register(TIMESERIES)
@ludwig_dataclass
class TimeseriesDefaultsConfig(TimeseriesInputFeatureConfigMixin, TimeseriesOutputFeatureConfigMixin):
pass
================================================
FILE: ludwig/schema/features/utils.py
================================================
from collections import defaultdict
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MODEL_ECD, MODEL_LLM
from ludwig.schema import utils as schema_utils
from ludwig.utils.registry import Registry
input_config_registries = defaultdict(Registry)
output_config_registries = defaultdict(Registry)
ecd_input_config_registry = input_config_registries[MODEL_ECD]
llm_input_config_registry = input_config_registries[MODEL_LLM]
ecd_output_config_registry = output_config_registries[MODEL_ECD]
llm_output_config_registry = output_config_registries[MODEL_LLM]
input_mixin_registry = Registry()
output_mixin_registry = Registry()
"""ECD models support the full range of feature parameters available in Ludwig, so any feature schema can be
registered into it.
See `BinaryDefaultsConfig` for an example.
"""
ecd_defaults_config_registry = Registry()
llm_defaults_config_registry = Registry()
def input_config_registry(model_type: str) -> Registry:
return input_config_registries[model_type]
def output_config_registry(model_type: str) -> Registry:
return output_config_registries[model_type]
@DeveloperAPI
def get_input_feature_cls(model_type: str, name: str):
# TODO(travis): not needed once we remove existing model config implementation
return input_config_registries[model_type][name]
@DeveloperAPI
def get_output_feature_cls(model_type: str, name: str):
# TODO(ksbrar): What is this?
return output_config_registries[model_type][name]
@DeveloperAPI
def get_input_feature_jsonschema(model_type: str):
"""This function returns a JSON schema that requires only the `name` and `type` keys and then conditionally
applies the corresponding input feature's field constraints.
Returns: JSON Schema
"""
input_feature_types = sorted(list(input_config_registry(model_type).keys()))
schema = {
"type": "object",
"properties": {
"name": {"type": "string", "title": "name", "description": "Name of the input feature."},
"type": {
"type": "string",
"enum": input_feature_types,
"title": "type",
"description": "Type of the input feature",
},
"column": {"type": "string", "title": "column", "description": "Name of the column."},
},
"additionalProperties": True,
"allOf": get_input_feature_conds(model_type),
"required": ["name", "type"],
"title": "input_feature",
}
return schema
@DeveloperAPI
def get_input_feature_conds(model_type: str):
"""This function returns a list of if-then JSON clauses for each input feature type along with their properties
and constraints.
Returns: List of JSON clauses
"""
input_feature_types = sorted(list(input_config_registry(model_type).keys()))
conds = []
for feature_type in input_feature_types:
schema_cls = get_input_feature_cls(model_type, feature_type)
feature_schema = schema_utils.unload_jsonschema_from_marshmallow_class(schema_cls)
feature_props = feature_schema["properties"]
schema_utils.remove_duplicate_fields(feature_props)
feature_cond = schema_utils.create_cond({"type": feature_type}, feature_props)
conds.append(feature_cond)
return conds
@DeveloperAPI
def get_output_feature_jsonschema(model_type: str):
"""This function returns a JSON schema that requires only the `name` and `type` keys and then conditionally
applies the corresponding output feature's field constraints.
Returns: JSON Schema
"""
output_feature_types = sorted(list(output_config_registry(model_type).keys()))
schema = {
"type": "object",
"properties": {
"name": {"type": "string", "title": "name", "description": "Name of the output feature."},
"type": {
"type": "string",
"enum": output_feature_types,
"title": "type",
"description": "Type of the output feature",
},
"column": {"type": "string", "title": "column", "description": "Name of the column."},
},
"additionalProperties": True,
"allOf": get_output_feature_conds(model_type),
"required": ["name", "type"],
"title": "output_feature",
}
return schema
@DeveloperAPI
def get_output_feature_conds(model_type: str):
"""This function returns a list of if-then JSON clauses for each output feature type along with their
properties and constraints.
Returns: List of JSON clauses
"""
output_feature_types = sorted(list(output_config_registry(model_type).keys()))
conds = []
for feature_type in output_feature_types:
schema_cls = get_output_feature_cls(model_type, feature_type)
feature_schema = schema_utils.unload_jsonschema_from_marshmallow_class(schema_cls)
feature_props = feature_schema["properties"]
schema_utils.remove_duplicate_fields(feature_props)
feature_cond = schema_utils.create_cond({"type": feature_type}, feature_props)
conds.append(feature_cond)
return conds
================================================
FILE: ludwig/schema/features/vector_feature.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MEAN_SQUARED_ERROR, MODEL_ECD, VECTOR
from ludwig.schema import utils as schema_utils
from ludwig.schema.decoders.base import BaseDecoderConfig
from ludwig.schema.decoders.utils import DecoderDataclassField
from ludwig.schema.encoders.base import BaseEncoderConfig
from ludwig.schema.encoders.utils import EncoderDataclassField
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig
from ludwig.schema.features.loss.loss import BaseLossConfig
from ludwig.schema.features.loss.utils import LossDataclassField
from ludwig.schema.features.preprocessing.base import BasePreprocessingConfig
from ludwig.schema.features.preprocessing.utils import PreprocessingDataclassField
from ludwig.schema.features.utils import (
ecd_defaults_config_registry,
ecd_input_config_registry,
ecd_output_config_registry,
input_mixin_registry,
output_mixin_registry,
)
from ludwig.schema.metadata import FEATURE_METADATA
from ludwig.schema.metadata.parameter_metadata import INTERNAL_ONLY
from ludwig.schema.utils import BaseMarshmallowConfig, ludwig_dataclass
@DeveloperAPI
@input_mixin_registry.register(VECTOR)
@ludwig_dataclass
class VectorInputFeatureConfigMixin(BaseMarshmallowConfig):
"""VectorInputFeatureConfigMixin is a dataclass that configures the parameters used in both the vector input
feature and the vector global defaults section of the Ludwig Config."""
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type=VECTOR)
encoder: BaseEncoderConfig = EncoderDataclassField(
MODEL_ECD,
feature_type=VECTOR,
default="dense",
)
@DeveloperAPI
@ecd_input_config_registry.register(VECTOR)
@ludwig_dataclass
class VectorInputFeatureConfig(VectorInputFeatureConfigMixin, BaseInputFeatureConfig):
"""VectorInputFeatureConfig is a dataclass that configures the parameters used for a vector input feature."""
type: str = schema_utils.ProtectedString(VECTOR)
@DeveloperAPI
@output_mixin_registry.register(VECTOR)
@ludwig_dataclass
class VectorOutputFeatureConfigMixin(BaseMarshmallowConfig):
"""VectorOutputFeatureConfigMixin is a dataclass that configures the parameters used in both the vector output
feature and the vector global defaults section of the Ludwig Config."""
decoder: BaseDecoderConfig = DecoderDataclassField(
MODEL_ECD,
feature_type=VECTOR,
default="projector",
)
loss: BaseLossConfig = LossDataclassField(
feature_type=VECTOR,
default=MEAN_SQUARED_ERROR,
)
@DeveloperAPI
@ecd_output_config_registry.register(VECTOR)
@ludwig_dataclass
class VectorOutputFeatureConfig(VectorOutputFeatureConfigMixin, BaseOutputFeatureConfig):
"""VectorOutputFeatureConfig is a dataclass that configures the parameters used for a vector output feature."""
type: str = schema_utils.ProtectedString(VECTOR)
dependencies: list = schema_utils.List(
default=[],
description="List of output features that this feature depends on.",
parameter_metadata=FEATURE_METADATA[VECTOR]["dependencies"],
)
default_validation_metric: str = schema_utils.StringOptions(
[MEAN_SQUARED_ERROR],
default=MEAN_SQUARED_ERROR,
description="Internal only use parameter: default validation metric for vector output feature.",
parameter_metadata=INTERNAL_ONLY,
)
preprocessing: BasePreprocessingConfig = PreprocessingDataclassField(feature_type="vector_output")
reduce_dependencies: str = schema_utils.ReductionOptions(
default=None,
description="How to reduce the dependencies of the output feature.",
parameter_metadata=FEATURE_METADATA[VECTOR]["reduce_dependencies"],
)
reduce_input: str = schema_utils.ReductionOptions(
default=None,
description="How to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first "
"dimension (second if you count the batch dimension)",
parameter_metadata=FEATURE_METADATA[VECTOR]["reduce_input"],
)
softmax: bool = schema_utils.Boolean(
default=False,
description="Determines whether to apply a softmax at the end of the decoder. This is useful for predicting a "
"vector of values that sum up to 1 and can be interpreted as probabilities.",
parameter_metadata=FEATURE_METADATA[VECTOR]["softmax"],
)
vector_size: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The size of the vector. If None, the vector size will be inferred from the data.",
parameter_metadata=FEATURE_METADATA[VECTOR]["vector_size"],
)
@DeveloperAPI
@ecd_defaults_config_registry.register(VECTOR)
@ludwig_dataclass
class VectorDefaultsConfig(VectorInputFeatureConfigMixin, VectorOutputFeatureConfigMixin):
pass
================================================
FILE: ludwig/schema/hyperopt/__init__.py
================================================
from abc import ABC
import ludwig.schema.hyperopt.parameter # noqa: F401
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import LOSS, TEST, TRAIN, VALIDATION
from ludwig.modules import metric_modules # noqa: Needed to ensure that the metric registry is populated.
from ludwig.modules.metric_registry import get_metric_registry
from ludwig.schema import utils as schema_utils
from ludwig.schema.hyperopt.executor import ExecutorConfig, ExecutorDataclassField
from ludwig.schema.hyperopt.search_algorithm import BaseSearchAlgorithmConfig, SearchAlgorithmDataclassField
from ludwig.schema.utils import ludwig_dataclass as dataclass
@DeveloperAPI
@dataclass
class HyperoptConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Basic hyperopt settings."""
output_feature: str = schema_utils.String( # TODO: make more restrictive
default="combined",
description=(
"The name of the output feature that we want to optimize the metric or loss of. Available values "
"are `combined` or the name of any output feature provided in the configuration. `combined` is a special "
"output feature that allows optimizing for the aggregated loss and metrics of all output features."
),
)
goal: str = schema_utils.StringOptions(
options=["minimize", "maximize"],
default="minimize",
allow_none=False,
description=(
"Indicates whether to minimize or maximize a metric or a loss of any of the output features on any of the "
"dataset splits. Available values are: minimize (default) or maximize."
),
)
metric: str = schema_utils.StringOptions(
options=get_metric_registry().keys(),
default=LOSS,
allow_none=False,
description=(
"The metric that we want to optimize for. The default one is loss, but depending on the type of the "
"feature defined in output_feature, different metrics and losses are available. Check the metrics section "
"of the specific output feature type to figure out what metrics are available to use."
),
)
split: str = schema_utils.StringOptions(
options=[TRAIN, VALIDATION, TEST],
default=VALIDATION,
allow_none=False,
description=(
"The split of data that we want to compute our metric on. By default it is the validation split, but "
"you can also specify the train or test split."
),
)
eval_split: str = schema_utils.StringOptions(
options=[TRAIN, VALIDATION, TEST],
default=VALIDATION,
allow_none=False,
description=(
"The split of data that we want to run evaluation on. By default it is the validation split, but "
"you can also specify the train or test split."
),
)
search_alg: BaseSearchAlgorithmConfig = SearchAlgorithmDataclassField(
description=(
"Specifies the algorithm to sample the defined parameters space. Candidate algorithms are those "
"found in [Ray Tune's Search Algorithms](https://docs.ray.io/en/latest/tune/api/suggestion.html)."
)
)
executor: ExecutorConfig = ExecutorDataclassField(
description=(
"Specifies how to execute the hyperparameter optimization. The execution could happen locally in a serial "
"manner or in parallel across multiple workers and with GPUs as well if available. The executor section "
"includes specification for work scheduling and the number of samples to generate."
)
)
parameters: dict = schema_utils.Dict(
allow_none=False,
description=(
"This section consists of a set of hyperparameters to optimize. They are provided as keys (the names of "
"the parameters) and values associated with them (that define the search space). The values vary depending "
"on the type of the hyperparameter. Syntax for this section is based on [Ray Tune's Search Space "
"parameters](https://docs.ray.io/en/latest/tune/api/search_space.html)."
),
)
@DeveloperAPI
def get_hyperopt_jsonschema():
props = schema_utils.unload_jsonschema_from_marshmallow_class(HyperoptConfig)["properties"]
return {
"type": ["object", "null"],
"properties": props,
"title": "hyperopt_options",
"description": "Settings for hyperopt",
}
@DeveloperAPI
class HyperoptField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(HyperoptConfig, default_missing=True)
def _jsonschema_type_mapping(self):
return get_hyperopt_jsonschema()
================================================
FILE: ludwig/schema/hyperopt/executor.py
================================================
from dataclasses import field
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import RAY
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.hyperopt.scheduler import BaseSchedulerConfig, SchedulerDataclassField
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class ExecutorConfig(schema_utils.BaseMarshmallowConfig):
"""Basic executor settings."""
type: str = schema_utils.ProtectedString(RAY)
num_samples: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=(
"This parameter, along with the `space` specifications in the `parameters` section, controls how many "
"trials are generated."
),
)
time_budget_s: int = schema_utils.PositiveInteger(
default=3600, allow_none=True, description="The number of seconds for the entire hyperopt run."
)
trial_driver_resources: dict[str, float] = schema_utils.Dict(
default=None,
description=(
"The resources reserved by each trial driver. This differs from cpu_resources_per_trial and "
"gpu_resources_per_trial because these resources are reserved for the driver, not its subsequent "
"workers. Only used when the trials themselves are on the Ray backend. Defaults to 1 CPU."
),
)
cpu_resources_per_trial: int = schema_utils.PositiveInteger(
default=1, description="The number of CPU cores allocated to each trial"
)
gpu_resources_per_trial: int = schema_utils.NonNegativeInteger(
default=0, description="The number of GPU devices allocated to each trial"
)
kubernetes_namespace: str | None = schema_utils.String(
default=None,
allow_none=True,
description=(
"When running on Kubernetes, provide the namespace of the Ray cluster to sync results between "
"pods. See the Ray docs for more info."
),
)
max_concurrent_trials: str | int | None = schema_utils.OneOfOptionsField(
default="auto",
allow_none=True,
description=("The maximum number of trials to train concurrently. Defaults to auto if not specified."),
field_options=[
schema_utils.PositiveInteger(
default=1, allow_none=False, description="Manually set a number of concurrent trials."
),
schema_utils.StringOptions(
options=["auto"],
default="auto",
allow_none=False,
description="Automatically set number of concurrent trials.",
),
],
)
scheduler: BaseSchedulerConfig = SchedulerDataclassField(description="")
@DeveloperAPI
def ExecutorDataclassField(description: str, default: dict = {}):
class ExecutorMarshmallowField(schema_utils.LudwigSchemaField):
def _deserialize(self, value, attr, data, **kwargs):
if isinstance(value, dict):
try:
return ExecutorConfig.Schema().load(value)
except (TypeError, ConfigValidationError) as e:
raise ConfigValidationError(f"Invalid params for executor: {value}, see ExecutorConfig class.") from e
raise ConfigValidationError("Field should be dict")
def _jsonschema_type_mapping(self):
return {
**schema_utils.unload_jsonschema_from_marshmallow_class(ExecutorConfig),
"title": "executor",
"description": description,
}
if not isinstance(default, dict):
raise ConfigValidationError(f"Invalid default: `{default}`")
load_default = lambda: ExecutorConfig.Schema().load(default)
dump_default = ExecutorConfig.Schema().dump(default)
return field(
metadata={
"marshmallow_field": ExecutorMarshmallowField(
allow_none=False,
load_default=load_default,
dump_default=dump_default,
metadata={"description": description, "parameter_metadata": None},
)
},
default_factory=load_default,
)
================================================
FILE: ludwig/schema/hyperopt/parameter.py
================================================
from pydantic.fields import FieldInfo
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.hyperopt.utils import register_parameter_config
from ludwig.schema.utils import ludwig_dataclass
def quantization_number_field(dtype: type[float] | type[int] = float, default=None) -> FieldInfo:
description = (
"Quantization number. Output values will be rounded to the nearest increment of `q` in range. "
"Quantization makes the upper bound inclusive."
)
if dtype is int:
field = schema_utils.Integer(default=default, allow_none=True, description=description)
else:
field = schema_utils.FloatRange(default=default, allow_none=True, description=description)
return field
def log_base_field(default: float = 10) -> FieldInfo:
return schema_utils.FloatRange(default=default, description="Logarithmic base.")
@DeveloperAPI
@register_parameter_config("choice")
@ludwig_dataclass
class ChoiceParameterConfig(schema_utils.BaseMarshmallowConfig):
"""Config for a randomly sampled categorical search space."""
space: str = schema_utils.ProtectedString("choice")
categories: list = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
description=(
"The list of values to use in creating the categorical space. The type of each value of the list is "
"general, i.e., they could be strings, integers, floats and anything else, even entire dictionaries."
),
field_options=[
schema_utils.List(list_type=float, allow_none=False, description="The list of floats to randomly sample."),
schema_utils.List(list_type=int, allow_none=False, description="The list of integers to randomly sample."),
schema_utils.List(list_type=str, allow_none=False, description="The list of strings to randomly sample."),
schema_utils.List(
list_type=list,
inner_type=dict,
allow_none=False,
description="The list of lists of configs to randomly sample.",
),
schema_utils.DictList(allow_none=False, description="A list of nested config parameters to sample."),
],
)
@DeveloperAPI
@register_parameter_config("grid_search")
@ludwig_dataclass
class GridSearchParameterConfig(schema_utils.BaseMarshmallowConfig):
"""Config for a grid search space."""
space: str = schema_utils.ProtectedString("grid_search")
values: list = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
description=(
"The list of values to use in creating the grid search space. The type of each value of the list is "
"general, i.e., they could be strings, integers, floats and anything else, even entire dictionaries."
),
field_options=[
schema_utils.List(list_type=float, allow_none=False, description="The list of floats to search over."),
schema_utils.List(list_type=int, allow_none=False, description="The list of integers to search over."),
schema_utils.List(list_type=str, allow_none=False, description="The list of strings to search over."),
],
)
@DeveloperAPI
@register_parameter_config("uniform")
@ludwig_dataclass
class UniformParameterConfig(schema_utils.BaseMarshmallowConfig):
"""Config for a real-valued uniform search space."""
space: str = schema_utils.ProtectedString("uniform")
lower: float = schema_utils.FloatRange(default=None, description="The minimum value the parameter can have.")
upper: float = schema_utils.FloatRange(default=None, description="The maximum value the parameter can have.")
@DeveloperAPI
@register_parameter_config("quniform")
@ludwig_dataclass
class QUniformParameterConfig(UniformParameterConfig):
"""Config for a real-valued uniform search space with quantization."""
space: str = schema_utils.ProtectedString("quniform")
q: float = quantization_number_field()
@DeveloperAPI
@register_parameter_config("loguniform")
@ludwig_dataclass
class LogUniformParameterConfig(UniformParameterConfig):
"""Config for a log-scaled real-valued uniform numeric search space."""
space: str = schema_utils.ProtectedString("loguniform")
base: float = log_base_field()
@DeveloperAPI
@register_parameter_config("qloguniform")
@ludwig_dataclass
class QLogUniformParameterConfig(UniformParameterConfig):
"""Config for a log-scaled real-valued uniform search space with quantization."""
space: str = schema_utils.ProtectedString("qloguniform")
q: float = quantization_number_field()
base: float = log_base_field()
@DeveloperAPI
@register_parameter_config("randn")
@ludwig_dataclass
class RandnParameterConfig(schema_utils.BaseMarshmallowConfig):
"""Config for a Gaussian search space."""
space: str = schema_utils.ProtectedString("randn")
mean: float = schema_utils.FloatRange(default=0.0, description="Mean of the normal distribution.")
sd: float = schema_utils.FloatRange(default=1.0, description="Standard deviation of the normal distribution.")
@DeveloperAPI
@register_parameter_config("qrandn")
@ludwig_dataclass
class QRandnParameterConfig(RandnParameterConfig):
"""Config for a Gaussian search space with quantization."""
space: str = schema_utils.ProtectedString("qrandn")
q: float = quantization_number_field()
@DeveloperAPI
@register_parameter_config("randint")
@ludwig_dataclass
class RandintParameterConfig(schema_utils.BaseMarshmallowConfig):
"""Config for an integer-valued uniform search space."""
space: str = schema_utils.ProtectedString("randint")
lower: int = schema_utils.Integer(default=None, description="The minimum value the parameter can have.")
upper: int = schema_utils.Integer(default=None, description="The maximum value the parameter can have.")
@DeveloperAPI
@register_parameter_config("qrandint")
@ludwig_dataclass
class QRandintParameterConfig(RandintParameterConfig):
"""Config for an integer-valued uniform search space with quantization."""
space: str = schema_utils.ProtectedString("qrandint")
q: int = quantization_number_field(dtype=int)
@DeveloperAPI
@register_parameter_config("lograndint")
@ludwig_dataclass
class LogRandintParameterConfig(RandintParameterConfig):
"""Config for a log-scaled integer-valued search space."""
space: str = schema_utils.ProtectedString("lograndint")
base: float = log_base_field()
@DeveloperAPI
@register_parameter_config("qlograndint")
@ludwig_dataclass
class QLogRandintParameterConfig(RandintParameterConfig):
"""Config for a log-scaled integer-valued search space with quantization."""
space: str = schema_utils.ProtectedString("qlograndint")
q: int = quantization_number_field(dtype=int)
base: float = log_base_field()
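# Illustrative sketch (not part of Ludwig): what a `qloguniform` space means, assuming the usual convention of
# drawing log-uniformly between `lower` and `upper` and then rounding to the nearest multiple of `q`. The helper
# name `sample_qloguniform` and its signature are hypothetical; the config classes above only describe the space.

```python
import math
import random


def sample_qloguniform(lower, upper, q, base=10.0, rng=None):
    """Draw one value log-uniformly from [lower, upper] and quantize it to a multiple of q."""
    rng = rng or random.Random()
    # Sample uniformly in log space, then map back to the original scale.
    log_lower = math.log(lower, base)
    log_upper = math.log(upper, base)
    value = base ** rng.uniform(log_lower, log_upper)
    # Quantization: snap the sample to the nearest multiple of q.
    return round(value / q) * q


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_qloguniform(0.001, 0.1, 0.001, rng=rng) for _ in range(3)])
```

# Dropping the final rounding step gives the plain `loguniform` behaviour under the same assumption.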
================================================
FILE: ludwig/schema/hyperopt/scheduler.py
================================================
from abc import ABC
from collections.abc import Callable
from dataclasses import field
from importlib import import_module
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.hyperopt import utils as hyperopt_utils
from ludwig.schema.utils import ludwig_dataclass
# ----------------------------------------------------------------------------------------------------------------------
# To prevent direct dependency on ray import, the following static key stores are duplicated:
# from ray.tune.schedulers import SCHEDULER_IMPORT
# https://github.com/ray-project/ray/blob/137a1b12c3b31a3622fa5f721a05a64e9b559b05/python/ray/tune/schedulers/__init__.py#L28
# from ray.tune.result import DEFAULT_RESULT_KEYS
# Taken from https://github.com/ray-project/ray/blob/137a1b12c3b31a3622fa5f721a05a64e9b559b05/python/ray/tune/result.py
TRAINING_ITERATION = "training_iteration"
TIME_TOTAL_S = "time_total_s"
TIMESTEPS_TOTAL = "timesteps_total"
MEAN_ACCURACY = "mean_accuracy"
MEAN_LOSS = "mean_loss"
DEFAULT_RESULT_KEYS = (TRAINING_ITERATION, TIME_TOTAL_S, TIMESTEPS_TOTAL, MEAN_ACCURACY, MEAN_LOSS)
# from ray.tune.result import DEFAULT_METRIC
RAY_TUNE_DESULT_DEFAULT_METRIC = "_metric"
# ----------------------------------------------------------------------------------------------------------------------
# Field aliases to cut down on code duplication:
@DeveloperAPI
def metric_alias(default=None):
return schema_utils.StringOptions(
options=list(DEFAULT_RESULT_KEYS) + [RAY_TUNE_DESULT_DEFAULT_METRIC],
default=default,
allow_none=default is None,
description=(
"The training result objective value attribute. Stopping procedures will use this attribute. If None but a "
"mode was passed, `ray.tune.result.DEFAULT_METRIC` will be used by default."
),
)
@DeveloperAPI
def time_attr_alias(default=TRAINING_ITERATION):
return schema_utils.StringOptions(
options=list(DEFAULT_RESULT_KEYS),
default=default,
allow_none=False,
description=(
"A training result attr to use for comparing time. Note that you can pass in something non-temporal such as"
" training_iteration as a measure of progress, the only requirement is that the attribute should increase "
"monotonically."
),
)
@DeveloperAPI
def max_t_alias(default=100):
return schema_utils.PositiveInteger(
default=default,
description=(
"Maximum time units per trial. Trials will be stopped after `max_t` time units (determined by `time_attr`) "
"have passed."
),
)
@DeveloperAPI
@ludwig_dataclass
class BaseSchedulerConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Base class for schedulers.
Not meant to be used directly.
"""
type: str
"""Name corresponding to a scheduler in `ludwig.schema.hyperopt.scheduler.scheduler_registry`.
Technically mutable, but attempting to load a derived scheduler with `type` set to a mismatched value will result in
a `ValidationError`.
"""
time_attr: str = time_attr_alias()
metric: str | None = metric_alias()
mode: str | None = schema_utils.StringOptions(
options=["min", "max"],
default=None,
allow_none=True,
description=(
"One of {min, max}. Determines whether objective is minimizing or maximizing the metric attribute."
),
)
def dependencies_installed(self) -> bool:
"""Some schedulers require additional packages to be installed, check that they are available."""
missing_packages = []
missing_installs = []
for package_name, install_name in hyperopt_utils.get_scheduler_dependencies(self.type):
try:
import_module(package_name)
except ImportError:
missing_packages.append(package_name)
missing_installs.append(install_name)
if missing_packages:
missing_packages = ", ".join(missing_packages)
missing_installs = " ".join(missing_installs)
raise ImportError(
f"Some packages needed to use hyperopt scheduler {self.type} are not installed: "
f"{missing_packages}. To add these dependencies, run `pip install {missing_installs}`. For more "
"details, please refer to Ray Tune documentation for this scheduler."
)
return True
@DeveloperAPI
@ludwig_dataclass
class BaseHyperbandSchedulerConfig(BaseSchedulerConfig):
max_t: int = max_t_alias()
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("async_hyperband")
@hyperopt_utils.register_scheduler_config("asynchyperband")
@hyperopt_utils.register_scheduler_config("asha")
@ludwig_dataclass
class AsyncHyperbandSchedulerConfig(BaseHyperbandSchedulerConfig):
"""Asynchronous hyperband (ASHA) scheduler settings."""
type: str = schema_utils.ProtectedString("async_hyperband")
max_t: int = max_t_alias()
grace_period: int = schema_utils.PositiveInteger(
default=1,
description=(
"Only stop trials at least this old in time. The units are the same as the attribute named by `time_attr`."
),
)
reduction_factor: float = schema_utils.NonNegativeFloat(
default=4, description=("Used to set halving rate and amount. This is simply a unit-less scalar.")
)
brackets: int = schema_utils.PositiveInteger(
default=1,
description=(
"Number of brackets. Each bracket has a different halving rate, specified by the reduction factor."
),
)
stop_last_trials: bool = schema_utils.Boolean(
default=True, description="Whether to terminate the trials after reaching `max_t`."
)
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("hyperband")
@ludwig_dataclass
class HyperbandSchedulerConfig(BaseHyperbandSchedulerConfig):
"""Standard hyperband scheduler settings."""
type: str = schema_utils.ProtectedString("hyperband")
max_t: int = max_t_alias(default=81)
reduction_factor: float = schema_utils.NonNegativeFloat(
default=3, description=("Used to set halving rate and amount. This is simply a unit-less scalar.")
)
stop_last_trials: bool = schema_utils.Boolean(
default=True, description=("Whether to terminate the trials after reaching max_t. Defaults to True.")
)
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("median_stopping_rule")
@hyperopt_utils.register_scheduler_config("medianstoppingrule")
@ludwig_dataclass
class MedianStoppingRuleSchedulerConfig(BaseSchedulerConfig):
"""Median Stopping Rule scheduler settings."""
type: str = schema_utils.ProtectedString("median_stopping_rule")
time_attr: str = time_attr_alias(TIME_TOTAL_S)
grace_period: float = schema_utils.NonNegativeFloat(
default=60.0,
description=(
"Only stop trials at least this old in time. The mean will only be computed from this time onwards. The "
"units are the same as the attribute named by `time_attr`."
),
)
min_samples_required: int = schema_utils.PositiveInteger(
default=3, description=("Minimum number of trials to compute median over.")
)
min_time_slice: int = schema_utils.NonNegativeInteger(
default=0,
description=(
"Each trial runs at least this long before yielding (assuming it isn't stopped). Note: trials ONLY yield "
"if there are not enough samples to evaluate performance for the current result AND there are other "
"trials waiting to run. The units are the same as the attribute named by `time_attr`."
),
)
hard_stop: bool = schema_utils.Boolean(
default=True,
description=(
"If False, pauses trials instead of stopping them. When all other trials are complete, paused trials will "
"be resumed and allowed to run FIFO."
),
)
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("pbt")
@ludwig_dataclass
class PopulationBasedTrainingSchedulerConfig(BaseSchedulerConfig):
"""Population Based Training scheduler settings."""
type: str = schema_utils.ProtectedString("pbt")
time_attr: str = time_attr_alias(TIME_TOTAL_S)
perturbation_interval: float = schema_utils.NonNegativeFloat(
default=60.0,
description=(
"Models will be considered for perturbation at this interval of `time_attr`. Note that perturbation incurs "
"checkpoint overhead, so you shouldn't set this to be too frequent."
),
)
burn_in_period: float = schema_utils.NonNegativeFloat(
default=60.0,
description=(
"Models will not be considered for perturbation before this interval of time_attr has passed. This "
"guarantees that models are trained for at least a certain amount of time or timesteps before being "
"perturbed."
),
)
hyperparam_mutations: dict | None = schema_utils.Dict(
default=None,
description=(
"Hyperparams to mutate. The format is as follows: for each key, either a list, function, or a tune search "
"space object (`tune.loguniform`, tune.uniform, etc.) can be provided. A list specifies an allowed set of "
"categorical values. A function or tune search space object specifies the distribution of a continuous "
"parameter. You must use `tune.choice`, `tune.uniform`, `tune.loguniform`, etc.. Arbitrary "
"`tune.sample_from` objects are not supported. A key can also hold a dict for nested hyperparameters. You "
"must specify at least one of `hyperparam_mutations` or `custom_explore_fn`. Tune will sample the search "
"space provided by `hyperparam_mutations` for the initial hyperparameter values if the corresponding "
"hyperparameters are not present in a trial's initial config."
),
)
quantile_fraction: float = schema_utils.FloatRange(
default=0.25,
allow_none=False,
min=0,
max=0.5,
description=(
"Parameters are transferred from the top `quantile_fraction` fraction of trials to the bottom "
"`quantile_fraction` fraction. Needs to be between 0 and 0.5. Setting it to 0 essentially implies doing "
"no exploitation at all."
),
)
resample_probability: float = schema_utils.NonNegativeFloat(
default=0.25,
description=(
"The probability of resampling from the original distribution when applying `hyperparam_mutations`. If "
"not resampled, the value will be perturbed by a factor chosen from `perturbation_factors` if continuous, "
"or changed to an adjacent value if discrete."
),
)
perturbation_factors: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(1.2, 0.8),
allow_none=False,
max=None,
description=("Scaling factors to choose between when mutating a continuous hyperparameter."),
)
# TODO: Add schema support for Callable
custom_explore_fn: str | Callable = schema_utils.String(
default=None,
allow_none=True,
description=(
"You can also specify a custom exploration function. This function is invoked as `f(config)` after "
"built-in perturbations from `hyperparam_mutations` are applied, and should return config updated as "
"needed. You must specify at least one of `hyperparam_mutations` or `custom_explore_fn`."
),
)
log_config: bool = schema_utils.Boolean(
default=True,
description=(
"Whether to log the ray config of each model to `local_dir` at each exploit. Allows config schedule to be "
"reconstructed."
),
)
require_attrs: bool = schema_utils.Boolean(
default=True,
description=(
"Whether to require `time_attr` and metric to appear in result for every iteration. If True, error will "
"be raised if these values are not present in trial result."
),
)
synch: bool = schema_utils.Boolean(
default=False,
description=(
"If False, will use asynchronous implementation of PBT. Trial perturbations occur every "
"`perturbation_interval` for each trial independently. If True, will use synchronous implementation of "
"PBT. Perturbations will occur only after all trials are synced at the same `time_attr` every "
"`perturbation_interval`. Defaults to False. See Appendix A.1 here https://arxiv.org/pdf/1711.09846.pdf."
),
)
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("pbt_replay")
@ludwig_dataclass
class PopulationBasedTrainingReplaySchedulerConfig(BaseSchedulerConfig):
"""Population Based Training Replay scheduler settings."""
type: str = schema_utils.ProtectedString("pbt_replay")
# TODO: This should technically be a required parameter. Do we need to add support for required params?
policy_file: str = schema_utils.String(
default=None,
allow_none=True,
description=(
"The PBT policy file. Usually this is stored in `~/ray_results/experiment_name/pbt_policy_xxx.txt` where "
"`xxx` is the trial ID."
),
)
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("pb2", dependencies=[("sklearn", "scikit-learn"), ("GPy", "GPy")])
@ludwig_dataclass
class PopulationBasedBanditsSchedulerConfig(BaseSchedulerConfig):
"""Population Based Bandits (PB2) scheduler settings."""
type: str = schema_utils.ProtectedString("pb2")
time_attr: str = time_attr_alias(TIME_TOTAL_S)
perturbation_interval: float = schema_utils.NonNegativeFloat(
default=60.0,
description=(
"Models will be considered for perturbation at this interval of `time_attr`. Note that perturbation "
"incurs checkpoint overhead, so you shouldn't set this to be too frequent."
),
)
hyperparam_bounds: dict | None = schema_utils.Dict(
default=None,
description=(
"Hyperparameters to mutate. The format is as follows: for each key, enter a list of the form [min, max] "
"representing the minimum and maximum possible hyperparameter values."
),
)
quantile_fraction: float = schema_utils.FloatRange(
default=0.25,
allow_none=False,
min=0,
max=0.5,
description=(
"Parameters are transferred from the top `quantile_fraction` fraction of trials to the bottom "
"`quantile_fraction` fraction. Needs to be between 0 and 0.5. Setting it to 0 essentially implies doing "
"no exploitation at all."
),
)
log_config: bool = schema_utils.Boolean(
default=True,
description=(
"Whether to log the ray config of each model to `local_dir` at each exploit. Allows config schedule to be "
"reconstructed."
),
)
require_attrs: bool = schema_utils.Boolean(
default=True,
description=(
"Whether to require `time_attr` and metric to appear in result for every iteration. If True, error will "
"be raised if these values are not present in trial result."
),
)
synch: bool = schema_utils.Boolean(
default=False,
description=(
"If False, will use asynchronous implementation of PBT. Trial perturbations occur every "
"`perturbation_interval` for each trial independently. If True, will use synchronous implementation of "
"PBT. Perturbations will occur only after all trials are synced at the same `time_attr` every "
"`perturbation_interval`. Defaults to False. See Appendix A.1 here https://arxiv.org/pdf/1711.09846.pdf."
),
)
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("hb_bohb")
@ludwig_dataclass
class BOHBSchedulerConfig(BaseHyperbandSchedulerConfig):
"""Hyperband for BOHB (hb_bohb) scheduler settings."""
type: str = schema_utils.ProtectedString("hb_bohb")
max_t: int = max_t_alias(default=81)
reduction_factor: float = schema_utils.NonNegativeFloat(
default=3, description=("Used to set halving rate and amount. This is simply a unit-less scalar.")
)
stop_last_trials: bool = schema_utils.Boolean(
default=True, description=("Whether to terminate the trials after reaching `max_t`. Defaults to True.")
)
# TODO: Double-check support for this
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("fifo")
@ludwig_dataclass
class FIFOSchedulerConfig(BaseSchedulerConfig):
"""FIFO trial scheduler settings."""
type: str = schema_utils.ProtectedString("fifo")
# TODO: Double-check support for this as well as whether Callable args work properly
@DeveloperAPI
@hyperopt_utils.register_scheduler_config("resource_changing")
@ludwig_dataclass
class ResourceChangingSchedulerConfig(BaseSchedulerConfig):
"""Resource changing scheduler settings."""
type: str = schema_utils.ProtectedString("resource_changing")
base_scheduler: str | None | Callable = schema_utils.String(
default=None,
allow_none=True,
description=("The scheduler to provide decisions about trials. If None, a default FIFOScheduler will be used."),
)
resources_allocation_function: str | Callable = schema_utils.String(
default=None,
allow_none=True,
description=(
"The callable used to change live trial resource requirements during tuning. This callable will be called on"
" each trial as it finishes one step of training. The callable must take four arguments: `TrialRunner`, "
"current `Trial`, current result `dict` and the `ResourceChangingScheduler` calling it. The callable must "
"return a `PlacementGroupFactory`, `Resources`, `dict` or None (signifying no need for an update). If "
"`resources_allocation_function` is None, no resource requirements will be changed at any time. By "
"default, `DistributeResources` will be used, distributing available CPUs and GPUs over all running "
"trials in a robust way, without any prioritization."
),
)
@DeveloperAPI
def get_scheduler_conds():
"""Returns a JSON schema of conditionals to validate against scheduler types defined in
`ludwig.schema.hyperopt.scheduler_registry`."""
conds = []
for scheduler_config in hyperopt_utils.scheduler_config_registry:
scheduler_cls = hyperopt_utils.scheduler_config_registry[scheduler_config]
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(scheduler_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
preproc_cond = schema_utils.create_cond(
{"type": scheduler_config},
other_props,
)
conds.append(preproc_cond)
return conds
@DeveloperAPI
def SchedulerDataclassField(default={"type": "fifo"}, description="Hyperopt scheduler settings."):
"""Custom dataclass field that when used inside of a dataclass will allow any scheduler in
`ludwig.schema.hyperopt.scheduler.scheduler_registry`. Sets default scheduler to 'fifo'.
:param default: Dict specifying a scheduler with a `type` field and its associated parameters. Will attempt to use
`type` to load scheduler from registry with given params. (default: {"type": "fifo"}).
:return: Initialized dataclass field that converts untyped dicts with params to scheduler dataclass instances.
"""
class SchedulerMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field that deserializes a dict to a valid scheduler from
`ludwig.schema.hyperopt.scheduler_registry` and creates a corresponding `oneOf` JSON schema for external
usage."""
def _deserialize(self, value, attr, data, **kwargs):
if value is None:
return None
if isinstance(value, dict):
if "type" in value and value["type"] in hyperopt_utils.scheduler_config_registry:
scheduler_config_cls = hyperopt_utils.scheduler_config_registry[value["type"].lower()]
try:
return scheduler_config_cls.Schema().load(value)
except (TypeError, ConfigValidationError) as e:
raise ConfigValidationError(
f"Invalid params for scheduler: {value}, see `{scheduler_config_cls}` definition. Error: {e}"
) from e
raise ConfigValidationError(
f"Invalid params for scheduler: {value}, expect dict with at least a valid `type` attribute."
)
raise ConfigValidationError("Field should be None or dict")
def _jsonschema_type_mapping(self):
# Note that this uses the same conditional pattern as combiners:
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": list(hyperopt_utils.scheduler_config_registry.keys()),
"default": default["type"],
"description": "The type of scheduler to use during hyperopt",
},
},
"title": "scheduler_options",
"allOf": get_scheduler_conds(),
"required": ["type"],
"description": description,
}
if (
not isinstance(default, dict)
or "type" not in default
or default["type"] not in hyperopt_utils.scheduler_config_registry
):
raise ConfigValidationError(f"Invalid default: `{default}`")
try:
opt = hyperopt_utils.scheduler_config_registry[default["type"].lower()]
load_default = lambda: opt.Schema().load(default)
dump_default = opt.Schema().dump(default)
return field(
metadata={
"marshmallow_field": SchedulerMarshmallowField(
allow_none=False,
dump_default=dump_default,
load_default=load_default,
metadata={"description": description},
)
},
default_factory=load_default,
)
except Exception as e:
raise ConfigValidationError(
f"Unsupported scheduler type: {default['type']}. See scheduler_config_registry. Details: {e}"
)
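# Illustrative sketch (not Ludwig's implementation): the registry-plus-dispatch pattern that
# `SchedulerDataclassField` relies on, reduced to plain dataclasses. All names here (`scheduler_registry`,
# `register_scheduler`, `load_scheduler`, `FifoConfig`, `AsyncHyperbandConfig`) are hypothetical stand-ins, and
# marshmallow validation is replaced by ordinary constructor calls.

```python
from dataclasses import dataclass

# Maps a scheduler type name to the config class that handles it.
scheduler_registry = {}


def register_scheduler(name):
    """Class decorator that records the config class under the given type name."""

    def wrap(cls):
        scheduler_registry[name] = cls
        return cls

    return wrap


@register_scheduler("fifo")
@dataclass
class FifoConfig:
    type: str = "fifo"


# Stacking the decorator registers several aliases for the same class.
@register_scheduler("asha")
@register_scheduler("async_hyperband")
@dataclass
class AsyncHyperbandConfig:
    type: str = "async_hyperband"
    max_t: int = 100
    grace_period: int = 1
    reduction_factor: float = 4


def load_scheduler(value):
    """Turn an untyped dict with a `type` key into the matching config instance."""
    if not isinstance(value, dict) or value.get("type") not in scheduler_registry:
        raise ValueError(f"Invalid params for scheduler: {value}, expect dict with a valid `type`.")
    cls = scheduler_registry[value["type"]]
    # The canonical type name comes from the class default, so aliases are normalized.
    params = {k: v for k, v in value.items() if k != "type"}
    try:
        return cls(**params)
    except TypeError as e:
        raise ValueError(f"Invalid params for scheduler: {value}. Error: {e}") from e


if __name__ == "__main__":
    print(load_scheduler({"type": "asha", "max_t": 50}))
```

# In this sketch an unknown `type` or an unexpected parameter surfaces as a single `ValueError`, loosely
# mirroring how the field above wraps loading failures in `ConfigValidationError`.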
================================================
FILE: ludwig/schema/hyperopt/search_algorithm.py
================================================
from dataclasses import field
from importlib.util import find_spec
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.hyperopt import utils as hyperopt_utils
from ludwig.schema.utils import ludwig_dataclass
def points_to_evaluate_field(description: str | None = None):
return schema_utils.DictList(
description=description
or (
"Initial parameter suggestions to be run first. This is for when you already have some good parameters "
"you want to run first to help the algorithm make better suggestions for future parameters. Needs to be "
"a list of dicts containing the configurations."
),
)
def evaluated_rewards_field(description: str | None = None):
return schema_utils.List(
description=description
or (
"If you have previously evaluated the parameters passed in as points_to_evaluate you can avoid re-running "
"those trials by passing in the reward attributes as a list so the optimiser can be told the results "
"without needing to re-compute the trial. Must be the same length as `points_to_evaluate`."
)
)
@DeveloperAPI
@ludwig_dataclass
class BaseSearchAlgorithmConfig(schema_utils.BaseMarshmallowConfig):
"""Basic search algorithm settings."""
type: str = schema_utils.String(default="variant_generator", description="The search algorithm to use.")
def set_random_state(self, ludwig_random_state: int) -> None:
"""Overwrite the config random state.
Search algorithms refer to random state by different names, however we want to overwrite unset random states
with the Ludwig random state. This method uses a registry of random state field names to provide a single
interface across all search algorithms.
"""
rs_field = hyperopt_utils.get_search_algorithm_random_state_field(self.type)
if rs_field is not None and self.__getattribute__(rs_field) is None:
self.__setattr__(rs_field, ludwig_random_state)
def dependencies_installed(self) -> bool:
"""Some search algorithms require additional packages to be installed, check that they are available."""
missing_packages = []
missing_installs = []
for package_name, install_name in hyperopt_utils.get_search_algorithm_dependencies(self.type):
if find_spec(package_name) is None:
missing_packages.append(package_name)
missing_installs.append(install_name)
if missing_packages:
missing_packages = ", ".join(missing_packages)
missing_installs = " ".join(missing_installs)
raise ImportError(
f"Some packages needed to use hyperopt search algorithm {self.type} are not installed: "
f"{missing_packages}. To add these dependencies, run `pip install {missing_installs}`. For more "
"details, please refer to Ray Tune documentation for this search algorithm."
)
return True
@DeveloperAPI
def SearchAlgorithmDataclassField(description: str = "", default: dict = {"type": "variant_generator"}):
class SearchAlgorithmMarshmallowField(schema_utils.LudwigSchemaField):
def _deserialize(self, value, attr, data, **kwargs):
if isinstance(value, dict):
try:
return BaseSearchAlgorithmConfig.Schema().load(value)
except (TypeError, ConfigValidationError):
raise ConfigValidationError(
f"Invalid params for search algorithm: {value}, see SearchAlgorithmConfig class."
)
raise ConfigValidationError("Field should be dict")
def _jsonschema_type_mapping(self):
return {
# **schema_utils.unload_jsonschema_from_marshmallow_class(BaseSearchAlgorithmConfig),
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": list(hyperopt_utils.search_algorithm_config_registry.keys()),
"default": default["type"],
"description": "The type of search algorithm to use during hyperopt",
},
},
"title": "search_algorithm_options",
"required": ["type"],
"description": description,
}
if not isinstance(default, dict):
raise ConfigValidationError(f"Invalid default: `{default}`")
load_default = lambda: BaseSearchAlgorithmConfig.Schema().load(default)
dump_default = BaseSearchAlgorithmConfig.Schema().dump(default)
return field(
metadata={
"marshmallow_field": SearchAlgorithmMarshmallowField(
allow_none=False,
load_default=load_default,
dump_default=dump_default,
metadata={"description": description, "parameter_metadata": None},
)
},
default_factory=load_default,
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config("random", random_state_field="random_state")
@hyperopt_utils.register_search_algorithm_config("variant_generator", random_state_field="random_state")
@ludwig_dataclass
class BasicVariantSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.StringOptions(options=["random", "variant_generator"], default="random", allow_none=False)
points_to_evaluate: list[dict] | None = schema_utils.DictList(
description=(
"Initial parameter suggestions to be run first. This is for when you already have some good parameters "
"you want to run first to help the algorithm make better suggestions for future parameters. Needs to be "
"a list of dicts containing the configurations."
)
)
max_concurrent: int = schema_utils.NonNegativeInteger(
default=0, description="Maximum number of concurrently running trials. If 0 (default), no maximum is enforced."
)
constant_grid_search: bool = schema_utils.Boolean(
default=False,
description=(
"If this is set to True, Ray Tune will first try to sample random values and keep them constant over grid "
"search parameters. If this is set to False (default), Ray Tune will sample new random parameters in each "
"grid search condition."
),
)
random_state: int = schema_utils.Integer(
default=None,
allow_none=True,
description=(
"Seed or numpy random generator to use for reproducible results. If None (default), will use the global "
"numpy random generator (np.random). Please note that full reproducibility cannot be guaranteed in a "
"distributed environment."
),
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"ax", dependencies=[("ax", "ax-platform"), ("sqlalchemy", "sqlalchemy")]
)
@ludwig_dataclass
class AxSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("ax")
space: list[dict] | None = schema_utils.DictList(
description=(
"Parameters in the experiment search space. Required elements in the dictionaries are: `name` (name of "
"this parameter, string), `type` (type of the parameter: `range`, `fixed`, or `choice`, string), "
"`bounds` for range parameters (list of two values, lower bound first), `values` for choice "
"parameters (list of values), and `value` for fixed parameters (single value)."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
parameter_constraints: list | None = schema_utils.List(
description="Parameter constraints, such as `x3 >= x4` or `x3 + x4 >= 2`."
)
outcome_constraints: list | None = schema_utils.List(
description="Outcome constraints of form `metric_name >= bound`, like `m1 <= 3`."
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"bayesopt", random_state_field="random_state", dependencies=[("bayes_opt", "bayesian-optimization")]
)
@ludwig_dataclass
class BayesOptSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("bayesopt")
space: dict | None = schema_utils.Dict(
description=(
"Continuous search space. Parameters will be sampled from this space which will be used to run trials"
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
utility_kwargs: dict | None = schema_utils.Dict(
description=(
"Parameters to define the utility function. The default value is a dictionary with three keys: "
"- kind: ucb (Upper Confidence Bound) - kappa: 2.576 - xi: 0.0"
)
)
random_state: int = schema_utils.Integer(default=None, allow_none=True, description="Used to initialize BayesOpt.")
random_search_steps: int = schema_utils.Integer(
default=10,
description=(
"Number of initial random searches. This is necessary to avoid initial local overfitting of "
"the Bayesian process."
),
)
verbose: int = schema_utils.IntegerOptions(
options=[0, 1, 2], default=0, description="The level of verbosity. `0` is least verbose, `2` is most verbose."
)
patience: int = schema_utils.NonNegativeInteger(
default=5, description="Number of epochs to wait for a change in the top models."
)
skip_duplicate: bool = schema_utils.Boolean(
default=True,
description=(
"If False, the optimizer will allow duplicate points to be registered. This behavior may be desired in "
"high noise situations where repeatedly probing the same point will give different answers. In other "
"situations, the acquisition may occasionally generate a duplicate point."
),
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config("blendsearch", dependencies=[("flaml", "flaml[blendsearch]")])
@ludwig_dataclass
class BlendsearchSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("blendsearch")
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"bohb", random_state_field="seed", dependencies=[("hpbandster", "hpbandster"), ("ConfigSpace", "ConfigSpace")]
)
@ludwig_dataclass
class BOHBSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("bohb")
space: dict | None = schema_utils.Dict(
description=(
"Continuous ConfigSpace search space. Parameters will be sampled from this space which will be used "
"to run trials."
)
)
bohb_config: dict | None = schema_utils.Dict(description="Configuration for the HpBandSter BOHB algorithm.")
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
seed: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description=(
"Optional random seed to initialize the random number generator. Setting this should lead to identical "
"initial configurations at each run."
),
)
max_concurrent: int = schema_utils.Integer(
default=0,
description=(
"Number of maximum concurrent trials. If this Searcher is used in a `ConcurrencyLimiter`, the "
"`max_concurrent` value passed to it will override the value passed here. Set to <= 0 for no limit on "
"concurrency."
),
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config("cfo", dependencies=[("flaml", "flaml")])
@ludwig_dataclass
class CFOSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("cfo")
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"dragonfly", random_state_field="random_state_seed", dependencies=[("dragonfly", "dragonfly-opt")]
)
@ludwig_dataclass
class DragonflySAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("dragonfly")
optimizer: str | None = schema_utils.StringOptions(
options=["random", "bandit", "genetic"],
default=None,
allow_none=True,
description=(
"Optimizer provided from dragonfly. Choose an optimiser that extends `BlackboxOptimiser`. If this is a "
"string, `domain` must be set and `optimizer` must be one of [random, bandit, genetic]."
),
)
domain: str | None = schema_utils.StringOptions(
options=["cartesian", "euclidean"],
default=None,
allow_none=True,
description=(
"Optional domain. Should only be set if you don't pass an optimizer as the `optimizer` argument. If set, "
"has to be one of `[cartesian, euclidean]`."
),
)
space: list[dict] | None = schema_utils.DictList(
description=(
"Search space. Should only be set if you don't pass an optimizer as the `optimizer` argument. Defines the "
"search space and requires a `domain` to be set. Can be automatically converted from the `param_space` "
"dict passed to `tune.Tuner()`."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
evaluated_rewards: list | None = evaluated_rewards_field()
random_state_seed: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description=(
"Seed for reproducible results. Defaults to None. Please note that setting this to a value will change "
"global random state for `numpy` on initialization and loading from checkpoint."
),
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"hebo", random_state_field="random_state_seed", dependencies=[("hebo", "HEBO")]
)
@ludwig_dataclass
class HEBOSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("hebo")
space: list[dict] | None = schema_utils.DictList(
description="A dict mapping parameter names to Tune search spaces or a HEBO DesignSpace object."
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
evaluated_rewards: list | None = evaluated_rewards_field()
random_state_seed: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description=(
"Seed for reproducible results. Defaults to None. Please note that setting this to a value will change "
"global random state for `numpy` on initialization and loading from checkpoint."
),
)
max_concurrent: int = schema_utils.NonNegativeInteger(
default=8,
description=(
"Maximum number of concurrent trials. If this Searcher is used in a `ConcurrencyLimiter`, the "
"`max_concurrent` value passed to it will override the value passed here."
),
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"hyperopt", random_state_field="random_state_seed", dependencies=[("hyperopt", "hyperopt")]
)
@ludwig_dataclass
class HyperoptSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("hyperopt")
space: list[dict] | None = schema_utils.DictList(
description=(
"HyperOpt configuration. Parameters will be sampled from this configuration and will be used to override "
"parameters generated in the variant generation process."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
n_initial_points: int = schema_utils.PositiveInteger(
default=20,
description=(
"The number of random evaluations of the objective function before starting to approximate it with tree "
"parzen estimators. Defaults to 20."
),
)
random_state_seed: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description=("Seed for reproducible results. Defaults to None."),
)
gamma: float = schema_utils.FloatRange(
min=0.0,
max=1.0,
default=0.25,
description=(
"The split to use in TPE. TPE models two splits of the evaluated hyperparameters: the top performing "
"`gamma` percent, and the remaining examples. For more details, see [Making a Science of Model Search: "
"Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures.]"
"(http://proceedings.mlr.press/v28/bergstra13.pdf)."
),
)
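The `gamma` split described above can be illustrated with a small sketch. This is a simplified illustration only, not the rule the `hyperopt` library applies internally: it assumes lower loss is better and takes a plain `ceil(gamma * n)` cut. The helper name `split_trials` is hypothetical.

```python
import math


def split_trials(losses: list[float], gamma: float = 0.25) -> tuple[list[int], list[int]]:
    """Return (good, rest) trial indices; `good` holds the lowest-loss `gamma` fraction.

    Hypothetical helper: a simplified view of the TPE split, assuming lower loss is better.
    """
    if not losses:
        return [], []
    # Rank trial indices from best (lowest loss) to worst.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    # Keep at least one trial in the "good" group.
    n_good = max(1, math.ceil(gamma * len(losses)))
    return order[:n_good], order[n_good:]


good, rest = split_trials([0.9, 0.2, 0.5, 0.7, 0.1, 0.4, 0.8, 0.3], gamma=0.25)
print(good, rest)
```

TPE then fits one density model to the `good` group and another to the `rest`, and proposes candidates that are likely under the first and unlikely under the second.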
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config("nevergrad", dependencies=[("nevergrad", "nevergrad")])
@ludwig_dataclass
class NevergradSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("nevergrad")
# TODO: Add a registry mapping string names to nevergrad optimizers
# optimizer: Optional[str] = None
# TODO: Add schemas for nevergrad optimizer kwargs
optimizer_kwargs: dict | None = schema_utils.Dict(description="Kwargs passed in when instantiating the optimizer.")
space: list[dict] | None = schema_utils.DictList(
description=(
"Nevergrad parametrization to be passed to optimizer on instantiation, or list of parameter names if you "
"passed an optimizer object."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config(
"optuna", random_state_field="seed", dependencies=[("optuna", "optuna")]
)
@ludwig_dataclass
class OptunaSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("optuna")
space: dict | None = schema_utils.Dict(
description=(
"Hyperparameter search space definition for Optuna's sampler. This can be either a dict with parameter "
"names as keys and optuna.distributions as values, or a Callable - in which case, it should be a "
"define-by-run function using optuna.trial to obtain the hyperparameter values. The function should "
"return either a dict of constant values with names as keys, or None. For more information, see "
"[the Optuna docs]"
"(https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html)."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
# TODO: Add a registry of Optuna samplers schemas
# sampler = None
seed: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description=(
"Seed to initialize sampler with. This parameter is only used when `sampler=None`. In all other cases, "
"the sampler you pass should be initialized with the seed already."
),
)
evaluated_rewards: list | None = evaluated_rewards_field()
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config("skopt", dependencies=[("skopt", "scikit-optimize")])
@ludwig_dataclass
class SkoptSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("skopt")
optimizer: Any | None = None
space: dict | None = schema_utils.Dict(
description=(
"A dict mapping parameter names to valid parameters, i.e. tuples for numerical parameters and lists "
"for categorical parameters. If you passed an optimizer instance as the optimizer argument, this should "
"be a list of parameter names instead."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
evaluated_rewards: list | None = evaluated_rewards_field(
description=(
"If you have previously evaluated the parameters passed in as points_to_evaluate you can avoid "
"re-running those trials by passing in the reward attributes as a list so the optimiser can be told the "
"results without needing to re-compute the trial. Must be the same length as points_to_evaluate. (See "
"tune/examples/skopt_example.py)"
)
)
convert_to_python: bool = schema_utils.Boolean(
default=True,
description="SkOpt outputs numpy primitives (e.g. `np.int64`) instead of Python types. If this setting is set "
"to `True`, the values will be converted to Python primitives.",
)
@DeveloperAPI
@hyperopt_utils.register_search_algorithm_config("zoopt", dependencies=[("zoopt", "zoopt")])
@ludwig_dataclass
class ZooptSAConfig(BaseSearchAlgorithmConfig):
type: str = schema_utils.ProtectedString("zoopt")
algo: str = schema_utils.ProtectedString(
pstring="asracos",
description="The zoopt algorithm to use. Only ASRacos is currently supported.",
)
budget: int | None = schema_utils.PositiveInteger(
default=None, allow_none=True, description="Optional. Number of samples."
)
dim_dict: dict | None = schema_utils.Dict(
description=(
"Dimension dictionary. For continuous dimensions: (continuous, search_range, precision); for discrete "
"dimensions: (discrete, search_range, has_order); for grid dimensions: (grid, grid_list). More details "
"can be found in the zoopt package."
)
)
points_to_evaluate: list[dict] | None = points_to_evaluate_field()
parallel_num: int = schema_utils.PositiveInteger(
default=1,
description=(
"Number of parallel workers. Note that the initial phase may start fewer workers than this number. More "
"details can be found in the zoopt package."
),
)
================================================
FILE: ludwig/schema/hyperopt/utils.py
================================================
from collections.abc import Callable
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.registry import Registry
parameter_config_registry = Registry()
scheduler_config_registry = Registry()
scheduler_dependencies_registry = Registry()
search_algorithm_config_registry = Registry()
search_algorithm_dependencies_registry = Registry()
search_algorithm_random_state_field_registry = Registry()
@DeveloperAPI
def get_parameter_cls(name: str) -> type["BaseParameterConfig"]: # noqa: F821
"""Get a registered hyperopt parameter config class by name.
Args:
name: the name of a parameter config class registered in `ludwig.schema.hyperopt.parameter`
Returns:
A parameter config class from `ludwig.schema.hyperopt.parameter`
"""
return parameter_config_registry[name]
@DeveloperAPI
def get_scheduler_cls(name: str) -> type["BaseSchedulerConfig"]: # noqa: F821
"""Get a registered hyperopt scheduler config class by name.
Args:
name: the name of a scheduler config class registered in `ludwig.schema.hyperopt.scheduler`
Returns:
A scheduler config class from `ludwig.schema.hyperopt.scheduler`
"""
return scheduler_config_registry[name]
@DeveloperAPI
def get_scheduler_dependencies(name: str) -> list[str]:
"""Get the list of dependencies for a registered hyperopt scheduler.
Args:
name: the name of a scheduler config class registered in `ludwig.schema.hyperopt.scheduler`
Returns:
The list of imports needed to use the scheduler
"""
return scheduler_dependencies_registry[name]
@DeveloperAPI
def get_search_algorithm_cls(name: str) -> type["BaseSearchAlgorithmConfig"]: # noqa: F821
"""Get a registered hyperopt search algorithm config class by name.
Args:
name: the name of a search algorithm config class registered in `ludwig.schema.hyperopt.search_algorithm`
Returns:
A search algorithm config class from `ludwig.schema.hyperopt.search_algorithm`
"""
return search_algorithm_config_registry[name]
@DeveloperAPI
def get_search_algorithm_dependencies(name: str) -> list[str]:
"""Get the list of dependencies for a registered hyperopt search algorithm.
Args:
name: the name of a search algorithm config class registered in `ludwig.schema.hyperopt.search_algorithm`
Returns:
The list of imports needed to use the search algorithm
"""
return search_algorithm_dependencies_registry[name]
@DeveloperAPI
def get_search_algorithm_random_state_field(name: str):
"""Get the field name of the random state for a registered hyperopt search algorithm.
Args:
name: the name of a search algorithm config class registered in `ludwig.schema.hyperopt.search_algorithm`
Returns:
The name of the random state field in the config
"""
return search_algorithm_random_state_field_registry[name]
@DeveloperAPI
def register_parameter_config(name: str) -> Callable:
"""Register a parameter config class by name.
Args:
name: the name to register the parameter class under, does not need to correspond to the value of `space`
Returns:
Wrapper function to decorate a `BaseParameterConfig` subclass
"""
def wrap(cls: type["BaseParameterConfig"]) -> type["BaseParameterConfig"]: # noqa: F821
"""Add a parameter config class to the registry.
Args:
cls: a subclass of `BaseParameterConfig`
Returns:
`cls` unaltered
"""
parameter_config_registry[name] = cls
return cls
return wrap
@DeveloperAPI
def register_scheduler_config(name: str, dependencies: list[tuple[str, str]] | None = None) -> Callable:
"""Register a scheduler config class by name.
Args:
name: the name to register the scheduler class under, does not need to correspond to the value of `type`
dependencies: the list of scheduler dependency package name/install name pairs, e.g.
`("sklearn", "scikit-learn")`
Returns:
Wrapper function to decorate a `BaseSchedulerConfig` subclass
"""
def wrap(scheduler_config: type["BaseSchedulerConfig"]) -> type["BaseSchedulerConfig"]: # noqa: F821
"""Add a scheduler config class to the registry.
Args:
scheduler_config: a subclass of `BaseSchedulerConfig`
Returns:
`scheduler_config` unaltered
"""
scheduler_config_registry[name] = scheduler_config
scheduler_dependencies_registry[name] = dependencies if dependencies is not None else []
return scheduler_config
return wrap
# TODO: create a search alg metadata class to register in place of individual metadata args
@DeveloperAPI
def register_search_algorithm_config(
name: str, random_state_field: str | None = None, dependencies: list[tuple[str, str]] | None = None
) -> Callable:
"""Register a search algorithm config class by name.
Args:
name: the name to register the search algorithm class under, does not need to correspond to the value of `type`
random_state_field: the name of the random state in this search algorithm
dependencies: the list of search algorithm dependency package name/install name pairs, e.g.
`("sklearn", "scikit-learn")`
Returns:
Wrapper function to decorate a `BaseSearchAlgorithmConfig` subclass
"""
def wrap(cls: type["BaseSearchAlgorithmConfig"]) -> type["BaseSearchAlgorithmConfig"]: # noqa: F821
search_algorithm_config_registry[name] = cls
search_algorithm_dependencies_registry[name] = dependencies if dependencies is not None else []
search_algorithm_random_state_field_registry[name] = random_state_field
return cls
return wrap
================================================
FILE: ludwig/schema/jsonschema.py
================================================
"""JSON Schema generation for Ludwig config classes.
Uses pydantic's model_json_schema() under the hood, replacing the previous marshmallow-based converter.
"""
def marshmallow_schema_to_jsonschema_dict(schema_instance):
"""Backward-compatible JSON schema generation.
Previously converted marshmallow schemas. Now uses pydantic's model_json_schema().
The schema_instance can be either:
- A pydantic model class (BaseMarshmallowConfig subclass)
- A _SchemaAdapter instance
- Legacy: called with a marshmallow Schema instance (raises helpful error)
"""
from ludwig.schema.utils import _SchemaAdapter, BaseMarshmallowConfig
# Handle _SchemaAdapter
if isinstance(schema_instance, _SchemaAdapter):
cls = schema_instance._cls
elif isinstance(schema_instance, type) and issubclass(schema_instance, BaseMarshmallowConfig):
cls = schema_instance
elif isinstance(schema_instance, BaseMarshmallowConfig):
cls = type(schema_instance)
else:
raise TypeError(
f"Expected a Ludwig config class or schema adapter, got {type(schema_instance)}. "
"Marshmallow schemas are no longer supported. Use pydantic BaseModel subclasses."
)
schema_dict = cls.model_json_schema()
name = cls.__name__
# Wrap in definitions format for backward compat
return {
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {name: schema_dict},
"$ref": f"#/definitions/{name}",
}
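The backward-compatible envelope returned above places the model's schema under `definitions` and points a top-level `$ref` at it. The sketch below shows that wrapping on a hand-written schema dict, with no pydantic involved. The helpers `wrap_in_definitions` and `resolve_ref` and the `ToyConfig` schema are hypothetical and exist only to show that the `$ref` resolves back to the wrapped schema.

```python
def wrap_in_definitions(name: str, schema_dict: dict) -> dict:
    """Wrap a schema in the draft-07 `definitions` + `$ref` envelope (hypothetical helper)."""
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "definitions": {name: schema_dict},
        "$ref": f"#/definitions/{name}",
    }


def resolve_ref(document: dict) -> dict:
    """Follow the document's local `#/a/b` reference (hypothetical helper, local refs only)."""
    node = document
    for part in document["$ref"].removeprefix("#/").split("/"):
        node = node[part]
    return node


# Hand-written stand-in for what `cls.model_json_schema()` would return.
toy_schema = {"type": "object", "properties": {"seed": {"type": "integer"}}}
wrapped = wrap_in_definitions("ToyConfig", toy_schema)
print(wrapped["$ref"])
```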
================================================
FILE: ludwig/schema/llms/__init__.py
================================================
================================================
FILE: ludwig/schema/llms/base_model.py
================================================
import logging
import os
from dataclasses import field
from transformers import AutoConfig
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BASE_MODEL
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import LLM_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json
logger = logging.getLogger(__name__)
# Maps a preset LLM name to the full slash-delimited HF path. If the user chooses a preset LLM, the preset LLM name is
# replaced with the full slash-delimited HF path using this map, after JSON validation but before config object
# initialization.
MODEL_PRESETS = {
# Bloom
"bloomz-3b": "bigscience/bloomz-3b",
"bloomz-7b1": "bigscience/bloomz-7b1",
# CodeLlama
"codellama-7b": "codellama/CodeLlama-7b-hf",
"codellama-13b": "codellama/CodeLlama-13b-hf",
"codellama-34b": "codellama/CodeLlama-34b-hf",
"codellama-7b-instruct": "codellama/CodeLlama-7b-instruct-hf",
"codellama-13b-instruct": "codellama/CodeLlama-13b-instruct-hf",
"codellama-34b-instruct": "codellama/CodeLlama-34b-instruct-hf",
# GPT Neo and GPT J
"gpt-neo-2.7B": "EleutherAI/gpt-neo-2.7B",
"gpt-j-6b": "EleutherAI/gpt-j-6b",
# LLama-2
"llama-2-7b": "meta-llama/Llama-2-7b-hf",
"llama-2-13b": "meta-llama/Llama-2-13b-hf",
"llama-2-70b": "meta-llama/Llama-2-70b-hf",
"llama-2-7b-chat": "meta-llama/Llama-2-7b-chat-hf",
"llama-2-13b-chat": "meta-llama/Llama-2-13b-chat-hf",
"llama-2-70b-chat": "meta-llama/Llama-2-70b-chat-hf",
# Mistral
"mistral-7b": "mistralai/Mistral-7B-v0.1",
"mistral-7b-instruct": "mistralai/Mistral-7B-Instruct-v0.1",
# Mixtral
"mixtral-8x7b": "mistralai/Mixtral-8x7B-v0.1",
"mixtral-8x7b-instruct": "mistralai/Mixtral-8x7B-Instruct-v0.1",
# OPT
"opt-350m": "facebook/opt-350m",
"opt-1.3b": "facebook/opt-1.3b",
"opt-6.7b": "facebook/opt-6.7b",
# Pythia
"pythia-2.8b": "EleutherAI/pythia-2.8b",
"pythia-12b": "EleutherAI/pythia-12b",
# Vicuna
"vicuna-7b": "lmsys/vicuna-7b-v1.3",
"vicuna-13b": "lmsys/vicuna-13b-v1.3",
# Zephyr
"zephyr-7b-alpha": "HuggingFaceH4/zephyr-7b-alpha",
"zephyr-7b-beta": "HuggingFaceH4/zephyr-7b-beta",
# Phi
"phi-1": "microsoft/phi-1",
"phi-1_5": "microsoft/phi-1_5",
"phi-2": "microsoft/phi-2",
}
@DeveloperAPI
def BaseModelDataclassField():
description = (
"Base pretrained model to use. This can be one of the presets defined by Ludwig, a fully qualified "
"name of a pretrained model from the HuggingFace Hub, or a path to a directory containing a "
"pretrained model."
)
def validate(model_name: str):
"""Validates and upgrades the given model name to its full path, if applicable.
If the name exists in `MODEL_PRESETS`, returns the corresponding value from the dict; otherwise checks if the
given name (which should be a full path) exists locally or in the transformers library.
"""
if isinstance(model_name, str):
if model_name in MODEL_PRESETS:
return MODEL_PRESETS[model_name]
if os.path.isdir(model_name):
return model_name
try:
AutoConfig.from_pretrained(model_name, trust_remote_code=True)
return model_name
except OSError:
raise ConfigValidationError(
f"Specified base model `{model_name}` could not be loaded. If this is a private repository, make "
f"sure to set HUGGING_FACE_HUB_TOKEN in your environment. Check that {model_name} is a valid "
"pretrained CausalLM listed on huggingface or a valid local directory containing the weights for a "
"pretrained CausalLM from huggingface. See: "
"https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads for a full list."
)
raise ConfigValidationError(
f"`base_model` should be a string, instead given: {model_name}. This can be a preset or any pretrained "
"CausalLM on huggingface. See: https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads"
)
class BaseModelField(schema_utils.LudwigSchemaField):
def _serialize(self, value, attr, obj, **kwargs):
if isinstance(value, str):
return value
raise ConfigValidationError(f"Value to serialize is not a string: {value}")
def _deserialize(self, value, attr, obj, **kwargs):
return validate(value)
def _jsonschema_type_mapping(self):
return {
"anyOf": [
{
"type": "string",
"enum": list(MODEL_PRESETS.keys()),
"description": (
"Pick from a set of popular LLMs of different sizes across a variety of architecture types."
),
"title": "preset",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA[BASE_MODEL]["_anyOf"]["preset"]),
},
{
"type": "string",
"description": "Enter the full path to a huggingface LLM.",
"title": "custom",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA[BASE_MODEL]["_anyOf"]["custom"]),
},
],
"description": description,
"title": "base_model_options",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA[BASE_MODEL]["_meta"]),
}
return field(
metadata={
"marshmallow_field": BaseModelField(
required=True,
allow_none=False,
validate=validate,
metadata={ # TODO: extra metadata dict probably unnecessary, but currently a widespread pattern
"description": description,
"parameter_metadata": convert_metadata_to_json(LLM_METADATA[BASE_MODEL]["_meta"]),
},
),
},
# TODO: This is an unfortunate side-effect of dataclass init order - you cannot have non-default fields follow
# default fields, so we have to give `base_model` a fake default of `None`.
default=None,
)
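The preset lookup performed by `validate` can be sketched without touching the network. This sketch assumes a two-entry subset of `MODEL_PRESETS` and omits the `AutoConfig.from_pretrained` check and the local-directory check, so any non-preset string passes through unchanged. `resolve_base_model` is a hypothetical name, and `TypeError` stands in for `ConfigValidationError`.

```python
# Assumption: a small subset of MODEL_PRESETS, copied from the mapping above.
PRESETS = {
    "opt-350m": "facebook/opt-350m",
    "phi-2": "microsoft/phi-2",
}


def resolve_base_model(model_name: object) -> str:
    """Map a preset name to its full Hugging Face path (hypothetical helper)."""
    if not isinstance(model_name, str):
        # The real validator raises ConfigValidationError here.
        raise TypeError(f"`base_model` should be a string, instead given: {model_name}")
    if model_name in PRESETS:
        return PRESETS[model_name]
    # Local directories and full hub ids are returned as given; the real validator
    # additionally verifies that the model can be loaded.
    return model_name


print(resolve_base_model("opt-350m"))
print(resolve_base_model("my-org/my-model"))
```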
================================================
FILE: ludwig/schema/llms/generation.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import LLM_METADATA
@DeveloperAPI
@schema_utils.ludwig_dataclass
class LLMGenerationConfig(schema_utils.BaseMarshmallowConfig):
"""Parameters for LLM Generation Config.
Should match the parameters in
https://huggingface.co/docs/transformers/v4.28.0/en/main_classes/text_generation#transformers.GenerationConfig
"""
# Parameters that control the length of the output
max_new_tokens: int | None = schema_utils.PositiveInteger(
default=32,
allow_none=True,
description="The maximum number of new tokens to generate, ignoring the number of tokens in the input prompt. "
"If not set, this is dynamically determined by Ludwig based on either the `max_sequence_length` of the output "
"feature, the global_max_sequence_length specified in preprocessing (if specified), or the "
"maximum context length supported by the model (in the order specified).",
parameter_metadata=LLM_METADATA["generation"]["max_new_tokens"],
)
min_new_tokens: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="The minimum number of new tokens to generate, ignoring the number of tokens in the input prompt.",
parameter_metadata=LLM_METADATA["generation"]["min_new_tokens"],
)
max_length: int = schema_utils.PositiveInteger(
default=32,
allow_none=True,
description="The maximum length the generated tokens can have. Corresponds to the length of the input prompt "
"+ max_new_tokens. Its effect is overridden by max_new_tokens, if also set.",
parameter_metadata=LLM_METADATA["generation"]["max_length"],
)
min_length: int = schema_utils.NonNegativeInteger(
default=0,
allow_none=True,
description="The minimum length of the sequence to be generated. Corresponds to the length of the "
"input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.",
parameter_metadata=LLM_METADATA["generation"]["min_length"],
)
early_stopping: bool | str | None = schema_utils.Boolean(
default=False,
description="Controls the stopping condition for beam-based methods, like beam search. It accepts the following"
" values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a "
"heuristic is applied and the generation stops when it is very unlikely to find better candidates; `never`, "
"where the beam search procedure only stops when there cannot be better candidates (canonical beam search "
"algorithm).",
)
max_time: float | None = schema_utils.FloatRange(
default=None,
min=None,
max=None,
allow_none=True,
description="The maximum amount of time, in seconds, that you allow the computation to run for. Generation will "
"still finish the current pass after the allocated time has passed.",
)
# Parameters that control the generation strategy used
do_sample: bool | None = schema_utils.Boolean(
default=True,
description="Whether or not to use sampling; use greedy decoding otherwise.",
parameter_metadata=LLM_METADATA["generation"]["do_sample"],
)
num_beams: int | None = schema_utils.PositiveInteger(
default=1,
allow_none=True,
description="Number of beams for beam search. 1 means no beam search and is the default value."
" The beam search strategy generates the translation word by word from left-to-right while keeping a fixed"
" number (beam) of active candidates at each time step during token generation. By increasing the beam size,"
" the translation performance can increase at the expense of significantly reducing the decoder speed.",
parameter_metadata=LLM_METADATA["generation"]["num_beams"],
)
num_beam_groups: int | None = schema_utils.PositiveInteger(
default=1,
allow_none=True,
description="Number of groups to divide num_beams into in order to ensure diversity among different groups of "
"beams. 1 means no group beam search.",
)
penalty_alpha: float | None = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="The values balance the model confidence and the degeneration penalty in contrastive "
"search decoding.",
)
use_cache: bool | None = schema_utils.Boolean(
default=True,
description="Whether or not the model should use the past last key/values attentions (if applicable to the "
"model) to speed up decoding.",
parameter_metadata=LLM_METADATA["generation"]["use_cache"],
)
prompt_lookup_num_tokens: int | None = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="The number of tokens to consider as a candidate from the prompt for prompt lookup decoding, "
"an alternate way of performing assisted generation. If set to 0, the prompt lookup decoding is not used.",
parameter_metadata=LLM_METADATA["generation"]["prompt_lookup_num_tokens"],
)
# Parameters for manipulation of the model output logits
temperature: float | None = schema_utils.NonNegativeFloat(
default=0.1,
allow_none=True,
description="Temperature is used to control the randomness of predictions."
" A high temperature value (closer to 1) makes the output more diverse and random, while a lower temperature"
" (closer to 0) makes the model's responses more deterministic and focused on the most likely outcome."
" In other words, temperature adjusts the probability distribution from which the model picks the next token.",
parameter_metadata=LLM_METADATA["generation"]["temperature"],
)
top_k: int | None = schema_utils.PositiveInteger(
default=50,
allow_none=True,
description="The number of highest probability vocabulary tokens to keep for top-k-filtering.",
parameter_metadata=LLM_METADATA["generation"]["top_k"],
)
top_p: float | None = schema_utils.FloatRange(
default=1.0,
min=0.0,
max=1.0,
allow_none=True,
description="If set to float < 1, only the smallest set of most probable tokens with probabilities that add "
"up to top_p or higher are kept for generation.",
parameter_metadata=LLM_METADATA["generation"]["top_p"],
)
typical_p: float | None = schema_utils.FloatRange(
default=1.0,
min=0.0,
max=1.0,
allow_none=True,
description="Local typicality measures how similar the conditional probability of predicting a target token "
"next is to the expected conditional probability of predicting a random token next, given the partial text "
"already generated. If set to float < 1, the smallest set of the most locally typical tokens with "
"probabilities that add up to typical_p or higher are kept for generation.",
)
epsilon_cutoff: float | None = schema_utils.FloatRange(
default=0.0,
min=0.0,
max=1.0,
allow_none=True,
description="If set to float strictly between 0 and 1, only tokens with a conditional probability greater "
"than epsilon_cutoff will be sampled. In the paper, suggested values range from 3e-4 to 9e-4, depending on the"
" size of the model.",
)
eta_cutoff: float | None = schema_utils.FloatRange(
default=0.0,
min=0.0,
max=1.0,
allow_none=True,
description="Eta sampling is a hybrid of locally typical sampling and epsilon sampling. If set to float "
"strictly between 0 and 1, a token is only considered if it is greater than either eta_cutoff or "
"sqrt(eta_cutoff) * exp(-entropy(softmax(next_token_logits))). The latter term is intuitively the expected next"
" token probability, scaled by sqrt(eta_cutoff). In the paper, suggested values range from 3e-4 to 2e-3, "
"depending on the size of the model.",
)
diversity_penalty: float | None = schema_utils.NonNegativeFloat(
default=0.0,
allow_none=True,
description="The value used to control the diversity of the generated text. The higher the value, the more "
"diverse the text will be. If set to 0, no diversity is enforced. "
"This value is subtracted from a beam's score if it generates the same token as any beam from another group at a "
"particular time. Note that diversity_penalty is only effective if group beam search is enabled.",
)
repetition_penalty: float | None = schema_utils.NonNegativeFloat(
default=1.0,
allow_none=True,
description="The parameter for repetition penalty. 1.0 means no penalty. "
"See [this paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.",
)
encoder_repetition_penalty: float | None = schema_utils.NonNegativeFloat(
default=1.0,
allow_none=True,
description="The parameter for encoder_repetition_penalty. An exponential penalty on sequences that are not"
" in the original input. 1.0 means no penalty.",
)
length_penalty: float | None = schema_utils.Float(
default=1.0,
allow_none=True,
description="Exponential penalty to the length that is used with beam-based generation. It is applied as an "
"exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is"
" the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while "
"length_penalty < 0.0 encourages shorter sequences.",
)
no_repeat_ngram_size: int | None = schema_utils.NonNegativeInteger(
default=0,
allow_none=True,
description="If set to int > 0, all ngrams of that size can only occur once.",
)
bad_words_ids: list[list[int]] | None = schema_utils.List(
default=None,
allow_none=True,
description="List of token ids that are not allowed to be generated. In order to get the tokens of the words "
"that should not appear in the generated text, use tokenizer(bad_word, add_prefix_space=True).input_ids.",
)
force_words_ids: list[list[int]] | None = schema_utils.List(
default=None,
allow_none=True,
description="List of token ids that are forced to be generated by the model. In order to get the tokens of the"
" words that should appear in the generated text, use tokenizer(force_word, add_prefix_space=True).input_ids.",
)
renormalize_logits: bool | None = schema_utils.Boolean(
default=False,
description="Whether to renormalize the logits after temperature and top_k/top_p filtering.",
)
# TODO(This needs to be defined based on the Constraint class)
# constraints:
forced_bos_token_id: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description="The id of the token to force as the first generated token after the decoder_start_token_id. "
"Useful for multilingual models like mBART where the first generated token needs to be the target language "
"token.",
)
forced_eos_token_id: int | list[int] | None = schema_utils.Integer(
default=None,
allow_none=True,
description="The id of the token to force as the last generated token when max_length is reached. Optionally, "
"use a list to set multiple end-of-sequence tokens.",
)
remove_invalid_values: bool | None = schema_utils.Boolean(
default=False,
description="Whether to remove possible nan and inf outputs of the model to prevent the generation method from "
"crashing. Note that using remove_invalid_values can slow down generation.",
)
exponential_decay_length_penalty: tuple[int, float] | None = schema_utils.FloatRange(
default=None,
min=0.0,
max=1.0,
allow_none=True,
description="This Tuple adds an exponentially increasing length penalty, after a certain amount of tokens have "
"been generated. The tuple shall consist of: (start_index, decay_factor) where start_index indicates where "
"penalty starts and decay_factor represents the factor of exponential decay",
)
suppress_tokens: list[int] | None = schema_utils.List(
list_type=int,
default=None,
allow_none=True,
description="A list of tokens that will be suppressed at generation. The SuppressTokens logit processor will set"
" their log probs to -inf so that they are not sampled.",
)
begin_suppress_tokens: list[int] | None = schema_utils.List(
list_type=int,
default=None,
allow_none=True,
description="A list of tokens that will be suppressed at the beginning of the generation. The "
"SuppressBeginTokens logit processor will set their log probs to -inf so that they are not sampled.",
)
forced_decoder_ids: list[list[int]] | None = schema_utils.List(
default=None,
allow_none=True,
description="A list of forced decoder ids. The ForcedDecoderIds logit processor will set the log probs of all "
"tokens that are not in the list to -inf so that they are not sampled.",
)
sequence_bias: dict[tuple[int], float] | None = schema_utils.Dict(
default=None,
allow_none=True,
description="A dictionary that maps sequences of token ids to bias terms. The SequenceBias logit processor will "
"add the bias to the log probs of the tokens in the dictionary. Positive biases increase the odds of the "
"sequence being selected, while negative biases do the opposite.",
)
guidance_scale: float | None = schema_utils.FloatRange(
default=None,
min=0.0,
allow_none=True,
description="The guidance scale for classifier free guidance (CFG). CFG is enabled by setting guidance_scale >"
" 1. Higher guidance scale encourages the model to generate samples that are more closely linked to the input"
" prompt, usually at the expense of poorer quality.",
)
# Special tokens that can be used at generation time
pad_token_id: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description="The id of the padding token. If not set, the padding token id of the tokenizer is used.",
)
bos_token_id: int | None = schema_utils.Integer(
default=None,
allow_none=True,
description="The id of the beginning of sentence token. If not set, the bos token id of the tokenizer is used.",
)
eos_token_id: int | list[int] | None = schema_utils.Integer(
default=None,
allow_none=True,
description="The id of the end of sentence token. If not set, the eos token id of the tokenizer is used.",
)
@DeveloperAPI
class LLMGenerationConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(LLMGenerationConfig)
def _jsonschema_type_mapping(self):
return schema_utils.unload_jsonschema_from_marshmallow_class(LLMGenerationConfig)
================================================
FILE: ludwig/schema/llms/model_parameters.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class RoPEScalingConfig(schema_utils.BaseMarshmallowConfig):
"""Dynamic RoPE-scaling (rotary position embeddings) to extend the context length of LLMs like LLaMA, GPT-NeoX,
or Falcon.
This parameter is a dictionary containing the scaling configuration for the RoPE embeddings. Currently supports
two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected
format is {'rope_type': strategy name, 'factor': scaling factor}
"""
def __post_init__(self):
# Both parameters must be set, or none.
if not self.rope_type:
raise ConfigValidationError(
f"`rope_scaling`'s `rope_type` field must be one of ['linear', 'dynamic'], got {self.rope_type}"
)
if not self.factor:
raise ConfigValidationError(
f"When using `rope_scaling`, `factor` must be specified and be > 1. Got {self.factor}."
)
rope_type: str | None = schema_utils.StringOptions(
options=["linear", "dynamic"],
default=None,
allow_none=True,
description="Currently supports two strategies: linear and dynamic scaling.",
)
factor: float | None = schema_utils.FloatRange(
default=None,
allow_none=True,
min=1.0,
min_inclusive=False,
description="The scaling factor for RoPE embeddings.",
)
@DeveloperAPI
class RoPEScalingConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(RoPEScalingConfig, default_missing=True)
def _jsonschema_type_mapping(self):
return schema_utils.unload_jsonschema_from_marshmallow_class(RoPEScalingConfig, title="rope_scaling")
@DeveloperAPI
@ludwig_dataclass
class ModelParametersConfig(schema_utils.BaseMarshmallowConfig):
rope_scaling: RoPEScalingConfig = RoPEScalingConfigField().get_default_field()
neftune_noise_alpha: int | None = schema_utils.IntegerRange(
default=0,
min=0,
allow_none=True,
description="The alpha parameter for the embedding noise, which controls the amount of noise added to the "
"embeddings. The higher the value, the more noise is added. This is based on the paper NEFTune: Noisy "
"Embeddings Improve Instruction Finetuning. Paper: https://arxiv.org/pdf/2310.05914.pdf. Default: 0. "
"Suggested values: 5, 10",
)
def to_dict(self):
config = {}
if self.rope_scaling:
config["rope_scaling"] = self.rope_scaling.to_dict()
if self.neftune_noise_alpha:
config["neftune_noise_alpha"] = self.neftune_noise_alpha
return config
@DeveloperAPI
class ModelParametersConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(ModelParametersConfig, default_missing=True)
def _jsonschema_type_mapping(self):
return {
"oneOf": [
{"type": "null", "title": "disabled", "description": "Skip configurable model parameters."},
{
**schema_utils.unload_jsonschema_from_marshmallow_class(ModelParametersConfig),
"title": "enabled",
"description": "Set model parameters options.",
},
],
"title": "Model Parameters",
"description": "Configurable model parameters for LLMs.",
}
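# Illustrative note (not part of the Ludwig source): a minimal, standard-library-only sketch of the
# constraints that RoPEScalingConfig expresses above. `validate_rope_scaling` and `SUPPORTED_ROPE_TYPES`
# are hypothetical names, and ValueError stands in for ConfigValidationError so the snippet runs on its own.

```python
from __future__ import annotations

SUPPORTED_ROPE_TYPES = ("linear", "dynamic")


def validate_rope_scaling(rope_type: str | None, factor: float | None) -> dict[str, object]:
    """Hypothetical stand-in for the RoPE scaling checks; raises ValueError on bad input."""
    if rope_type not in SUPPORTED_ROPE_TYPES:
        raise ValueError(f"rope_type must be one of {list(SUPPORTED_ROPE_TYPES)}, got {rope_type!r}")
    # The schema declares min=1.0 with min_inclusive=False, i.e. factor must be strictly greater than 1.
    if factor is None or factor <= 1.0:
        raise ValueError(f"factor must be specified and be > 1, got {factor!r}")
    return {"rope_type": rope_type, "factor": float(factor)}


# Both fields set and valid: the expected {'rope_type': ..., 'factor': ...} format is returned.
print(validate_rope_scaling("dynamic", 2.0))
```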
================================================
FILE: ludwig/schema/llms/peft.py
================================================
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import LLM_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.registry import Registry
if TYPE_CHECKING:
from peft import PeftConfig
adapter_registry = Registry()
@DeveloperAPI
def register_adapter(name: str):
def wrap(config: BaseAdapterConfig):
adapter_registry[name] = config
return config
return wrap
@DeveloperAPI
@ludwig_dataclass
class LoraPostprocessorConfig(schema_utils.BaseMarshmallowConfig):
"""This Dataclass is a schema for the nested postprocessing config under adapter of type "lora"."""
merge_adapter_into_base_model: bool = schema_utils.Boolean(
default=False,
description="""Instructs whether or not the fine-tuned LoRA weights are to be merged into the base LLM model so
that the complete fine-tuned model is available to be used and/or persisted, and then reused upon loading as a single
model (rather than having to load base and fine-tuned models separately).""",
)
progressbar: bool = schema_utils.Boolean(
default=False,
description="Instructs whether or not to show a progress bar indicating the unload and merge process.",
)
@DeveloperAPI
class LoraPostprocessorConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(LoraPostprocessorConfig)
def _jsonschema_type_mapping(self):
return schema_utils.unload_jsonschema_from_marshmallow_class(LoraPostprocessorConfig, title="LoraPostprocessor")
@DeveloperAPI
@ludwig_dataclass
class BaseAdapterConfig(schema_utils.BaseMarshmallowConfig, ABC):
type: str
pretrained_adapter_weights: str | None = schema_utils.String(
default=None, description="Path to pretrained weights.", allow_none=True
)
postprocessor: LoraPostprocessorConfig = LoraPostprocessorConfigField().get_default_field()
@abstractmethod
def to_config(self, **kwargs) -> "PeftConfig":
pass
@DeveloperAPI
@register_adapter(name="lora")
@ludwig_dataclass
class LoraConfig(BaseAdapterConfig):
def __post_init__(self):
if self.alpha is None:
self.alpha = self.r * 2
type: str = schema_utils.ProtectedString(
"lora",
description=LLM_METADATA["adapter"]["lora"]["type"].long_description,
)
r: int = schema_utils.PositiveInteger(
default=8,
description="Lora attention dimension.",
parameter_metadata=LLM_METADATA["adapter"]["lora"]["r"],
)
alpha: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The alpha parameter for Lora scaling. Defaults to `2 * r`.",
parameter_metadata=LLM_METADATA["adapter"]["lora"]["alpha"],
)
dropout: float = schema_utils.NonNegativeFloat(
default=0.05,
description="The dropout probability for Lora layers.",
parameter_metadata=LLM_METADATA["adapter"]["lora"]["dropout"],
)
# TODO(travis): figure out why calling this `bias` doesn't work
bias_type: str = schema_utils.StringOptions(
options=["none", "all", "lora_only"],
default="none",
description="Bias type for Lora.",
)
target_modules: list[str] | None = schema_utils.List(
default=None,
allow_none=True,
description=(
"List of module names, or a regular expression matching the module names, to replace with LoRA. "
"For example, ['q', 'v'] or '.*decoder.*(SelfAttention|EncDecAttention).*(q|v)$'. "
"Defaults to targeting the query and value matrices of all self-attention and encoder-decoder attention "
"layers."
),
parameter_metadata=LLM_METADATA["adapter"]["lora"]["target_modules"],
)
use_rslora: bool = schema_utils.Boolean(
default=False,
description=(
"When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to "
"lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original "
"default value of lora_alpha/r. Paper: https://arxiv.org/abs/2312.03732."
),
parameter_metadata=LLM_METADATA["adapter"]["lora"]["use_rslora"],
)
use_dora: bool = schema_utils.Boolean(
default=False,
description=(
"Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA). This technique decomposes the updates of the "
"weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the "
"magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, "
"especially at low ranks. Right now, DoRA only supports non-quantized linear layers. DoRA introduces a "
"bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more "
"information, see https://arxiv.org/abs/2402.09353"
),
parameter_metadata=LLM_METADATA["adapter"]["lora"]["use_dora"],
)
def to_config(self, task_type: str | None = None, **kwargs) -> "PeftConfig":
from peft import LoraConfig as _LoraConfig
return _LoraConfig(
r=self.r,
lora_alpha=self.alpha,
lora_dropout=self.dropout,
bias=self.bias_type,
target_modules=self.target_modules,
task_type=task_type,
use_rslora=self.use_rslora,
use_dora=self.use_dora,
)
@classmethod
def name(cls) -> str:
return "LoRA"
@classmethod
def description(cls) -> str:
return LLM_METADATA["adapter"]["lora"]["type"].long_description
@DeveloperAPI
@ludwig_dataclass
class BasePromptLearningConfig(BaseAdapterConfig):
"""Config for prompt learning adapters. Not meant to be used directly.
Adapted from https://github.com/huggingface/peft/blob/main/src/peft/utils/config.py (PromptLearningConfig)
"""
num_virtual_tokens: int = schema_utils.PositiveInteger(
default=8,
description="Number of virtual tokens to add to the prompt. Virtual tokens are used to control the behavior of "
"the model during inference.",
parameter_metadata=LLM_METADATA["adapter"]["prompt_learning"]["num_virtual_tokens"],
)
token_dim: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The hidden embedding dimension of the base transformer model.",
)
num_transformer_submodules: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of transformer submodules in the base transformer model.",
)
num_attention_heads: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of attention heads in the base transformer model.",
)
num_layers: int | None = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="The number of layers in the base transformer model.",
)
# TODO(travis): fix text generation when using prompt tuning:
# RuntimeError: shape '[-1, 17]' is invalid for input of size 9
# @DeveloperAPI
# @register_adapter("prompt_tuning")
# @ludwig_dataclass
# class PromptTuningConfig(BasePromptLearningConfig):
# """Adapted from https://github.com/huggingface/peft/blob/main/src/peft/tuners/prompt_tuning.py."""
# def __post_init__(self):
# if self.prompt_tuning_init == "TEXT" and not self.prompt_tuning_init_text:
# raise ConfigValidationError(
# "Must provide `prompt_tuning_init_text` when `prompt_tuning_init` is set to `TEXT`."
# )
"""# type: str = schema_utils.ProtectedString("prompt_tuning")""" # Quotes allow mypy to run without syntax errors.
# prompt_tuning_init: str = schema_utils.StringOptions(
# ["RANDOM", "TEXT"],
# default="RANDOM",
# description="The type of initialization to use for the prompt embedding. ",
# parameter_metadata=LLM_METADATA["adapter"]["prompt_tuning"]["prompt_tuning_init"],
# )
# prompt_tuning_init_text: str = schema_utils.String(
# default="",
# description="The text to use to initialize the prompt embedding.",
# parameter_metadata=LLM_METADATA["adapter"]["prompt_tuning"]["prompt_tuning_init_text"],
# )
# def to_config(self, **kwargs) -> "PeftConfig":
# from peft import PromptTuningConfig as _PromptTuningConfig
# return _PromptTuningConfig(
# num_virtual_tokens=self.num_virtual_tokens,
# token_dim=self.token_dim,
# num_transformer_submodules=self.num_transformer_submodules,
# num_attention_heads=self.num_attention_heads,
# num_layers=self.num_layers,
# prompt_tuning_init=self.prompt_tuning_init,
# prompt_tuning_init_text=self.prompt_tuning_init_text,
# **kwargs
# )
# TODO(travis): fix prefix tuning and p-tuning to work with DDP
# @DeveloperAPI
# @register_adapter("prefix_tuning")
# @schema_utils.ludwig_dataclass
# class PrefixTuningConfig(BasePromptLearningConfig):
# """Adapted from https://github.com/huggingface/peft/blob/main/src/peft/tuners/prefix_tuning.py."""
"""# type: str = schema_utils.ProtectedString("prefix_tuning")""" # Quotes allow mypy to run without syntax errors.
# encoder_hidden_size: Optional[int] = schema_utils.Integer(
# default=None,
# allow_none=True,
# description="The hidden embedding dimension of the prompt encoder.",
# )
# prefix_projection: bool = schema_utils.Boolean(
# default=False,
# description="Whether to use a projection layer in the prompt encoder to project the prefix tokens",
# )
# def to_config(self, task_type: str = None, **kwargs) -> "PeftConfig":
# from peft import PrefixTuningConfig as _PrefixTuningConfig
# return _PrefixTuningConfig(
# num_virtual_tokens=self.num_virtual_tokens,
# token_dim=self.token_dim,
# num_transformer_submodules=self.num_transformer_submodules,
# num_attention_heads=self.num_attention_heads,
# num_layers=self.num_layers,
# encoder_hidden_size=self.encoder_hidden_size,
# prefix_projection=self.prefix_projection,
# task_type=task_type,
# )
# @DeveloperAPI
# @register_adapter("p_tuning")
# @ludwig_dataclass
# class PTuningConfig(BasePromptLearningConfig):
"""# type: str = schema_utils.ProtectedString("p_tuning")""" # Quotes allow mypy to run without syntax errors.
# encoder_reparameterization_type: str = schema_utils.StringOptions(
# ["MLP", "LSTM"],
# default="MLP",
# allow_none=False,
# description="The type of reparameterization to use for the prompt encoder.",
# )
# encoder_hidden_size: Optional[int] = schema_utils.PositiveInteger(
# default=None,
# allow_none=True,
# description="The hidden embedding dimension of the prompt encoder.",
# )
# encoder_num_layers: Optional[int] = schema_utils.PositiveInteger(
# default=2,
# allow_none=True,
# description="The number of layers in the prompt encoder.",
# )
# encoder_dropout: Optional[float] = schema_utils.FloatRange(
# default=0.0,
# min=0.0,
# max=1.0,
# description="The dropout probability for the prompt encoder.",
# )
# def to_config(self, task_type: str = None, **kwargs) -> "PeftConfig":
# from peft import PromptEncoderConfig as _PromptEncoderConfig
# return _PromptEncoderConfig(
# num_virtual_tokens=self.num_virtual_tokens,
# token_dim=self.token_dim,
# num_transformer_submodules=self.num_transformer_submodules,
# num_attention_heads=self.num_attention_heads,
# num_layers=self.num_layers,
# encoder_reparameterization_type=self.encoder_reparameterization_type,
# encoder_hidden_size=self.encoder_hidden_size,
# encoder_num_layers=self.encoder_num_layers,
# encoder_dropout=self.encoder_dropout,
# task_type=task_type,
# )
@DeveloperAPI
@register_adapter("adalora")
@ludwig_dataclass
class AdaloraConfig(LoraConfig):
"""Adapted from https://github.com/huggingface/peft/blob/main/src/peft/tuners/adalora.py."""
type: str = schema_utils.ProtectedString(
"adalora",
description=LLM_METADATA["adapter"]["adalora"]["type"].long_description,
)
target_r: int = schema_utils.PositiveInteger(
default=8,
description="Target Lora Matrix Dimension. The target average rank of incremental matrix.",
)
init_r: int = schema_utils.PositiveInteger(
default=12,
description="Initial Lora Matrix Dimension. The initial rank for each incremental matrix.",
)
tinit: int = schema_utils.NonNegativeInteger(
default=0,
description="The steps of initial fine-tuning warmup.",
)
tfinal: int = schema_utils.NonNegativeInteger(
default=0,
description="The steps of final fine-tuning warmup.",
)
delta_t: int = schema_utils.NonNegativeInteger(
default=1,
description="The time interval between two budget allocations. The step interval of rank allocation.",
)
beta1: float = schema_utils.FloatRange(
default=0.85,
min=0.0,
max=1.0,
description="The hyperparameter of EMA for sensitivity smoothing.",
)
beta2: float = schema_utils.FloatRange(
default=0.85,
min=0.0,
max=1.0,
description="The hyperparameter of EMA for uncertainty quantification.",
)
orth_reg_weight: float = schema_utils.FloatRange(
default=0.5,
min=0.0,
max=1.0,
description="The coefficient of orthogonality regularization.",
)
total_step: int = schema_utils.PositiveInteger(
default=10000,
allow_none=False,
description="The total training steps for AdaLoRA rank allocation scheduling. "
"Must be a positive integer (required by peft >= 0.14).",
)
rank_pattern: dict | None = schema_utils.Dict(
default=None,
allow_none=True,
description="The allocated rank for each weight matrix by RankAllocator.",
)
def to_config(self, **kwargs) -> "PeftConfig":
from peft import AdaLoraConfig as _AdaLoraConfig
return _AdaLoraConfig(
r=self.r,
lora_alpha=self.alpha,
lora_dropout=self.dropout,
bias=self.bias_type,
target_r=self.target_r,
init_r=self.init_r,
tinit=self.tinit,
tfinal=self.tfinal,
deltaT=self.delta_t,
beta1=self.beta1,
beta2=self.beta2,
orth_reg_weight=self.orth_reg_weight,
total_step=self.total_step,
rank_pattern=self.rank_pattern,
)
@classmethod
def name(cls) -> str:
return "AdaLoRA"
@classmethod
def description(cls) -> str:
return LLM_METADATA["adapter"]["adalora"]["type"].long_description
@DeveloperAPI
# TODO: 02/21/2024: Disabling AdaptionPrompt (waiting for PEFT release to fix
# "TypeError: LlamaRotaryEmbedding.forward() missing 1 required positional argument: 'position_ids')"
# (this is reflected in https://github.com/ludwig-ai/ludwig/issues/3938).
#
# @register_adapter("adaption_prompt")
@ludwig_dataclass
class AdaptionPromptConfig(BaseAdapterConfig):
"""Adapted from https://github.com/huggingface/peft/blob/main/src/peft/tuners/adaption_prompt/config.py."""
def __post_init__(self):
if not self.adapter_len:
raise ConfigValidationError(
"`adapter_len` must be set to a value greater than 0 when finetuning is enabled and the adapter "
"type is `adaption_prompt`. This is the length of the adaption prompt to insert."
)
if not self.adapter_layers:
raise ConfigValidationError(
"`adapter_layers` must be set to a value greater than 0 when finetuning is enabled and the adapter "
"type is `adaption_prompt`. This is the number of adapter layers to insert."
)
type: str = schema_utils.ProtectedString(
"adaption_prompt",
description=LLM_METADATA["adapter"]["adaption_prompt"]["type"].long_description,
)
adapter_len: int = schema_utils.PositiveInteger(
default=4,
description="Number of adapter tokens to insert.",
parameter_metadata=LLM_METADATA["adapter"]["adaption_prompt"]["adapter_len"],
)
adapter_layers: int = schema_utils.PositiveInteger(
default=1,
allow_none=False,
description="Number of adapter layers to insert (from the top).",
parameter_metadata=LLM_METADATA["adapter"]["adaption_prompt"]["adapter_layers"],
)
def to_config(self, task_type: str | None = None, **kwargs) -> "PeftConfig":
from peft import AdaptionPromptConfig as _AdaptionPromptConfig
return _AdaptionPromptConfig(
adapter_len=self.adapter_len,
adapter_layers=self.adapter_layers,
task_type=task_type,
)
@classmethod
def name(cls) -> str:
return "Adaption Prompt"
@classmethod
def description(cls) -> str:
return LLM_METADATA["adapter"]["adaption_prompt"]["type"].long_description
@DeveloperAPI
@register_adapter("ia3")
@ludwig_dataclass
class IA3Config(BaseAdapterConfig):
type: str = schema_utils.ProtectedString(
"ia3",
description=LLM_METADATA["adapter"]["ia3"]["type"].long_description,
)
target_modules: list[str] | None = schema_utils.List(
default=None,
allow_none=True,
description="The names of the modules to apply (IA)^3 to.",
parameter_metadata=LLM_METADATA["adapter"]["ia3"]["target_modules"],
)
feedforward_modules: list[str] | None = schema_utils.List(
default=None,
allow_none=True,
description=(
"The names of the modules to be treated as feedforward modules, as in the original paper. These modules "
"will have (IA)^3 vectors multiplied to the input, instead of the output. feedforward_modules must be a "
"name or a subset of names present in target_modules."
),
parameter_metadata=LLM_METADATA["adapter"]["ia3"]["feedforward_modules"],
)
fan_in_fan_out: bool = schema_utils.Boolean(
default=False,
description=(
"Set this to True if the layer to replace stores weights like (fan_in, fan_out). For example, gpt-2 uses "
"Conv1D which stores weights like (fan_in, fan_out) and hence this should be set to True."
),
parameter_metadata=LLM_METADATA["adapter"]["ia3"]["fan_in_fan_out"],
)
modules_to_save: list[str] | None = schema_utils.List(
list_type=str,
default=None,
allow_none=True,
description=(
"List of modules apart from (IA)^3 layers to be set as trainable and saved in the final checkpoint."
),
parameter_metadata=LLM_METADATA["adapter"]["ia3"]["modules_to_save"],
)
init_ia3_weights: bool = schema_utils.Boolean(
default=True,
description="Whether to initialize the vectors in the (IA)^3 layers, defaults to True.",
parameter_metadata=LLM_METADATA["adapter"]["ia3"]["init_ia3_weights"],
)
def to_config(self, task_type: str | None = None, **kwargs) -> "PeftConfig":
from peft import IA3Config as _IA3Config
return _IA3Config(
target_modules=self.target_modules,
feedforward_modules=self.feedforward_modules,
fan_in_fan_out=self.fan_in_fan_out,
modules_to_save=self.modules_to_save,
init_ia3_weights=self.init_ia3_weights,
task_type=task_type,
)
@classmethod
def name(cls) -> str:
return "IA3"
@classmethod
def description(cls) -> str:
return LLM_METADATA["adapter"]["ia3"]["type"].long_description
@DeveloperAPI
def get_adapter_conds():
conds = []
for adapter_type, adapter_cls in adapter_registry.items():
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(adapter_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
preproc_cond = schema_utils.create_cond(
{"type": adapter_type},
other_props,
)
conds.append(preproc_cond)
return conds
@DeveloperAPI
def AdapterDataclassField(default: str | None = None):
description = "Whether to use parameter-efficient fine-tuning"
class AdapterSelection(schema_utils.TypeSelection):
def __init__(self):
super().__init__(
registry=adapter_registry,
default_value=default,
description=description,
parameter_metadata=None,
allow_str_value=True,
allow_none=True,
)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return adapter_registry[key]
@staticmethod
def _jsonschema_type_mapping():
return {
"oneOf": [
{
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": list(adapter_registry.keys()),
"description": "The type of PEFT adapter to use during fine-tuning",
},
},
"title": "Perform parameter efficient fine-tuning",
"allOf": get_adapter_conds(),
"required": ["type"],
"description": "The type of PEFT adapter to use during fine-tuning",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA["adapter"]["_oneOf"]["allOf"]),
},
{
"type": "null",
"title": "adapter_null_option",
"description": "Disable the adapter.",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA["adapter"]["_oneOf"]["none"]),
},
],
"title": "adapter_options",
"description": "Whether to use parameter-efficient fine-tuning",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA["adapter"]["_meta"]),
"default": default,
}
return AdapterSelection().get_default_field()
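# Illustrative note (not part of the Ludwig source): the LoraConfig descriptions above mention two scaling
# rules (lora_alpha / r, or lora_alpha / sqrt(r) with use_rslora) and an alpha default of 2 * r.
# The sketch below works through that arithmetic; `lora_scaling` is a hypothetical helper, not a PEFT or
# Ludwig function.

```python
from __future__ import annotations

import math


def lora_scaling(r: int, alpha: int | None = None, use_rslora: bool = False) -> float:
    """Return the adapter scaling factor implied by r, alpha, and use_rslora."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    # Mirrors LoraConfig.__post_init__: alpha falls back to 2 * r when it is unset.
    if alpha is None:
        alpha = 2 * r
    # Rank-stabilized LoRA divides by sqrt(r); the original formulation divides by r.
    return alpha / math.sqrt(r) if use_rslora else alpha / r


print(lora_scaling(8))                              # alpha defaults to 16, so 16 / 8
print(lora_scaling(16, alpha=16, use_rslora=True))  # 16 / sqrt(16)
```

# With the default alpha the classic scaling is always 2.0 regardless of r, whereas the rank-stabilized
# variant grows with sqrt(r) for the same default.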
================================================
FILE: ludwig/schema/llms/prompt.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import SEMANTIC
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import LLM_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class RetrievalConfig(schema_utils.BaseMarshmallowConfig):
"""This Dataclass is a schema for the nested retrieval config under prompt."""
def __post_init__(self):
# TODO: have a dynamically loaded schema based on the selection of the type param
# https://github.com/ludwig-ai/ludwig/pull/3351#discussion_r1181910954
# Ensure k is non-zero if we're using a retrieval strategy
if self.type is not None and self.k == 0:
self.k = 1
if self.type is None and self.k != 0:
raise ConfigValidationError("k must be 0 if retrieval type is None.")
elif self.type is not None and self.k <= 0:
raise ConfigValidationError("k must be greater than 0 if retrieval type is not None.")
if self.type is None and self.model_name is not None:
raise ConfigValidationError("model_name must be None if retrieval type is None.")
elif self.type == SEMANTIC and self.model_name is None:
raise ConfigValidationError(f"model_name must not be None if retrieval type is '{SEMANTIC}'.")
type: str = schema_utils.String(
default=None,
allow_none=True,
description=(
"The type of retrieval to use for the prompt. If `None`, then no retrieval is used, and the task "
"is framed as a zero-shot learning problem. If not `None` (e.g. either 'random' or 'semantic'), then "
"samples are retrieved from an index of the training set and used to augment the input to the model "
"in a few-shot learning setting."
),
parameter_metadata=LLM_METADATA["prompt"]["retrieval"]["type"],
)
index_name: str = schema_utils.String(
default=None,
allow_none=True,
description="The name of the index to use for the prompt. Indices are stored in the ludwig cache by default.",
parameter_metadata=LLM_METADATA["prompt"]["retrieval"]["index_name"],
)
model_name: str = schema_utils.String(
default=None,
allow_none=True,
description="The model used to generate the embeddings used to retrieve samples to inject in the prompt.",
parameter_metadata=LLM_METADATA["prompt"]["retrieval"]["model_name"],
)
k: int = schema_utils.NonNegativeInteger(
default=0,
description="The number of samples to retrieve.",
parameter_metadata=LLM_METADATA["prompt"]["retrieval"]["k"],
)
@DeveloperAPI
class RetrievalConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(RetrievalConfig)
def _jsonschema_type_mapping(self):
return schema_utils.unload_jsonschema_from_marshmallow_class(RetrievalConfig, title="Retrieval")
@DeveloperAPI
@ludwig_dataclass
class PromptConfig(schema_utils.BaseMarshmallowConfig):
"""This Dataclass is a schema for the nested prompt config under preprocessing."""
template: str = schema_utils.String(
default=None,
allow_none=True,
description=(
"The template to use for the prompt. Must contain at least one of the columns from the input dataset "
"or `__sample__` as a variable surrounded in curly brackets {} to indicate where to insert the "
"current feature. Multiple columns can be inserted, e.g.: `The {color} {animal} jumped over "
"the {size} {object}`, where every term in curly brackets is a column in the dataset. If a `task` "
"is specified, then the template must also contain the `__task__` variable. If `retrieval` is specified, "
"then the template must also contain the `__context__` variable. If no template is provided, then a "
"default will be used based on the retrieval settings, and a task must be set in the config."
),
parameter_metadata=LLM_METADATA["prompt"]["template"],
)
task: str = schema_utils.String(
default=None,
allow_none=True,
description="The task to use for the prompt. Required if `template` is not set.",
parameter_metadata=LLM_METADATA["prompt"]["task"],
)
retrieval: RetrievalConfig = RetrievalConfigField().get_default_field()
@DeveloperAPI
class PromptConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(PromptConfig)
def _jsonschema_type_mapping(self):
return schema_utils.unload_jsonschema_from_marshmallow_class(PromptConfig)
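# Illustrative note (not part of the Ludwig source): the `template` description above says that dataset
# columns appear in curly brackets and that `__task__` must be present when a task is specified. A minimal
# sketch of that substitution follows, assuming plain `str.format` semantics; `template_fields` and
# `render_prompt` are hypothetical helpers, and Ludwig's own rendering may differ.

```python
from __future__ import annotations

from string import Formatter


def template_fields(template: str) -> set[str]:
    """Collect the variable names that appear inside curly brackets."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render_prompt(template: str, row: dict[str, str], task: str | None = None) -> str:
    """Substitute dataset columns (and the task, if any) into the template."""
    fields = template_fields(template)
    if task is not None and "__task__" not in fields:
        raise ValueError("template must contain {__task__} when a task is specified")
    values = dict(row)
    if task is not None:
        values["__task__"] = task
    # Every variable in the template must be backed by a column or a special variable.
    missing = fields - values.keys()
    if missing:
        raise KeyError(f"template references unknown variables: {sorted(missing)}")
    return template.format(**values)


row = {"color": "brown", "animal": "fox", "size": "lazy", "object": "dog"}
print(render_prompt("The {color} {animal} jumped over the {size} {object}", row))
```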
================================================
FILE: ludwig/schema/llms/quantization.py
================================================
import warnings
from transformers import BitsAndBytesConfig
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import LLM_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json
from ludwig.schema.utils import ludwig_dataclass
warnings.filterwarnings(
action="ignore",
category=UserWarning,
module="bitsandbytes.cuda_setup.main",
)
@DeveloperAPI
@ludwig_dataclass
class QuantizationConfig(schema_utils.BaseMarshmallowConfig):
bits: int = schema_utils.IntegerOptions(
options=[4, 8],
default=4,
description="The quantization level to apply to weights on load.",
parameter_metadata=LLM_METADATA["quantization"]["bits"],
)
llm_int8_threshold: float = schema_utils.NonNegativeFloat(
default=6.0,
description=(
"This corresponds to the outlier threshold for outlier detection as described in the `LLM.int8(): 8-bit "
"Matrix Multiplication for Transformers at Scale` paper: https://arxiv.org/abs/2208.07339. Any hidden "
"state value that is above this threshold will be considered an outlier and the operation on those "
"values will be done in fp16. Values are usually normally distributed, that is, most values are in the "
"range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently "
"distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 "
"quantization works well for values of magnitude ~5, but beyond that, there is a significant performance "
"penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models "
"(small models, fine-tuning)."
),
)
llm_int8_has_fp16_weight: bool = schema_utils.Boolean(
default=False,
description=(
"This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do "
"not have to be converted back and forth for the backward pass."
),
)
bnb_4bit_compute_dtype: str = schema_utils.StringOptions(
options=["float32", "float16", "bfloat16"],
default="float16",
description=(
"This sets the computational type which might be different than the input type. For example, inputs "
"might be fp32, but computation can be set to bf16 for speedups."
),
)
bnb_4bit_use_double_quant: bool = schema_utils.Boolean(
default=True,
description=(
"This flag is used for nested quantization where the quantization constants from the first quantization "
"are quantized again."
),
)
bnb_4bit_quant_type: str = schema_utils.StringOptions(
options=["fp4", "nf4"],
default="nf4",
description="This sets the quantization data type in the bnb.nn.Linear4Bit layers.",
)
def to_bitsandbytes(self) -> BitsAndBytesConfig:
return BitsAndBytesConfig(
load_in_4bit=self.bits == 4,
load_in_8bit=self.bits == 8,
llm_int8_threshold=self.llm_int8_threshold,
llm_int8_has_fp16_weight=self.llm_int8_has_fp16_weight,
bnb_4bit_compute_dtype=self.bnb_4bit_compute_dtype,
bnb_4bit_use_double_quant=self.bnb_4bit_use_double_quant,
bnb_4bit_quant_type=self.bnb_4bit_quant_type,
)
@DeveloperAPI
class QuantizationConfigField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(QuantizationConfig, default_missing=True)
def _jsonschema_type_mapping(self):
return {
"oneOf": [
{
"type": "null",
"title": "disabled",
"description": "Disable quantization.",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA["quantization"]["_oneOf"]["none"]),
},
{
**schema_utils.unload_jsonschema_from_marshmallow_class(QuantizationConfig),
"title": "enabled",
"description": "Set quantization options.",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA["quantization"]["_oneOf"]["object"]),
},
],
"title": "quantization",
"description": "Set quantization options.",
"parameter_metadata": convert_metadata_to_json(LLM_METADATA["quantization"]["_meta"]),
}
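# Illustrative sketch (not part of Ludwig): how `bits` selects the two load flags that
# `to_bitsandbytes` above passes to `BitsAndBytesConfig`. `QuantizationSketch` and `to_flags`
# are hypothetical names used only for this example; the real config also forwards the
# `llm_int8_*` and `bnb_4bit_*` options, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class QuantizationSketch:
    """Hypothetical stand-in for QuantizationConfig; only `bits` is modeled."""

    bits: int = 4

    def to_flags(self) -> dict:
        # Same comparison as in `to_bitsandbytes`: at most one flag is True,
        # and both are False for any value other than 4 or 8.
        return {"load_in_4bit": self.bits == 4, "load_in_8bit": self.bits == 8}


print(QuantizationSketch(bits=4).to_flags())
print(QuantizationSketch(bits=8).to_flags())
```

# Because both flags derive from one integer, the two loading modes cannot be enabled together.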
================================================
FILE: ludwig/schema/lr_scheduler.py
================================================
from abc import ABC
from dataclasses import field
import ludwig.schema.utils as schema_utils
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import LOSS, MODEL_ECD, TRAINING
from ludwig.error import ConfigValidationError
from ludwig.schema.metadata import TRAINER_METADATA
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class LRSchedulerConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Configuration for learning rate scheduler parameters."""
decay: str = schema_utils.StringOptions(
options=["linear", "exponential", "cosine"],
default=None,
allow_none=True,
description="Turn on decay of the learning rate.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["decay"],
)
decay_rate: float = schema_utils.FloatRange(
default=0.96,
min=0,
max=1,
description="Decay per epoch (%): factor by which to decrease the learning rate.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["decay_rate"],
)
decay_steps: int = schema_utils.PositiveInteger(
default=10000,
description="The number of steps to take in the exponential learning rate decay.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["decay_steps"],
)
staircase: bool = schema_utils.Boolean(
default=False,
description="Decays the learning rate at discrete intervals.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["staircase"],
)
reduce_on_plateau: int = schema_utils.NonNegativeInteger(
default=0,
description=(
"How many times to reduce the learning rate when the algorithm hits a plateau (i.e. the performance on the "
"training set does not improve)"
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["reduce_on_plateau"],
)
reduce_on_plateau_patience: int = schema_utils.NonNegativeInteger(
default=10,
description=(
"How many evaluation steps have to pass before the learning rate is reduced when `reduce_on_plateau > 0`."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["reduce_on_plateau_patience"],
)
reduce_on_plateau_rate: float = schema_utils.FloatRange(
default=0.1,
min=0,
max=1,
description="Rate at which we reduce the learning rate when `reduce_on_plateau > 0`.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["reduce_on_plateau_rate"],
)
warmup_evaluations: float = schema_utils.NonNegativeFloat(
default=0,
description="Number of evaluation steps to warmup the learning rate for.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["warmup_evaluations"],
)
warmup_fraction: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Fraction of total training steps to warmup the learning rate for.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["warmup_fraction"],
)
reduce_eval_metric: str = schema_utils.String(
default=LOSS,
allow_none=False,
description=(
"Metric whose plateau triggers a learning rate reduction when `reduce_on_plateau > 0`."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["reduce_eval_metric"],
)
reduce_eval_split: str = schema_utils.String(
default=TRAINING,
allow_none=False,
description=(
"Dataset split whose metric is monitored to reduce the learning rate when `reduce_on_plateau > 0`."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["reduce_eval_split"],
)
# Parameters for CosineAnnealingWarmRestarts scheduler
t_0: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of steps before the first restart for cosine annealing decay. If not specified, it"
" will be set to `steps_per_checkpoint`.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["t_0"],
)
t_mult: int = schema_utils.PositiveInteger(
default=1,
description="Period multiplier after each restart for cosine annealing decay. Defaults to 1, i.e.,"
" restart every `t_0` steps. If set to a larger value, the period between restarts increases by that"
" multiplier. For example, if `t_mult` is 2, then the periods would be: t_0, 2*t_0, 2^2*t_0, 2^3*t_0, etc.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["t_mult"],
)
eta_min: float = schema_utils.FloatRange(
default=0,
min=0,
max=1,
description="Minimum learning rate allowed for cosine annealing decay. Default: 0.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["eta_min"],
)
# TODO(travis): too much boilerplate here, we should find a way to abstract all this and only require specifying the
# minimal amount needed for the new config object.
@DeveloperAPI
def LRSchedulerDataclassField(description: str, default: dict = None):
"""Returns custom dataclass field for `LRSchedulerConfig`. Allows `None` by default.
Args:
description: Description of the dataclass field
default: dict that specifies param values that will be loaded by its schema class (default: None).
"""
allow_none = True
default = default or {}
class LRSchedulerMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field class for learning rate scheduler.
Deserializes a dict to a valid instance of `LRSchedulerConfig` and creates a corresponding JSON schema for
external usage.
"""
def _deserialize(self, value, attr, data, **kwargs):
if value is None:
return value
if isinstance(value, dict):
try:
return LRSchedulerConfig.Schema().load(value)
except (TypeError, ConfigValidationError) as e:
raise ConfigValidationError(
f"Invalid params for learning rate scheduler: {value}, see LRSchedulerConfig class. Error: {e}"
) from e
raise ConfigValidationError("Field should be None or dict")
def _jsonschema_type_mapping(self):
return {
**schema_utils.unload_jsonschema_from_marshmallow_class(LRSchedulerConfig),
"title": "learning_rate_scheduler_options",
"description": description,
}
if not isinstance(default, dict):
raise ConfigValidationError(f"Invalid default: `{default}`")
load_default = lambda: LRSchedulerConfig.Schema().load(default)
dump_default = LRSchedulerConfig.Schema().dump(default)
return field(
metadata={
"marshmallow_field": LRSchedulerMarshmallowField(
allow_none=allow_none,
load_default=load_default,
dump_default=dump_default,
metadata={
"description": description,
},
)
},
default_factory=load_default,
)
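# Illustrative sketch (not part of Ludwig): what the scheduler parameters in `LRSchedulerConfig`
# describe numerically. The function names are hypothetical. The cosine formula follows the one
# documented for PyTorch's CosineAnnealingWarmRestarts; the exponential form is an assumption
# about how `decay_rate`, `decay_steps` and `staircase` combine, not a copy of Ludwig's trainer.

```python
import math


def restart_periods(t_0: int, t_mult: int, n: int) -> list:
    """Lengths (in steps) of the first n cosine annealing cycles."""
    return [t_0 * t_mult**i for i in range(n)]


def cosine_lr(base_lr: float, eta_min: float, t_cur: int, t_i: int) -> float:
    """Learning rate t_cur steps into a cycle that lasts t_i steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


def exponential_lr(base_lr: float, decay_rate: float, decay_steps: int, step: int, staircase: bool = False) -> float:
    """Assumed form: base_lr * decay_rate ** (step / decay_steps)."""
    exponent = step / decay_steps
    if staircase:
        # Decay only at discrete intervals of `decay_steps`.
        exponent = math.floor(exponent)
    return base_lr * decay_rate**exponent


print(restart_periods(t_0=10, t_mult=2, n=4))  # [10, 20, 40, 80]
print(cosine_lr(base_lr=0.1, eta_min=0.0, t_cur=5, t_i=10))
print(exponential_lr(1.0, 0.96, 10000, 15000, staircase=True))
```

# With `t_mult=1` every cycle has length `t_0`; with `staircase=True` the rate stays constant
# between multiples of `decay_steps`.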
================================================
FILE: ludwig/schema/metadata/__init__.py
================================================
import os
from typing import Any
import yaml
from ludwig.schema.metadata.parameter_metadata import ParameterMetadata
_PATH_HERE = os.path.abspath(os.path.dirname(__file__))
_CONFIG_DIR = os.path.join(_PATH_HERE, "configs")
def _to_metadata(d: dict[str, Any]) -> ParameterMetadata | dict[str, Any]:
is_nested = False
for k, v in list(d.items()):
if isinstance(v, dict):
d[k] = _to_metadata(v)
is_nested = True
if is_nested:
return d
return ParameterMetadata.from_dict(d)
def _load(fname: str) -> dict[str, Any]:
with open(os.path.join(_CONFIG_DIR, fname)) as f:
return _to_metadata(yaml.safe_load(f))
COMMON_METADATA = _load("common.yaml")
COMBINER_METADATA = _load("combiners.yaml")
DECODER_METADATA = _load("decoders.yaml")
ENCODER_METADATA = _load("encoders.yaml")
FEATURE_METADATA = _load("features.yaml")
PREPROCESSING_METADATA = _load("preprocessing.yaml")
TRAINER_METADATA = _load("trainer.yaml")
OPTIMIZER_METADATA = _load("optimizers.yaml")
LOSS_METADATA = _load("loss.yaml")
LLM_METADATA = _load("llm.yaml")
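# Illustrative sketch (not part of Ludwig): how the recursion in `_to_metadata` turns a nested
# mapping into a tree whose leaves are metadata objects. `MetadataSketch` is a hypothetical
# stand-in for `ParameterMetadata` with only two fields, and the input is an inline dict rather
# than a YAML file so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for ParameterMetadata."""

    ui_display_name: Optional[str] = None
    expected_impact: int = 0

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "MetadataSketch":
        return cls(**d)


def to_metadata(d: Dict[str, Any]) -> Union[MetadataSketch, Dict[str, Any]]:
    # Same recursion as `_to_metadata`: a dict with at least one dict value is
    # kept as a container; a dict with no dict values becomes a leaf object.
    is_nested = False
    for k, v in list(d.items()):
        if isinstance(v, dict):
            d[k] = to_metadata(v)
            is_nested = True
    return d if is_nested else MetadataSketch.from_dict(d)


raw = {"concat": {"dropout": {"ui_display_name": "Dropout", "expected_impact": 3}}}
tree = to_metadata(raw)
print(tree["concat"]["dropout"])
```

# Lookups such as `TRAINER_METADATA[MODEL_ECD]["learning_rate_scheduler"]["decay"]` rely on this
# shape: plain dict indexing down to a leaf metadata object. Note that the input dict is mutated
# in place.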
================================================
FILE: ludwig/schema/metadata/configs/combiners.yaml
================================================
comparator:
type:
short_description: Used for recommendation problems, features associated with distinct entities, output depends on entity-level comparison.
long_description:
The comparator combiner compares the hidden representation of two entities defined by lists of
features. It assumes all outputs from encoders are tensors of size `b x h` where `b` is the batch
size and `h` is the hidden dimension, which can be different for each input. If the input tensors
have a different shape, it automatically flattens them. It then concatenates the representations
of each entity and projects them both to vectors of size `output_size`. Finally, it compares the
two entity representations by dot product, element-wise multiplication, absolute difference and
bilinear product. It returns the final `b x h` tensor where `h` is the size of the concatenation
of the four comparisons.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
entity_1:
literature_references:
- https://ludwig.ai/0.6/configuration/combiner/#comparator-combiner
ui_display_name: Entity 1
expected_impact: 3
entity_2:
literature_references:
- https://ludwig.ai/0.6/configuration/combiner/#comparator-combiner
ui_display_name: Entity 2
expected_impact: 3
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control over
each individual layer in the stack.
ui_display_name: Fully Connected Layers
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
ui_display_name: null
expected_impact: 1
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 3
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers
- fc_layers
suggested_values: 16 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer
- fc_layers
suggested_values: "TRUE"
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
concat:
type:
short_description: Concatenates outputs of all encoders and passes concatenated representation through stack of fully connected layers.
long_description:
The concat combiner assumes all outputs from encoders are tensors of size `b x h` where `b` is
the batch size and `h` is the hidden dimension, which can differ for each input. It
concatenates along the `h` dimension, and then (optionally) passes the concatenated tensor
through a stack of fully connected layers. It returns the final `b x h` tensor where `h` is the
size of the last fully connected layer or the sum of the sizes of the `h` of all inputs in the
case there are no fully connected layers. If there is only a single input feature and no fully
connected layers, the output of the input feature encoder is passed through the combiner
unchanged.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control over
each individual layer in the stack.
ui_display_name: Fully Connected Layers
flatten_inputs:
ui_display_name: null
expected_impact: 1
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
ui_display_name: null
expected_impact: 1
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 3
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers
- fc_layers
suggested_values: 16 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
residual:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer
- fc_layers
suggested_values: "TRUE"
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
project_aggregate:
type:
short_description: Projects the encoder outputs to a common size then takes the average.
long_description:
The project aggregate combiner projects the input vectors to a common size
and then aggregates them by taking the average across all the vectors.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
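# Illustrative sketch (not part of the metadata): two ways the FC stack of this
# combiner might be defined in a Ludwig config. All values are arbitrary examples.
#
# Uniform stack, where every layer shares the same settings:
#   combiner:
#     type: project_aggregate
#     num_fc_layers: 2
#     output_size: 128
#     dropout: 0.1
#
# Per-layer control, where fc_layers takes precedence over num_fc_layers:
#   combiner:
#     type: project_aggregate
#     fc_layers:
#       - output_size: 128
#         dropout: 0.1
#       - output_size: 64
#         norm: layer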
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
ui_display_name: null
expected_impact: 1
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 3
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers
- fc_layers
suggested_values: 17 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
projection_size:
ui_display_name: null
expected_impact: 1
residual:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer
- fc_layers
suggested_values: "TRUE"
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method used to initialize layer weights
can have a big impact on training performance as well as on the reproducibility
of your final model between runs. For example, purely random initialization
risks non-reproducibility (and possibly worse training performance), while
constant initialization might significantly increase the time needed
for the model to converge. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
sequence:
type:
short_description: Stacks a sequence concat combiner with a sequence encoder.
long_description:
The sequence combiner stacks a sequence concat combiner with a sequence encoder. All the
considerations about input tensor ranks described for the sequence concat combiner apply also in
this case, but the main difference is that this combiner uses the `b x s x h` output of the
sequence concat combiner, where `b` is the batch size, `s` is the sequence length and `h` is the
sum of the hidden dimensions of all input features, as input for any of the sequence encoders
described in the sequence features encoders section. All considerations on the shape of
the outputs for the sequence encoders also apply to the sequence combiner.
encoder:
ui_display_name: null
expected_impact: 3
main_sequence_feature:
ui_display_name: null
expected_impact: 3
reduce_output:
ui_display_name: null
expected_impact: 1
sequence_concat:
type:
short_description: Concatenates the outputs of multiple sequence features.
long_description:
The sequence_concat combiner assumes at least one output from the encoders is a tensor of size
`b x s x h` where `b` is the batch size, `s` is the length of the sequence and `h` is the hidden
dimension. A sequence-like (sequence, text or time series) input feature can be specified with
the `main_sequence_feature` parameter, which takes the name of a sequence-like input feature as its
value. If no `main_sequence_feature` is specified, the combiner will look through all the
features in the order they are defined in the configuration and will look for a feature with a
rank 3 tensor output (sequence, text or time series). If it cannot find one it will raise an
exception, otherwise the output of that feature will be used for concatenating the other features
along the sequence `s` dimension.
If there are other input features with a rank 3 output tensor, the combiner will concatenate
them alongside the `s` dimension, which means that all of them must have an identical `s` dimension.
Otherwise, a dimension mismatch error will be thrown during training when a datapoint
with two sequential features of different lengths is provided.
Other features that have a b x h rank 2 tensor output will be replicated s times and
concatenated to the s dimension. The final output is a b x s x h' tensor where h' is the size of
the concatenation of the h dimensions of all input features.
main_sequence_feature:
ui_display_name: null
expected_impact: 3
reduce_output:
ui_display_name: null
expected_impact: 1
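# Illustrative sketch (not part of the metadata), assuming hypothetical feature
# names `review_text` and `rating` and a recent Ludwig config layout:
#   input_features:
#     - name: review_text
#       type: text
#       encoder:
#         type: rnn
#         reduce_output: null   # keep the rank 3 (b x s x h) output
#     - name: rating
#       type: number
#   combiner:
#     type: sequence_concat
#     main_sequence_feature: review_text
# Under these assumptions, the rank 2 output of `rating` would be replicated
# s times and concatenated with the rank 3 output of `review_text`.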
tabnet:
type:
short_description: TabNet is specifically tailored for high performance on tabular data.
long_description:
The tabnet combiner implements the TabNet model, which uses attention and sparsity to achieve
high performance on tabular data. It assumes all outputs from encoders are tensors of size b x h
where b is the batch size and h is the hidden dimension, which can be different for each input.
If the input tensors have a different shape, it automatically flattens them. It returns the
final b x h' tensor where h' is the user-specified output size.
literature_references:
- https://arxiv.org/abs/1908.07442
compute_tier: 1
bn_epsilon:
default_value_reasoning:
Default value found in popular ML packages like Keras
and TensorFlow.
description_implications:
An epsilon is added to the denominator of the batch
normalization operation to avoid division by zero and keep the computation
numerically stable. Setting the epsilon to 0 is inadvisable.
example_value:
- 1.0e-05
expected_impact: 1
literature_references:
- "[Keras example](https://keras.io/api/layers/normalization_layers/batch_normalization/)"
suggested_values: 1e-9 - 1e-3
suggested_values_reasoning: Common epsilon choices
ui_display_name: Batch Normalization Epsilon
bn_momentum:
description_implications:
"Higher values result in faster updates, but more
sensitivity to noise in the dataset. Lower values result in slower updates.
If momentum is set to 0, moving statistics will not be updated during
training. This is likely to cause variance between train and test performance,
and is not recommended."
example_value:
- 0.05
literature_references:
- "TabNet Paper: https://arxiv.org/abs/1908.07442"
- "Torch Batch Norm: https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html"
other_information:
"`bn_momentum` is only used if `norm`: `batch`. For other
values of `norm` it has no effect.
`bn_momentum` is different from optimizer momentum. Batch norm moving
estimate statistics are updated according to the rule:
x_hat = (1 - momentum) * x_hat + momentum * x_t,
where x_hat is the estimated statistic and x_t is the new observed value."
suggested_values: 0.01-0.2
ui_display_name: Batch Norm Momentum
expected_impact: 1
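# Worked example of the bn_momentum update rule (illustrative numbers only):
# with momentum = 0.05, x_hat = 10.0 and a newly observed statistic x_t = 20.0,
#   x_hat = (1 - 0.05) * 10.0 + 0.05 * 20.0 = 9.5 + 1.0 = 10.5
# With momentum = 0.2 the same observation moves the estimate further:
#   x_hat = (1 - 0.2) * 10.0 + 0.2 * 20.0 = 8.0 + 4.0 = 12.0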
bn_virtual_bs:
default_value_reasoning: Paper default.
description_implications:
Virtual Batch Normalization is a normalization method
that extends batch normalization. Regular batch normalization causes the
output of a neural network for an input example to be highly dependent
on several other inputs in the same minibatch. To avoid this problem
in virtual batch normalization (VBN), each example is normalized based
on the statistics collected on a reference batch of examples that are
chosen once and fixed at the start of training, and on itself. The reference
batch is normalized using only its own statistics. VBN is computationally
expensive because it requires running forward propagation on two minibatches
of data, so the authors use it only in the generator network. A higher
virtual batch size could improve normalization, but it also causes training
to run slower since each batch will be sampled multiple times.
expected_impact: 1
literature_references:
- https://paperswithcode.com/method/virtual-batch-normalization
ui_display_name: "Ghost Normalization: Virtual batch size"
dropout:
default_value_reasoning: Taken from published literature (https://arxiv.org/abs/1908.07442).
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
entmax_alpha:
ui_display_name: null
expected_impact: 1
entmax_mode:
ui_display_name: null
expected_impact: 1
num_shared_blocks:
ui_display_name: null
expected_impact: 1
num_steps:
ui_display_name: null
expected_impact: 1
num_total_blocks:
ui_display_name: null
expected_impact: 1
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers
- fc_layers
suggested_values: 18 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
relaxation_factor:
ui_display_name: null
expected_impact: 1
size:
ui_display_name: null
expected_impact: 3
sparsity:
ui_display_name: null
expected_impact: 1
tabtransformer:
type:
short_description: Projects and concatenates features, then passes them through a transformer.
long_description:
The tabtransformer combiner combines input features in the following sequence of operations.
Except for binary and number features, the combiner projects features to an embedding size.
These features are concatenated as if they were a sequence and passed through a transformer.
After the transformer, the number and binary features (each of size 1) are concatenated with
each other and then with the output of the transformer, and the result is passed to a stack of fully
connected layers (from TabTransformer Tabular Data Modeling Using Contextual Embeddings). It assumes all
outputs from encoders are tensors of size `b x h` where `b` is the batch size and `h` is the
hidden dimension, which can be different for each input. If the input tensors have a different
shape, it automatically flattens them. It then projects each input tensor to the same hidden /
embedding size and encodes them with a stack of Transformer layers. Finally, the transformer
combiner applies a reduction to the outputs of the Transformer stack, followed by the above
concatenation and optional fully connected layers. The output is a `b x h` tensor where `h` is the
size of the last fully connected layer or the hidden / embedding size, or a `b x n x h` where `n`
is the number of input features and `h` is the hidden / embedding size if no reduction is applied.
literature_references:
- https://arxiv.org/abs/2012.06678
compute_tier: 2
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning: Taken from published literature (https://arxiv.org/abs/1706.03762).
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embed_input_feature_name:
default_value_reasoning:
Though the ideal embedding size depends on the task
and dataset, setting the feature embedding size equal to the hidden size
and adding feature embeddings to hidden representations ('add') is a good
starting point.
description_implications:
Input feature name embeddings have been shown to
improve performance of deep learning methods on tabular data. Feature
name embeddings play a similar role to positional embeddings in a language
model, allowing the network to learn conditional dependencies between
input features.
example_value:
- 64
literature_references:
- "TabTransformer: Tabular Data Modeling Using Contextual Embeddings"
other_information:
Must be an integer, 'add', or null. If an integer, specifies
the embedding size for input feature names. Input feature name embeddings
will be concatenated to hidden representations. Must be less than or equal
to hidden_size. If 'add', input feature names use embeddings the same
size as hidden_size, and are added (element-wise) to the hidden representations.
If null, input feature embeddings are not used.
related_parameters:
- hidden_size
ui_display_name: Embed Input Feature Name
expected_impact: 3
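# Illustrative sketch (not part of the metadata) of the three accepted forms of
# embed_input_feature_name, with arbitrary example values:
#   combiner:
#     type: tabtransformer
#     hidden_size: 256
#     embed_input_feature_name: add     # embeddings of size hidden_size, added element-wise
#     # embed_input_feature_name: 64    # size 64 embeddings, concatenated to hidden representations
#     # embed_input_feature_name: null  # no input feature name embeddings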
fc_activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation function has an impact
on the computational load of the model and might require further hyperparameter
tuning.
example_value:
- relu
expected_impact: 1
literature_references:
- https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html
related_parameters:
- activation
- activation_function
- conv_activation
- recurrent_activation
suggested_values: relu, alternatively leakyRelu or elu
suggested_values_reasoning:
The default value works well in the majority
of cases.
ui_display_name: FC Activation
fc_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: FC Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
fc_residual:
ui_display_name: null
expected_impact: 1
hidden_size:
default_value_reasoning: Not too big, not too small.
description_implications:
Increasing the hidden size increases the model's capacity
to capture more complexity, but makes the model larger and slower to train.
It also increases the chance of overfitting.
expected_impact: 2
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
ui_display_name: null
expected_impact: 1
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 3
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_heads:
default_value_reasoning:
"The middle value explored in the original TabTransformer
paper. Source: https://arxiv.org/pdf/2012.06678.pdf"
description_implications:
Increasing the number of attention heads can increase
model performance at the cost of additional compute and memory.
example_value:
- 8
expected_impact: 1
literature_references:
- https://arxiv.org/pdf/2012.06678.pdf
suggested_values: 16
suggested_values_reasoning:
If your model is underperforming, increasing the
number of attention heads can improve its ability to correlate items in
a sequence.
ui_display_name: Number of attention heads
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data. For
many data types, one layer is sufficient.
description_implications:
"The ideal number of transformer layers depends
on the length and complexity of input sequences, as well as the task.
For more complex tasks, a higher number of transformer layers may be
useful. However, too many layers will increase memory and slow training
while providing diminishing returns of model performance."
example_value:
- 1
expected_impact: 1
suggested_values: 1 - 12
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Transformer Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers
- fc_layers
suggested_values: 19 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
reduce_output:
ui_display_name: null
expected_impact: 1
transformer_output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 2
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers
- fc_layers
suggested_values: 20 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Transformer Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values: "TRUE"
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
transformer:
type:
short_description: The transformer combiner combines input features using a stack of Transformer blocks.
long_description:
The transformer combiner combines input features using a stack of Transformer blocks (from
Attention Is All You Need). It assumes all outputs from encoders are tensors of size `b x h`
where `b` is the batch size and `h` is the hidden dimension, which can be different for each
input. If the input tensors have a different shape, it automatically flattens them. It then
projects each input tensor to the same hidden / embedding size and encodes them with a stack of
Transformer layers. Finally, the transformer combiner applies a reduction to the outputs of the
Transformer stack, followed by optional fully connected layers. The output is a `b x h` tensor
where `h` is the size of the last fully connected layer or the hidden / embedding size, or a
`b x n x h` where `n` is the number of input features and `h` is the hidden / embedding size if
no reduction is applied.
literature_references:
- https://arxiv.org/abs/1706.03762
compute_tier: 2
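# Illustrative sketch only: how the parameters documented in this section
# might appear together in a user's Ludwig config. All values are assumptions
# chosen for the example, not recommended defaults.
#
# combiner:
#   type: transformer
#   num_layers: 1          # depth of the Transformer stack
#   hidden_size: 256       # size each input tensor is projected to
#   num_heads: 8           # assumed here to divide hidden_size evenly
#   dropout: 0.1
#   reduce_output: mean    # reduces b x n x h to b x h
#   num_fc_layers: 1       # optional FC stack applied after the reduction
#   output_size: 128       # with one FC layer, the output is b x 128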
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning: Taken from published literature (https://arxiv.org/abs/1706.03762).
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
fc_activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
example_value:
- relu
expected_impact: 1
literature_references:
- https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html
related_parameters:
- activation, activation_function, conv_activation, recurrent_activation
suggested_values: relu, alternatively leakyRelu or elu
suggested_values_reasoning:
The default value will work well in the majority
of the cases
ui_display_name: FC Activation
fc_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: FC Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
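# A hedged sketch of the two ways to define the FC stack described above.
# Values are illustrative assumptions.
#
# Uniform stack (every layer shares the same settings):
# combiner:
#   type: transformer
#   num_fc_layers: 2
#   output_size: 128
#   dropout: 0.1
#
# Per-layer control (each dict configures one layer):
# combiner:
#   type: transformer
#   fc_layers:
#     - output_size: 128
#       dropout: 0.1
#     - output_size: 64
#       norm: layer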
fc_residual:
ui_display_name: null
hidden_size:
default_value_reasoning: Not too big, not too small.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 2
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
ui_display_name: null
expected_impact: 1
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 3
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_heads:
ui_display_name: null
expected_impact: 1
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data. For
many data types, one layer is sufficient.
description_implications:
"The ideal number of transformer layers depends
on the length and complexity of input sequences, as well as the task.
For more complex tasks, a higher number of transformer layers may be
useful. However, too many layers will increase memory and slow training
while providing diminishing returns of model performance."
example_value:
- 1
expected_impact: 1
suggested_values: 1 - 12
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Transformer Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 16 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
reduce_output:
ui_display_name: null
expected_impact: 1
transformer_output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 2
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 22 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Transformer Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values: "TRUE"
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
================================================
FILE: ludwig/schema/metadata/configs/common.yaml
================================================
activation:
default_value_reasoning: The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications: Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values: relu
suggested_values_reasoning: ReLU will work well in the majority of the cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning: It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications: It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning: It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning: Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications: "Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning: Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
fc_layers:
default_value_reasoning: By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications: The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning: It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
flatten_inputs:
ui_display_name: null
expected_impact: 1
norm:
default_value_reasoning: While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications: Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values_reasoning: Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
ui_display_name: null
expected_impact: 1
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications: Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 3
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning: The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications: If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information: If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 16 - 1024
suggested_values_reasoning: Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
residual:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning: "Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications: Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information: If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values: "TRUE"
ui_display_name: Use Bias
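# Illustrative sketch (values are assumptions): because batch normalization
# already adds a trainable shift, a layer that uses it can drop its bias
# term, while the remaining layers fall back to the module-level use_bias.
#
# use_bias: true            # default for layers that do not override it
# fc_layers:
#   - output_size: 128
#     norm: batch
#     use_bias: false       # redundant next to the batch-norm shift
#   - output_size: 64       # inherits use_bias: true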
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications: The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning: Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
embedding_initializer:
default_value_reasoning: According to https://arxiv.org/abs/1711.09160, choice
of embedding initialization is not important as long as the variance is
kept reasonably low.
description_implications:
According to https://arxiv.org/abs/1711.09160, choice
of embedding initialization is not important as long as the variance is
kept reasonably low.
example_value:
- kaiming
expected_impact: 1
literature_references:
- https://arxiv.org/abs/1711.09160
suggested_values: kaiming
suggested_values_reasoning: https://discuss.huggingface.co/t/state-of-the-art-technique-for-initializing-embedding-matrix/326
ui_display_name: Embedding Initialization
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications: 'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
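# Worked example of the two rules of thumb quoted above, for a hypothetical
# categorical feature with 160,000 unique values:
#   fourth root:  160000 ** 0.25      = 20
#   1.6 * sqrt:   1.6 * sqrt(160000)  = 1.6 * 400 = 640
# The two estimates differ by more than an order of magnitude, which is why
# the size is usually settled empirically. A config fragment using the second
# estimate might look like this (feature name and encoder type are assumed):
#
# input_features:
#   - name: product_id
#     type: category
#     encoder:
#       type: dense
#       embedding_size: 640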
embeddings_on_cpu:
default_value_reasoning: By default embedding matrices are stored on GPU
memory if a GPU is used, as it allows for faster access.
description_implications: By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access. However, when
the vocabulary size is very large, the full embedding matrix may be too
large to hold comfortably in GPU memory. This parameter forces the
embedding matrix to be placed in regular memory, where the CPU is used to
access it. This may slow down training due to additional data transfer
between CPU and GPU memory, but it reduces GPU memory pressure.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
default_value_reasoning:
If trained from scratch, embedding vectors are typically
learned alongside the rest of the model.
description_implications:
Typically this value is only set to False if pre-trained
embeddings are uploaded. Even then, it is reasonable to leave it as True
in order to fine-tune the embeddings.
expected_impact: 1
related_parameters:
- embedding_size, representation, pretrained_embeddings
ui_display_name: (under Embeddings header) Trainable?
pretrained_embeddings:
default_value_reasoning: Embeddings are commonly trained from scratch, or
incorporated as part of a pre-trained model package.
description_implications: If pretrained embeddings are specified, then the
model may have a head start in its representation of various input entities.
example_value:
- ~/Downloads/glove.6B.100d.txt
expected_impact: 0
related_parameters:
- embedding_size, embeddings_trainable
ui_display_name: Pretrained embeddings path
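# A minimal sketch of how the embedding parameters above could be combined
# to load pretrained vectors. The feature name and encoder type are
# assumptions; the path reuses the example value above.
#
# input_features:
#   - name: review_text
#     type: text
#     encoder:
#       type: embed
#       representation: dense
#       pretrained_embeddings: ~/Downloads/glove.6B.100d.txt
#       embedding_size: 100          # matches the 100-d vectors in that file
#       embeddings_trainable: false  # freeze; set to true to fine-tune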
max_sequence_length:
default_value_reasoning: Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
vocab:
default_value_reasoning: Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
representation:
default_value_reasoning: Trainable, randomly initialized embedding vectors
often lead to more subtle representations of input entities than one-hot
vectors.
description_implications: If set to sparse, the representations for input
entities are fixed as one-hot vectors. This leads to less flexible representations
for input entities, but could lead to faster training since there are
fewer learnable parameters.
expected_impact: 1
other_information: ""
related_parameters:
- embedding_size, embeddings_trainable, pretrained_embeddings
ui_display_name: Representation approach
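# Illustrative sketch only, not part of the metadata: one way the embedding
# parameters documented above might appear together in a user config.
# Assumes a sequence input feature using the `embed` encoder; the feature
# name `tokens` and the embedding_size value are hypothetical, and the
# embedding file path is the example_value given above.
#
#   input_features:
#     - name: tokens
#       type: sequence
#       encoder:
#         type: embed
#         representation: dense          # "sparse" would fix one-hot vectors
#         embedding_size: 100            # chosen to match a 100-d GloVe file
#         pretrained_embeddings: ~/Downloads/glove.6B.100d.txt
#         embeddings_trainable: true     # fine-tune the loaded vectors
#         embeddings_on_cpu: false       # keep embeddings on GPU if memory allows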
reduce_output:
default_value_reasoning: Sums the tensors along the sequence dimension.
description_implications: "\"last\", \"sum\", \"mean\", and \"max\" are the\
\ fastest and most memory-efficient operations\u2013 they result in tensors\
\ that are the same-size as a single item in the input sequence. However,\
\ these are simple aggregation operations, therefore some information\
\ may be lost. \n\n\"concat\" concatenates each tensor together, creating\
\ a `(sequence length)*(tensor size)`-element tensor. \"concat\" preserves\
\ this information, but can be very memory-intensive and should only be\
\ applied if the sequence length and/or tensor size is small. \n\n\"attention\"\
\ takes a weighted sum of the items in the sequence, where the weights\
\ for each item in the sequence are determined by the model on-the-fly\
\ based on the features of the item itself. This is both slower and\
\ more memory-intensive than the other operations; however, it can also\
\ provide a richer \"global\" representation of the sequence."
expected_impact: 1
related_parameters:
- max_sequence_length
suggested_values: '"attention". This and the default cover 95% of use cases.'
suggested_values_reasoning: If you would like better performance and are not
compute/memory-constrained, attention-based reduction can potentially
provide a richer global representation than the default, but note that attention
reduction does not work with `cache_encoder_embeddings=true`.
ui_display_name: Sequence Reducer
================================================
FILE: ludwig/schema/metadata/configs/decoders.yaml
================================================
BaseDecoder:
type:
expected_impact: 1
fc_layers:
expected_impact: 1
num_fc_layers:
expected_impact: 3
fc_output_size:
expected_impact: 3
fc_use_bias:
expected_impact: 1
fc_weights_initializer:
expected_impact: 1
fc_bias_initializer:
expected_impact: 1
fc_norm:
expected_impact: 2
fc_norm_params:
expected_impact: 1
fc_activation:
expected_impact: 2
fc_dropout:
expected_impact: 3
Classifier:
type:
short_description:
Projects combiner output to a vector the size of the number of available classes.
long_description:
The classifier decoder is a (potentially empty) stack of fully connected layers, followed by a
projection into a vector of size of the number of available classes, followed by a sigmoid.
expected_impact: 0
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
num_classes:
other_information: Internal Only
internal_only: true
ui_display_name: Not Displayed
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values: true
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
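# Illustrative sketch only, not part of the metadata: a category output
# feature using the classifier decoder with the suggested initializers
# described above. The feature name `label` and the fully connected layer
# sizes are hypothetical.
#
#   output_features:
#     - name: label
#       type: category
#       decoder:
#         type: classifier
#         num_fc_layers: 1               # optional stack before the projection
#         fc_output_size: 64
#         use_bias: true
#         weights_initializer: xavier_uniform
#         bias_initializer: zeros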
Projector:
type:
short_description:
Projects combiner output into an output vector.
long_description:
The Projector decoder is a (potentially empty) stack of fully connected layers, followed by a
projection into a tensor of the vector size (optionally followed by a softmax in the case of
multi-class classification).
expected_impact: 0
activation:
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of
cases.
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
clip:
ui_display_name: null
expected_impact: 1
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values: true
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
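# Illustrative sketch only, not part of the metadata: a projector decoder
# configured with the parameters described above. Assumes a vector output
# feature; the feature name `embedding_target` and the chosen values are
# hypothetical.
#
#   output_features:
#     - name: embedding_target
#       type: vector
#       decoder:
#         type: projector
#         num_fc_layers: 1               # without FC layers, output_size may have no effect
#         output_size: 128               # within the suggested 10 - 1024 range
#         use_bias: true
#         weights_initializer: xavier_uniform
#         bias_initializer: zeros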
Regressor:
type:
short_description:
Projects combiner output to a single number.
long_description:
The regressor decoder is a (potentially empty) stack of fully connected layers, followed by a
projection to a single number.
expected_impact: 0
activation:
ui_display_name: null
expected_impact: 2
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values: true
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster."
- "Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
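# Illustrative sketch only, not part of the metadata: a number output
# feature using the regressor decoder, which projects the combiner output
# to a single number. The feature name `price` is hypothetical.
#
#   output_features:
#     - name: price
#       type: number
#       decoder:
#         type: regressor
#         use_bias: true
#         weights_initializer: xavier_uniform
#         bias_initializer: zeros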
PassthroughDecoder:
type:
short_description:
Provides the raw input from the combiner.
long_description:
The passthrough decoder simply returns the raw output coming from the combiner.
expected_impact: 0
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
SequenceGeneratorDecoder:
type:
short_description:
Generates a sequence by sampling from the model.
long_description:
The generator decoder is a (potentially empty) stack of fully connected layers, followed by an
RNN that generates outputs feeding on its own previous predictions and generates a tensor of
size `b x s' x c`, where `b` is the batch size, `s'` is the length of the generated sequence and
`c` is the number of classes, followed by a softmax_cross_entropy. During training teacher
forcing is adopted, meaning the list of targets is provided as both inputs and outputs (shifted
by 1), while at evaluation time greedy decoding (generating one token at a time and feeding it
as input for the next step) is performed by beam search, using a beam of 1 by default. In
general a generator expects a `b x h` shaped input tensor, where `h` is a hidden dimension. The
`h` vectors are (after an optional stack of fully connected layers) fed into the rnn generator.
One exception is when the generator uses attention, as in that case the expected size of the
input tensor is `b x s x h`, which is the output of a sequence, text or time series input
feature without reduced outputs or the output of a sequence-based combiner. If a `b x h` input
is provided to a generator decoder using an RNN with attention instead, an error will be raised
during model building.
expected_impact: 0
cell_type:
ui_display_name: null
expected_impact: 3
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
max_sequence_length:
expected_impact: 3
ui_display_name: null
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data and
task. For many data types, one layer is sufficient.
description_implications:
Increasing the number of layers may improve model
performance for longer sequences or more complex tasks.
example_value:
- 1
expected_impact: 3
suggested_values: 1-3
suggested_values_reasoning:
Increasing the number of layers may improve decoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Recurrent Layers
reduce_input:
description_implications:
"\u201Clast\u201D: Reduces tensor by taking the\
\ last non-zero element per sequence in the sequence dimension.\n\u201C\
sum\u201D: Reduces tensor by summing across the sequence dimension.\n\u201C\
mean\u201D: Reduces tensor by taking the mean of the sequence dimension.\n\
\u201Cavg\u201D: synonym for \u201Cmean\u201D.\n\u201Cmax\u201D: Reduces\
\ tensor by taking the maximum value of the last dimension across the\
\ sequence dimension.\n\u201Cconcat\u201D: Reduces tensor by concatenating\
\ the second and last dimension.\n\u201Cattention\u201D: Reduces tensor\
\ by summing across the sequence dimension after applying feedforward\
\ attention.\n\u201Cnone\u201D: no reduction."
expected_impact: 2
ui_display_name: Combiner Reduce Mode
vocab_size:
ui_display_name: Not displayed
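# Illustrative sketch only, not part of the metadata: a generator decoder
# fed by an unreduced sequence encoder, as needed when the generator uses
# attention (input shaped `b x s x h`). Feature names and the encoder and
# cell choices are hypothetical.
#
#   input_features:
#     - name: source_text
#       type: text
#       encoder:
#         type: rnn
#         reduce_output: null            # keep the sequence dimension
#   output_features:
#     - name: target_text
#       type: text
#       decoder:
#         type: generator
#         cell_type: lstm
#         num_layers: 1                  # suggested range is 1-3
#         reduce_input: null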
SequenceTaggerDecoder:
type:
short_description:
Used for classifying each element of an input sequence.
long_description:
The tagger decoder is a (potentially empty) stack of fully connected layers,
followed by a projection into a tensor of size `b x s x c`, where `b` is the batch size,
`s` is the length of the sequence and `c` is the number of classes, followed by a
`softmax_cross_entropy`.
This decoder requires its input to be shaped as `b x s x h`, where `h` is
a hidden dimension, which is the output of a sequence, text or time series input feature without
reduced outputs or the output of a sequence-based combiner. This can be done by ensuring that
at least one of the sequence, text or time series input feature's encoders has `reduce_output` set to
`None`. This will prevent a `b x h` input from being provided to this decoder and an error
from being raised during model building.
The tagger decoder also requires the `reduce_input` parameter of the output feature to be set to `None`.
If this is not set, Ludwig will automatically override the value by setting it to None and log a warning.
expected_impact: 0
attention_embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
Increasing the embedding size may cause the model
to train more slowly, but the higher dimensionality can also improve overall
quality.
expected_impact: 2
suggested_values: 128 - 2048
suggested_values_reasoning:
Try models with smaller or larger embedding sizes
to observe relative impact.
ui_display_name: Attention Embedding Size
attention_num_heads:
ui_display_name: null
expected_impact: 1
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
max_sequence_length:
expected_impact: 3
ui_display_name: null
use_attention:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab_size:
ui_display_name: Not displayed
internal_only: true
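# Illustrative sketch only, not part of the metadata: the tagger decoder
# needs a `b x s x h` input, so the sequence encoder's `reduce_output` is
# set to null here. Feature names, the encoder type, and the attention
# settings are hypothetical.
#
#   input_features:
#     - name: tokens
#       type: sequence
#       encoder:
#         type: rnn
#         reduce_output: null            # do not collapse the sequence dimension
#   output_features:
#     - name: tags
#       type: sequence
#       decoder:
#         type: tagger
#         use_attention: false
#         use_bias: true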
UNetDecoder:
type:
short_description: The UNet decoder convolutional and up-conv layers
long_description:
Stacks of two 2D convolutional layers with optional normalization
and relu activation, preceded by an up-conv layer in all but the
final level of the decoder.
compute_tier: 1
conv_norm:
expected_impact: 2
ui_display_name: Convolutional Normalization
height:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
input_size:
other_information: Internal Only
internal_only: true
related_parameters:
- "No"
ui_display_name: Not Displayed
num_channels:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
num_classes:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
width:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
================================================
FILE: ludwig/schema/metadata/configs/encoders.yaml
================================================
BaseEncoder:
skip:
internal_only: true
other_information: Internal Only
ui_display_name: Not Displayed
HFEncoder:
trainable:
default_value_reasoning:
Trainable is disabled by default to make the model useful for generating fast baselines, which can be
further sped up by setting `preprocessing.cache_encoder_embeddings`. In many cases strong performance
can be achieved without adjusting the weights of the pretrained model, but for best performance we
recommend setting this to true.
description_implications:
"Ludwig currently supports two variations on fine-tuning, configured via the trainable encoder parameter:
(1) modifying the weights of the pretrained encoder to adapt them to the downstream task (trainable=true),
or (2) keeping the pretrained encoder weights fixed and training a stack of dense layers that sit
downstream as the combiner and decoder modules (trainable=false, default). This is sometimes distinguished
as transfer learning. Allowing the weights to be modified by setting trainable=true can significantly
improve performance on the downstream task, but will take significantly longer to train (due to the
additional backward passes over the pretrained model parameters). Additionally, more care needs to be
taken when selecting hyperparameters when trainable=true to prevent
[catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference), whereby the
model forgets all of the valuable information it learned during pretraining."
expected_impact: 3
literature_references:
- http://d2l.ai/chapter_computer-vision/fine-tuning.html
related_parameters:
- use_pretrained, pretrained_model, saved_weights_in_checkpoint
suggested_values:
- false
suggested_values_reasoning:
Freezing the weights (i.e. `trainable = False`)
is only worth trying if you are loading in pretrained weights. In that
case, check to see if your model is overfitting. If so, freezing the weights
(and therefore reducing model complexity) may be beneficial.
ui_display_name: Trainable
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
reduce_output:
default_value_reasoning: Sums the tensors along the sequence dimension.
description_implications:
"\"last\", \"sum\", \"mean\", and \"max\" are the\
\ fastest and most memory-efficient operations\u2013 they result in tensors\
\ that are the same-size as a single item in the input sequence. However,\
\ these are simple aggregation operations, therefore some information\
\ may be lost. \n\n\"concat\" concatenates each tensor together, creating\
\ a `(sequence length)*(tensor size)`-element tensor. \"concat\" preserves\
\ this information, but can be very memory-intensive and should only be\
\ applied if the sequence length and/or tensor size is small. \n\n\"attention\"\
\ takes a weighted sum of the items in the sequence, where the weights\
\ for each item in the sequence are determined by the model on-the-fly\
\ based on the features of the item itself. This is both slower and\
\ more memory-intensive than the other operations; however, it can also\
\ provide a richer \"global\" representation of the sequence."
expected_impact: 1
related_parameters:
- max_sequence_length
suggested_values: '"attention". This and the default cover 95% of use cases.'
suggested_values_reasoning:
If you would like better performance and are not
compute/memory-constrained, attention-based reduction can potentially
provide a richer global representation than the default.
ui_display_name: Sequence Reducer
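# Illustrative sketch only, not part of the metadata: the two fine-tuning
# variations described under `trainable`, shown for a text input feature.
# The feature name `review` and the `bert` encoder choice are hypothetical.
#
# (1) Fast baseline: frozen pretrained weights with cached embeddings.
#
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: bert
#         use_pretrained: true
#         trainable: false
#         reduce_output: sum
#       preprocessing:
#         cache_encoder_embeddings: true
#
# (2) Full fine-tuning: slower, and more sensitive to hyperparameters.
#
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: bert
#         use_pretrained: true
#         trainable: true
#         reduce_output: sum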
ALBERT:
type:
short_description: Similar to BERT with lower memory footprint and faster training.
long_description:
The `albert` encoder loads a pretrained [ALBERT](https://arxiv.org/abs/1909.11942) (default `albert-base-v2`) model
using the Hugging Face transformers package. Albert is similar to BERT, with significantly lower memory usage and
somewhat faster training time.
compute_tier: 2
attention_probs_dropout_prob:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- hidden_dropout_prob, classifier_dropout_prob
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_probs_dropout_prob
bos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Beginning-of-Sentence Token Id
expected_impact: 1
classifier_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- hidden_dropout_prob, attention_probs_dropout_prob
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: classifier_dropout_prob
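# Illustrative sketch only, not part of the metadata: starting from a
# dropout-free ALBERT baseline, as the default_value_reasoning above
# suggests, before adding dropout in later experiments. The feature name
# `review` is hypothetical.
#
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: albert
#         hidden_dropout_prob: 0.0
#         attention_probs_dropout_prob: 0.0
#         classifier_dropout_prob: 0.0   # raise gradually if the model overfits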
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 1
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
eos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: End-of-Sentence Token Id
expected_impact: 1
hidden_act:
default_value_reasoning: Taken from huggingface.
description_implications:
Changing this activation function will only affect
the feed-forward layers of the transformer.
example_value:
- relu
expected_impact: 1
literature_references:
- "[Hugging Face docs for ALBERT config](https://huggingface.co/docs/transformers/model_doc/albert#transformers.AlbertConfig.hidden_act)\n\
\r\n[Relevant AI Stack Exchange discussion](https://ai.stackexchange.com/questions/30341/why-does-a-transformer-not-use-an-activation-function-following-the-multi-head-a)"
suggested_values: gelu
suggested_values_reasoning: Taken from huggingface defaults.
ui_display_name: Hidden Layer Activation
hidden_dropout_prob:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "attention_probs_dropout_prob,
classifier_dropout_prob"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: hidden_dropout_prob
hidden_size:
default_value_reasoning: Huggingface default.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 1
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
initializer_range:
description_implications:
There is an ideal value for this variable that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
inner_group_num:
ui_display_name: null
expected_impact: 1
intermediate_size:
ui_display_name: null
expected_impact: 1
layer_norm_eps:
ui_display_name: null
expected_impact: 1
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 1
suggested_values: 512
suggested_values_reasoning:
Out-of-the-box value based on published literature.
Try models with smaller or larger values to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_attention_heads:
ui_display_name: null
expected_impact: 1
num_hidden_groups:
ui_display_name: null
expected_impact: 1
num_hidden_layers:
ui_display_name: null
expected_impact: 1
pad_token_id:
ui_display_name: null
expected_impact: 1
position_embedding_type:
ui_display_name: null
expected_impact: 1
pretrained_kwargs:
default_value_reasoning: These arguments typically don't need to be specified.
expected_impact: 1
related_parameters:
- pretrained_model_name_or_path
suggested_values: Default
ui_display_name: null
pretrained_model_name_or_path:
default_value_reasoning:
The default model is the canonical model for this
model architecture, and is therefore a good starting point for most use
cases.
description_implications:
"There are two factors to consider when choosing\
\ a pre-trained model: (1) size, and (2) task similarity. \n\nThe larger\
\ the model, the more subtle its comprehension of inputs can become. However,\
\ larger models are also more compute and memory-intensive to train.\n\
\nModels pretrained on highly-related source tasks are more likely to\
\ be successful on the target task. Consider searching the HuggingFace\
\ model repository for models trained on similar tasks."
expected_impact: 2
literature_references:
- https://arxiv.org/abs/1909.11942
related_parameters:
- use_pretrained, trainable, pretrained_kwargs
suggested_values: albert-large-v2, albert-base-chinese
suggested_values_reasoning:
"If you would like better performance and are
not compute/memory-constrained, increasing model capacity can potentially
provide a richer representation than the default. The suggested value
upsizes the model while maintaining the same model architecture.
Language models trained on general corpora typically generalize well.
Consider deviating from the default only if the text in the dataset originates
from another domain (e.g. languages other than English)."
ui_display_name: Pretrained model
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
type_vocab_size:
ui_display_name: null
expected_impact: 1
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
AutoTransformer:
type:
short_description: Automatically retrieves the architecture from the provided model name/path.
long_description:
The `auto_transformer` encoder automatically instantiates the model architecture for the specified
`pretrained_model_name_or_path`. Unlike the other HF encoders, `auto_transformer` does not provide a default value for
`pretrained_model_name_or_path`; this is its only mandatory parameter. See the Hugging Face
[AutoModels documentation](https://huggingface.co/docs/transformers/model_doc/auto) for more details.
literature_references:
- https://huggingface.co/docs/transformers/model_doc/auto
compute_tier: 2
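# Illustrative sketch only (not part of this metadata): a Ludwig input feature
# that uses the `auto_transformer` encoder might look like the fragment below.
# The feature name `review_text` and the chosen checkpoint are assumptions made
# for illustration; any Hugging Face model name or local path could be used.
#
# input_features:
#   - name: review_text
#     type: text
#     encoder:
#       type: auto_transformer
#       pretrained_model_name_or_path: bert-base-uncased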
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
expected_impact: 1
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 3
reduce_output:
ui_display_name: null
expected_impact: 1
trainable:
expected_impact: 3
ui_display_name: null
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
BERT:
type:
short_description: Bidirectional transformer great for language modeling.
long_description:
The bert encoder loads a pretrained BERT (default bert-base-uncased) model using the Hugging
Face transformers package. BERT is a bidirectional transformer pretrained using a combination of a
masked language modeling objective and next sentence prediction on a large corpus comprising the
Toronto Book Corpus and Wikipedia.
literature_references:
- https://arxiv.org/abs/1810.04805
compute_tier: 2
attention_probs_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- hidden_dropout_prob, classifier_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_probs_dropout_prob
classifier_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- hidden_dropout_prob, attention_probs_dropout_prob
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: classifier_dropout
gradient_checkpointing:
ui_display_name: null
expected_impact: 1
hidden_act:
default_value_reasoning: Taken from huggingface.
description_implications:
Changing this activation function will only affect
the feed-forward layers of the transformer.
example_value:
- relu
expected_impact: 1
literature_references:
- "[Hugging Face docs for BERT config](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertConfig.hidden_act)\n\
\r\n[Relevant AI Stack Exchange discussion](https://ai.stackexchange.com/questions/30341/why-does-a-transformer-not-use-an-activation-function-following-the-multi-head-a)"
suggested_values: gelu
suggested_values_reasoning: Taken from huggingface defaults.
ui_display_name: Hidden Layer Activation
hidden_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- attention_probs_dropout_prob, classifier_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: hidden_dropout_prob
hidden_size:
default_value_reasoning: Huggingface default.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 1
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
initializer_range:
description_implications:
There is an ideal value for this variable that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
intermediate_size:
ui_display_name: null
expected_impact: 1
layer_norm_eps:
ui_display_name: null
expected_impact: 1
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 2
suggested_values: 512
suggested_values_reasoning:
Out-of-the-box value based on published literature.
Try models with smaller or larger values to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_attention_heads:
ui_display_name: null
expected_impact: 1
num_hidden_layers:
ui_display_name: null
expected_impact: 1
pad_token_id:
ui_display_name: null
expected_impact: 1
position_embedding_type:
ui_display_name: null
expected_impact: 1
pretrained_kwargs:
ui_display_name: null
expected_impact: 1
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
type_vocab_size:
ui_display_name: null
expected_impact: 1
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
BagEmbedWeighted:
type:
short_description: Transforms feature to vector, maps to sparse or dense embeddings, then aggregates.
long_description:
The embed weighted encoder first transforms the element frequency vector to sparse integer
lists, which are then mapped to either dense or sparse embeddings (one-hot encodings). Lastly,
embeddings are aggregated as a weighted sum where each embedding is multiplied by its respective
element's frequency. Inputs are of size b while outputs are of size b x h where b is the batch
size and h is the dimensionality of the embeddings.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of
cases.
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
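# Worked example of the two rules of thumb quoted above, assuming a
# hypothetical vocabulary of 10,000 unique elements (an illustrative number,
# not taken from any dataset):
#
#   1.6 * sqrt(10000)    = 1.6 * 100 = 160
#   fourth root of 10000 = 10
#
# The two heuristics can differ by an order of magnitude, which is one reason
# to treat them as starting points and compare a few sizes empirically.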
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access. However, when
the vocabulary size is very large, the full embedding matrix may be too
large to fit comfortably in GPU memory. This parameter forces the placement
of the embedding matrix in regular memory, where the CPU is used to access
it. This may slow down training due to additional data transfer between
CPU and GPU memory, but reduces pressure on GPU memory.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
default_value_reasoning:
If trained from scratch, embedding vectors are typically
learned alongside the rest of the model.
description_implications:
Typically this value is only set to False if pre-trained
embeddings are uploaded. Even then, it is reasonable to leave it as True
in order to fine-tune the embeddings.
expected_impact: 1
related_parameters:
- embedding_size, representation, pretrained_embeddings
ui_display_name: (under Embeddings header) Trainable?
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
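# Illustrative sketch only (not part of this metadata): the per-layer form of
# `fc_layers` described above, placed in a Ludwig feature config. The feature
# name `tags` is hypothetical, and the nesting under `encoder` assumes the
# encoder-section config layout; the layer values mirror `example_value`.
#
# input_features:
#   - name: tags
#     type: bag
#     encoder:
#       type: embed
#       fc_layers:
#         - output_size: 128
#           dropout: 0.1
#         - output_size: 64
#           norm: layer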
force_embedding_size:
default_value_reasoning:
It is not often the case that the user has a strict
need for using an embedding size that should be larger than the vocabulary
size.
description_implications:
Should only be True if the user has a strict need
for using an embedding size that should be larger than the vocabulary
size. For example, there may be size requirements across multiple features
imposed by downstream modules like the ComparatorCombiner.
expected_impact: 1
related_parameters:
- embedding_size
suggested_values:
- false
suggested_values_reasoning: True for advanced usage only.
ui_display_name: Force Embedding Size
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways the parameters specified
here could influence performance. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
pretrained_embeddings:
default_value_reasoning:
Embeddings are commonly trained from scratch, or
incorporated as part of a pre-trained model package.
description_implications:
If pretrained embeddings are specified, then the
model may have a head start in its representation of various input entities.
example_value:
- ~/Downloads/glove.6B.100d.txt
expected_impact: 0
related_parameters:
- embedding_size, embeddings_trainable
ui_display_name: Pretrained embeddings path
representation:
default_value_reasoning:
Trainable, randomly initialized embedding vectors
often lead to more subtle representations of input entities than one-hot
vectors.
description_implications:
If set to sparse, the representations for input
entities are fixed as one-hot vectors. This leads to less flexible representations
for input entities, but could lead to faster training since there are
fewer learnable parameters.
expected_impact: 1
other_information: ""
related_parameters:
- embedding_size, embeddings_trainable, pretrained_embeddings
ui_display_name: Representation approach
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from published [literature](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. For example, randomly initializing
weights risks non-reproducibility (and possibly worse general
training performance), while sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
CTRL:
type:
short_description: Language model trained to condition on control codes that govern style, content and task-specific behavior.
long_description:
The `ctrl` encoder loads a pretrained [CTRL](https://arxiv.org/abs/1909.05858) (default `ctrl`) model using the Hugging
Face transformers package. CTRL is a conditional transformer language model trained to condition on control codes that
govern style, content, and task-specific behavior.
literature_references:
- https://arxiv.org/abs/1909.05858
compute_tier: 2
attn_pdrop:
ui_display_name: null
dff:
ui_display_name: null
embd_pdrop:
ui_display_name: null
initializer_range:
description_implications:
There is an ideal range for this value that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
layer_norm_epsilon:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
n_ctx:
ui_display_name: null
n_embd:
ui_display_name: null
n_head:
ui_display_name: null
n_layer:
ui_display_name: null
n_positions:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
resid_pdrop:
ui_display_name: null
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
CamemBERT:
type:
short_description: Language model trained on large French text corpus.
long_description:
The `camembert` encoder loads a pretrained [CamemBERT](https://arxiv.org/abs/1911.03894)
(default `jplu/tf-camembert-base`) model using the Hugging Face transformers package. CamemBERT is pre-trained on a
large French language web-crawled text corpus.
literature_references:
- https://arxiv.org/abs/1911.03894
compute_tier: 2
attention_probs_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- classifier_dropout, hidden_dropout_prob
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_probs_dropout_prob
classifier_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- attention_probs_dropout_prob, hidden_dropout_prob
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: classifier_dropout
gradient_checkpointing:
ui_display_name: null
hidden_act:
default_value_reasoning: Taken from huggingface.
description_implications:
Changing this activation function will only affect
the feed-forward layers of the transformer.
example_value:
- relu
expected_impact: 1
literature_references:
- "[Relevant AI Stack Exchange discussion](https://ai.stackexchange.com/questions/30341/why-does-a-transformer-not-use-an-activation-function-following-the-multi-head-a)"
suggested_values: gelu
suggested_values_reasoning: Taken from huggingface defaults.
ui_display_name: Hidden Layer Activation
hidden_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- attention_probs_dropout_prob, classifier_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: hidden_dropout_prob
hidden_size:
default_value_reasoning: Huggingface default.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 1
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
initializer_range:
description_implications:
There is an ideal range for this value that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
intermediate_size:
ui_display_name: null
layer_norm_eps:
ui_display_name: null
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 2
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_attention_heads:
ui_display_name: null
num_hidden_layers:
ui_display_name: null
pad_token_id:
ui_display_name: null
position_embedding_type:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
type_vocab_size:
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
CategoricalEmbed:
type:
short_description: Maps the categorical feature to a dense embedding.
long_description:
The dense encoder maps the categorical feature to a dense embedding, which is returned as an output of size `b x h`,
where `b` is the batch size and `h` is the dimensionality of the embeddings.
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline and adding dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_initializer:
default_value_reasoning:
According to https://arxiv.org/abs/1711.09160, choice
of embedding initialization is not important as long as the variance is
kept reasonably low.
description_implications:
According to https://arxiv.org/abs/1711.09160, choice
of embedding initialization is not important as long as the variance is
kept reasonably low.
example_value:
- kaiming
expected_impact: 1
literature_references:
- https://arxiv.org/abs/1711.09160
suggested_values: kaiming
suggested_values_reasoning: https://discuss.huggingface.co/t/state-of-the-art-technique-for-initializing-embedding-matrix/326
ui_display_name: Embedding Initialization
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory, and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
CategoricalSparse:
type:
short_description: Maps the categorical feature to a sparse embedding.
long_description:
The sparse encoder maps the categorical feature to a sparse embedding (one-hot encoding), which is returned as an output of
size `b x h`, where `b` is the batch size and `h` is the dimensionality of the embeddings.
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline and adding dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_initializer:
default_value_reasoning:
According to https://arxiv.org/abs/1711.09160, choice
of embedding initialization is not important as long as the variance is
kept reasonably low.
description_implications:
According to https://arxiv.org/abs/1711.09160, choice
of embedding initialization is not important as long as the variance is
kept reasonably low.
example_value:
- kaiming
expected_impact: 1
literature_references:
- https://arxiv.org/abs/1711.09160
suggested_values: kaiming
suggested_values_reasoning: https://discuss.huggingface.co/t/state-of-the-art-technique-for-initializing-embedding-matrix/326
ui_display_name: Embedding Initialization
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory, and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
DateEmbed:
type:
short_description: Embeds the date elements and passes them through fully connected layers.
long_description:
The Embed encoder passes the year through a fully connected layer of one neuron and embeds all
other elements for the date, concatenates them and passes the concatenated representation
through fully connected layers.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation function has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of
cases.
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline and adding dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory, and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
The parameters specified here can influence performance
in a variety of ways. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
DateWave:
type:
short_description: Embeds the date elements by taking the cosine of their value before passing through fully connected layers.
long_description:
The Wave encoder passes the year through a fully connected layer of one neuron and represents
all other elements for the date by taking the cosine of their value with a different period (12
for months, 31 for days, etc.), concatenates them and passes the concatenated representation
through fully connected layers.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways a certain set of parameters
specified could influence performance here. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
DenseEncoder:
type:
short_description: Passes the raw numerical values through fully connected layers.
long_description:
The dense encoder passes the raw numerical values through fully connected layers. In this case
the inputs of size `b` are transformed to size `b x h`.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
input_size:
internal_only: true
other_information: Internal Only
related_parameters:
- "No"
ui_display_name: Not Displayed
fc_layers:
ui_display_name: null
expected_impact: 1
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways a certain set of parameters
specified could influence performance here. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data. For
many data types, one layer is sufficient.
description_implications:
"Increasing the number of layers may improve model
performance by allowing the model to synthesize learned features derived
from the original input. If the input is simple, e.g., a category with a
few options, increasing the number of layers has no benefit. For more
complex inputs, additional layers add more 'processing power' to extract
useful information from the input.
However, more layers will increase training time and may reduce accuracy
due to overfitting."
example_value:
- 1
expected_impact: 3
other_information:
If you have multiple input features, varying the number
of layers in the combiner or output feature decoder will have more impact.
related_parameters:
- layers
suggested_values: 1-3
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
use_bias:
ui_display_name: null
expected_impact: 1
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
DistilBERT:
type:
short_description: A distilled version of BERT base that is 40% smaller and 60% faster with 95% of performance preserved.
long_description:
The `distilbert` encoder loads a pretrained [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5)
(default `distilbert-base-uncased`) model using the Hugging Face transformers package. DistilBERT is a small, fast, cheap and light
Transformer model trained by distilling BERT base. It has 40% fewer parameters than
bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance as measured
on the GLUE language understanding benchmark.
compute_tier: 2
activation:
default_value_reasoning:
This is the default activation function used in the
DistilBERT Hugging Face implementation.
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
attention_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout, qa_dropout, seq_classif_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_dropout
dim:
ui_display_name: null
dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "attention_dropout,
qa_dropout,
seq_classif_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: dropout
hidden_dim:
ui_display_name: null
initializer_range:
description_implications:
There is an ideal value for this variable that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 2
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
n_heads:
ui_display_name: null
n_layers:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
qa_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout, attention_dropout, seq_classif_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: qa_dropout
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if 1. they're not used frequently
2. the user doesn't have a lot of storage.
ui_display_name: null
seq_classif_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "dropout,
attention_dropout,
qa_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: seq_classif_dropout
sinusoidal_pos_embds:
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
ELECTRA:
type:
short_description: Transformer encoder that can be used to encode a sequence of tokens with little compute
long_description:
The `electra` encoder loads a pretrained [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) model using the Hugging Face transformers package.
ELECTRA is a new pretraining approach which trains two transformer models, the generator and the
discriminator. The generator’s role is to replace tokens in a sequence, and is therefore trained
as a masked language model. The discriminator, which is the model we’re interested in, tries to
identify which tokens were replaced by the generator in the sequence.
literature_references:
- https://openreview.net/pdf?id=r1xMH1BtvB
compute_tier: 2
attention_probs_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- hidden_dropout_prob, classifier_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_probs_dropout_prob
classifier_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- hidden_dropout_prob, attention_probs_dropout_prob
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: classifier_dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 1
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
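# Worked example of the two rules of thumb quoted above. This is illustrative
# arithmetic only; the vocabulary size is an assumed value, not a Ludwig default.
#   vocab_size = 10000
#   square-root rule: 1.6 * sqrt(10000) = 1.6 * 100 = 160
#   fourth-root rule: 10000 ** 0.25 = 10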
hidden_act:
default_value_reasoning: Taken from huggingface.
description_implications:
Changing this activation function will only affect
the feed-forward layers of the transformer.
example_value:
- relu
expected_impact: 1
literature_references:
- "[Huggingface docs for ELECTRA config](https://huggingface.co/docs/transformers/model_doc/electra#transformers.ElectraConfig.hidden_act)
[Relevant AI Stack Exchange discussion](https://ai.stackexchange.com/questions/30341/why-does-a-transformer-not-use-an-activation-function-following-the-multi-head-a)"
suggested_values: gelu
suggested_values_reasoning: Taken from huggingface defaults.
ui_display_name: Hidden Layer Activation
hidden_dropout_prob:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "attention_probs_dropout_prob,
classifier_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: hidden_dropout_prob
hidden_size:
default_value_reasoning: Huggingface default.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 1
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
initializer_range:
description_implications:
There is an ideal value for this variable that doesn't
cause the outputs of these matrices to vanish or explode.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
intermediate_size:
ui_display_name: null
layer_norm_eps:
ui_display_name: null
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 2
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger values to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_attention_heads:
ui_display_name: null
num_hidden_layers:
ui_display_name: null
position_embedding_type:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
type_vocab_size:
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
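# A minimal sketch of how the settings documented above might appear in a
# Ludwig config. The feature name `review` is hypothetical, and the values are
# chosen for illustration rather than taken from this file.
#
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: electra
#         use_pretrained: true   # initialize from pretrained weights
#         trainable: false       # do not update encoder weights during training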
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
FlauBERT:
type:
short_description: Language model with a BERT-related architecture trained on a large French text corpus.
long_description:
The `flaubert` encoder loads a pretrained [FlauBERT](https://arxiv.org/abs/1912.05372) (default `jplu/tf-flaubert-base-uncased`) model
using the Hugging Face transformers package. FlauBERT has an architecture similar to BERT and is
pre-trained on a large French language corpus.
literature_references:
- https://arxiv.org/abs/1912.05372
compute_tier: 2
asm:
ui_display_name: null
expected_impact: 1
attention_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_dropout
bos_index:
ui_display_name: null
expected_impact: 1
causal:
ui_display_name: null
expected_impact: 1
dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- attention_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: dropout
emb_dim:
ui_display_name: null
expected_impact: 1
embed_init_std:
ui_display_name: null
expected_impact: 1
eos_index:
ui_display_name: null
expected_impact: 1
gelu_activation:
ui_display_name: null
expected_impact: 1
init_std:
ui_display_name: null
expected_impact: 1
is_encoder:
ui_display_name: null
expected_impact: 1
lang_id:
ui_display_name: null
expected_impact: 1
layer_norm_eps:
ui_display_name: null
expected_impact: 1
layerdrop:
ui_display_name: null
expected_impact: 1
mask_index:
ui_display_name: null
expected_impact: 1
mask_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Mask Token ID
expected_impact: 1
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 1
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger values to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
n_heads:
ui_display_name: null
expected_impact: 1
n_langs:
default_value_reasoning: Default value used in pre-trained HF encoder.
expected_impact: 1
ui_display_name: Number of Languages
n_layers:
ui_display_name: null
expected_impact: 1
pad_index:
ui_display_name: null
expected_impact: 1
pre_norm:
ui_display_name: null
expected_impact: 1
pretrained_kwargs:
ui_display_name: null
expected_impact: 1
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
sinusoidal_embeddings:
ui_display_name: null
expected_impact: 1
trainable:
ui_display_name: null
expected_impact: 3
unk_index:
ui_display_name: null
expected_impact: 1
use_lang_emb:
ui_display_name: null
expected_impact: 1
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
GPT2:
type:
short_description: GPT-2 is a pre-trained language model used for NLP tasks like generation, summarization, and translation.
long_description: The `gpt2` encoder loads a pretrained
[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
(default `gpt2`) model using the Hugging Face transformers package. GPT-2 is a causal (unidirectional) transformer pretrained using language
modeling on a very large corpus of ~40 GB of text data.
literature_references:
- https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
compute_tier: 3
activation_function:
ui_display_name: null
attn_pdrop:
ui_display_name: null
embd_pdrop:
ui_display_name: null
initializer_range:
description_implications:
There is an ideal value for this variable that doesn't
cause the outputs of these matrices to vanish or explode.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
layer_norm_epsilon:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
n_ctx:
ui_display_name: null
n_embd:
ui_display_name: null
n_head:
ui_display_name: null
n_inner:
ui_display_name: null
n_layer:
ui_display_name: null
n_positions:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
resid_pdrop:
ui_display_name: null
scale_attn_weights:
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
GPT:
type:
short_description: GPT is a pre-trained language model used for NLP tasks like generation, summarization, and translation.
long_description: The `gpt` encoder loads a pretrained
[GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
(default `openai-gpt`) model using the Hugging Face transformers package.
GPT is a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
literature_references:
- https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
compute_tier: 2
afn:
ui_display_name: null
attn_pdrop:
ui_display_name: null
embd_pdrop:
ui_display_name: null
initializer_range:
description_implications:
There is an ideal value for this variable that doesn't
cause the outputs of these matrices to vanish or explode.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
layer_norm_epsilon:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
n_ctx:
ui_display_name: null
n_embd:
ui_display_name: null
n_head:
ui_display_name: null
n_layer:
ui_display_name: null
n_positions:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
resid_pdrop:
ui_display_name: null
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
H3Embed:
type:
short_description: Encodes each H3 component with embeddings then takes a sum and passes them through fully connected layers.
long_description:
The Embed encoder encodes each component of the H3 representation (mode, edge, resolution,
base cell and children cells) with embeddings. Children cells with value 0 will be masked out.
After the embedding, all embeddings are summed and optionally passed through a stack of fully
connected layers.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default embeddings matrices are stored on GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default embeddings matrices are stored on GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be really big and unwieldy to have in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory and the CPU is
used to access them. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a big enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
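# A sketch of the two ways to configure the fully connected stack described
# above, assuming the H3 `embed` encoder. The values are illustrative only.
#
# Option 1: identical layers built from the shared parameters.
#   encoder:
#     type: embed
#     num_fc_layers: 2
#     output_size: 64
#
# Option 2: per-layer control with a list of dictionaries.
#   encoder:
#     type: embed
#     fc_layers:
#       - output_size: 128
#         dropout: 0.1
#       - output_size: 64
#         norm: layer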
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways a certain set of parameters
specified could influence performance here. Broadly speaking the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
reduce_output:
default_value_reasoning: Sums the tensors along the sequence dimension.
description_implications:
"\"last\", \"sum\", \"mean\", and \"max\" are the\
\ fastest and most memory-efficient operations; they result in tensors\
\ that are the same size as a single item in the input sequence. However,\
\ these are simple aggregation operations, therefore some information\
\ may be lost. \n\n\"concat\" concatenates each tensor together, creating\
\ a `(sequence length)*(tensor size)`-element tensor. \"concat\" preserves\
\ this information, but can be very memory-intensive and should only be\
\ applied if the sequence length and/or tensor size is small. \n\n\"attention\"\
\ takes a weighted sum of the items in the sequence, where the weights\
\ for each item in the sequence are determined by the model on-the-fly\
\ based on the features of the item itself. This is both slower and\
\ more memory-intensive than the other operations; however, it can also\
\ provide a richer \"global\" representation of the sequence."
expected_impact: 1
related_parameters:
- max_sequence_length
suggested_values: '"attention". This and the default cover 95% of use cases.'
suggested_values_reasoning:
If you would like better performance and are not
compute/memory-constrained, attention-based reduction can potentially
provide a richer global representation than the default.
ui_display_name: Sequence Reducer
use_bias:
ui_display_name: null
expected_impact: 1
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
H3RNN:
type:
short_description: Encodes each H3 component with embeddings then passes them through an RNN encoder.
long_description:
The RNN encoder encodes each component of the H3 representation (mode, edge, resolution,
base cell and children cells) with embeddings. Children cells with value 0 will be masked out.
After the embedding, all embeddings are passed through an RNN encoder. The intuition behind this
is that, starting from the base cell, the sequence of children cells can be seen as a sequence
encoding the path in the tree of all H3 hexes.
activation:
ui_display_name: null
expected_impact: 1
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 2
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
bidirectional:
default_value_reasoning:
For short sequences, it is reasonable to use a vanilla
RNN.
description_implications:
Setting bidirectional to True may increase the compute
and memory requirements of the model, but may also increase model performance
on long sequences.
expected_impact: 0
literature_references:
- https://devopedia.org/bidirectional-rnn#:~:text=RNN%20has%20the%20limitation%20that,forward%20and%20reverse%20time%20order.
related_parameters:
- cell_type, activation, recurrent_activation, use_bias
suggested_values:
- true
suggested_values_reasoning:
"RNNs can sometimes suffer from catastrophic forgetting
(source: https://en.wikipedia.org/wiki/Catastrophic_interference ) on
long sequences. Allowing the RNN to read from both the beginning and end
of the sequence can improve its representation at each timestep."
ui_display_name: Bidirectional
cell_type:
default_value_reasoning:
The LSTM cell has proven to be the most performant
of the three cells.
description_implications:
"There are two reasons to consider other cell types:
(1) compute costs and (2) catastrophic forgetting (source: https://en.wikipedia.org/wiki/Catastrophic_interference
). RNNs have marginally lower compute costs, but are prone to catastrophic
forgetting."
expected_impact: 3
related_parameters:
- "bidirectional
activation
recurrent_activation
use_bias"
ui_display_name: Cell Type
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- recurrent_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
into which high-dimensional vectors like words can be translated, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
hidden_size:
default_value_reasoning:
H3 values are numbers, so a small RNN dimensionality
is likely sufficient.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 2
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data. For
many data types, one layer is sufficient.
description_implications:
Increasing the number of layers may improve model
performance for longer sequences or more complex tasks.
example_value:
- 1
expected_impact: 3
other_information:
If you have multiple input features, varying the number
of layers in the combiner or output feature decoder will have more impact.
related_parameters:
- layers
suggested_values: 1-3
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Recurrent Layers
recurrent_activation:
default_value_reasoning: "'sigmoid' is commonly used."
expected_impact: 1
other_information:
This parameter does not appear to be used anywhere in the
code base. It is passed down but not used in the actual RNN forward
functions.
suggested_values: sigmoid, ReLU, tanh
ui_display_name: null
recurrent_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Recurrent Dropout
recurrent_initializer:
ui_display_name: null
expected_impact: 1
reduce_output:
ui_display_name: null
expected_impact: 1
unit_forget_bias:
ui_display_name: null
expected_impact: 1
use_bias:
ui_display_name: null
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
H3WeightedSum:
type:
short_description: Encodes each H3 component with embeddings then takes a weighted sum.
long_description:
The Weighted Sum encoder encodes each component of the H3 representation (mode, edge,
resolution, base cell and children cells) with embeddings. Children cells with value 0 will be
masked out. After the embedding, all embeddings are summed with a weighted sum (with learned
weights) and optionally passed through a stack of fully connected layers.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
into which high-dimensional vectors like words can be translated, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways a certain set of parameters
specified could influence performance here. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
should_softmax:
ui_display_name: null
expected_impact: 1
use_bias:
ui_display_name: null
expected_impact: 1
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
Longformer:
type:
short_description: Transformer optimized for longer text inputs.
long_description:
The `longformer` encoder loads a pretrained [Longformer](https://arxiv.org/pdf/2004.05150.pdf)
(default `allenai/longformer-base-4096`) model using the Hugging Face transformers package. Longformer is a good choice
for longer text, as it supports sequences up to 4096 tokens long.
literature_references:
- https://arxiv.org/pdf/2004.05150.pdf
compute_tier: 2
attention_window:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_tokens:
ui_display_name: null
max_position_embeddings:
default_value_reasoning: Taken from Hugging Face.
description_implications:
"An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words or positions,
which can have a large vocabulary size. Ideally, after an embedding is
trained, it captures some of the semantics of the input by placing semantically
similar inputs close together in the embedding space.
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality."
expected_impact: 2
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Max Position Embeddings
type_vocab_size:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
sep_token_id:
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
MLPMixer:
type:
short_description: Image encoder which applies fully connected layers to different patches of the image.
long_description:
MLP-Mixer divides the image into equal-sized patches, applying fully connected layers to each
patch to compute per-patch representations (tokens) and combining the representations with
fully-connected mixer layers.
compute_tier: 1
avg_pool:
ui_display_name: null
channel_dim:
ui_display_name: null
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embed_size:
ui_display_name: null
height:
internal_only: true
ui_display_name: null
num_channels:
ui_display_name: null
num_layers:
default_value_reasoning:
The ideal number of layers depends on the size and
complexity of the input images. The default value is used in the paper
and tested on several image datasets.
description_implications:
Increasing the number of layers may improve model
performance for larger images or more complex image tasks.
example_value:
- 8
expected_impact: 3
literature_references:
- "MLP-Mixer: An all-MLP Architecture for Vision
https://arxiv.org/abs/2105.01601"
suggested_values: 4 - 32
suggested_values_reasoning:
Values from 8 - 32 are tested in the paper. It
is possible that fewer layers will be sufficient for some tasks.
ui_display_name: Number of Layers
patch_size:
default_value_reasoning: Taken from MLP-Mixer paper.
description_implications:
"The implications of the image patch size for this\
\ layer depend on other factors, such as the true resolution of the incoming\
\ image dataset. If the patch size is kept consistent but a higher resolution\
\ image is used as input, then the resulting chunked sequence of tokens\
\ will be longer than it would have been if the input resolution was lower.\
\ \n\nThe original MLP-Mixer paper also notes that there is a tradeoff\
\ with respect to the projection units learned by a model. In their findings,\
\ a 32x32 patch size model learned very structured low frequency projection\
\ units, while the equivalent 16x16 model learned high frequencies and\
\ showed no clear structure."
expected_impact: 2
literature_references:
- "[MLP Mixer paper](https://arxiv.org/pdf/2105.01601.pdf)"
suggested_values:
- 16
- 32
suggested_values_reasoning:
16 and 32 are the values used in the original
MLP Mixer paper
ui_display_name: Patch Size
token_size:
ui_display_name: null
width:
internal_only: true
ui_display_name: null
MT5:
type:
short_description: MT5 is a multilingual variant of T5 useful for multilingual NLP use cases.
long_description:
The `mt5` encoder loads a pretrained [MT5](https://arxiv.org/abs/2010.11934) (default `google/mt5-base`) model using the
Hugging Face transformers package. MT5 is a multilingual variant of T5 trained on a dataset of 101 languages.
compute_tier: 2
d_ff:
default_value_reasoning: Default value matches the pre-trained encoder.
description_implications:
If using a pre-trained encoder, this parameter will
be automatically derived from the pre-trained model.
expected_impact: 1
ui_display_name: Dimensionality of Feed-Forward Layer
d_kv:
ui_display_name: null
d_model:
ui_display_name: null
decoder_start_token_id:
ui_display_name: null
dropout_rate:
default_value_reasoning: Hugging Face default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout Rate
eos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: End-of-Sentence Token Id
feed_forward_proj:
ui_display_name: null
initializer_factor:
ui_display_name: null
is_encoder_decoder:
ui_display_name: null
layer_norm_epsilon:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_decoder_layers:
ui_display_name: null
num_heads:
ui_display_name: null
num_layers:
default_value_reasoning:
The default value matches the number of layers in
the default pretrained encoder.
description_implications:
"The ideal number of transformer layers depends
on the length and complexity of input sequences, as well as the task.
If using a pre-trained encoder, this parameter will be automatically derived
from the pre-trained model."
example_value:
- 8
expected_impact: 3
related_parameters:
- pretrained_model_name_or_path
suggested_values: 1 - 12
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Transformer Layers
pad_token_id:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
relative_attention_num_buckets:
ui_display_name: null
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
tie_word_embeddings:
default_value_reasoning:
Keeping the word embeddings separate ensures maximum
modeling flexibility.
description_implications:
The main tradeoff between True and False values
is in compute costs and model flexibility. If set to False, the model
will require more memory, but may be more flexible. If set to True, the
opposite is true.
example_value:
- true
expected_impact: 2
suggested_values:
- true
suggested_values_reasoning:
"If set to True, then the word embeddings will
be shared between the encoder and decoder. There are two main reasons
to set this value to True: (1) saving compute resources. Word embedding
tables can be very large and using a single table between the encoder
and decoder can cut one's memory usage in half. (2) If the domain of
the generated text is highly similar to the input text. For example, if
training a Question Answering (QA) text model, where both the questions
and answers are in the same language, the word embeddings used by the
encoder are likely usable by the decoder and vice-versa. On the other
hand, if training a translation model between two languages, the word
embeddings are not likely to be shareable by both model components."
ui_display_name: null
tokenizer_class:
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_cache:
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
ParallelCNN:
type:
short_description: Default option for processing sequence, audio, and text data types.
long_description:
The Parallel CNN works by first mapping the input integer sequence b x s (where b is the batch
size and s is the length of the sequence) into a sequence of embeddings, then it passes the
embedding through a number of parallel 1d convolutional layers with different filter sizes (by
default 4 layers with filter sizes 2, 3, 4, and 5), followed by max pooling and concatenation.
This single vector concatenating the outputs of the parallel convolutional layers is then passed
through a stack of fully connected layers and returned as a b x h tensor where h is the output
size of the last fully connected layer.
compute_tier: 1
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of
cases.
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to fit comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory, and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can reduce GPU memory
usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
filter_size:
ui_display_name: null
expected_impact: 2
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
The parameters specified here can influence performance
in a variety of ways. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
pool_function:
ui_display_name: Pooling function
expected_impact: 1
pool_size:
ui_display_name: null
expected_impact: 1
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
should_embed:
internal_only: true
ui_display_name: Not displayed
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
PassthroughEncoder:
type:
short_description: Passes the raw input through to the combiner.
long_description:
The passthrough encoder simply returns the raw numerical values coming from the input
placeholders as outputs. Inputs are of size `b` while outputs are of size `b x 1` where `b` is
the batch size.
input_size:
internal_only: true
other_information: Internal Only
related_parameters:
- "No"
ui_display_name: Not Displayed
BinaryPassthroughEncoder:
type:
short_description: Passes the raw input through to the combiner.
long_description:
The passthrough encoder simply returns the raw numerical values coming from the input
placeholders as outputs. Inputs are of size `b` while outputs are of size `b x 1` where `b` is
the batch size.
input_size:
internal_only: true
other_information: Internal Only
related_parameters:
- "No"
ui_display_name: Not Displayed
CategoricalPassthroughEncoder:
type:
short_description: Passes the raw input through to the combiner.
long_description:
The passthrough encoder simply returns the raw numerical values coming from the input
placeholders as outputs. Inputs are of size `b` while outputs are of size `b x 1` where `b` is
the batch size.
input_size:
internal_only: true
other_information: Internal Only
related_parameters:
- "No"
ui_display_name: Not Displayed
ResNet:
type:
short_description: Residual network achieving very high performance on computer vision tasks.
long_description:
ResNet, short for residual network, is part of a family of extremely deep architectures showing
compelling accuracy and nice convergence behaviors for computer vision applications. It is a type
of CNN architecture designed to support hundreds or thousands of convolutional layers.
compute_tier: 2
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of
cases.
ui_display_name: Activation
batch_norm_epsilon:
ui_display_name: null
batch_norm_momentum:
ui_display_name: null
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
conv_stride:
ui_display_name: null
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
first_pool_kernel_size:
ui_display_name: null
expected_impact: 1
first_pool_stride:
ui_display_name: null
expected_impact: 1
height:
internal_only: true
ui_display_name: null
kernel_size:
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways in which the parameters
specified here could influence performance. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_channels:
ui_display_name: null
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
out_channels:
ui_display_name: null
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
resnet_size:
ui_display_name: null
use_bias:
ui_display_name: null
expected_impact: 1
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
width:
internal_only: true
ui_display_name: null
DeBERTa:
type:
short_description: Improved version of BERT and RoBERTa, achieving good baseline performance on many tasks.
long_description:
The [DeBERTa](https://arxiv.org/abs/2006.03654) encoder improves the BERT and RoBERTa models using
disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa
on a majority of NLU tasks with 80GB of training data.
In [DeBERTa V3](https://arxiv.org/abs/2111.09543), the authors further improved the efficiency of DeBERTa
using ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing. Compared to DeBERTa,
the V3 version significantly improves the model performance on downstream tasks.
compute_tier: 2
literature_references:
- https://arxiv.org/abs/2006.03654
- https://arxiv.org/abs/2111.09543
pretrained_model_name_or_path:
default_value_reasoning:
The default model was selected based on the benchmarking work done by IBM's
[model recycling](https://ibm.github.io/model-recycling/microsoft_deberta-v3-base_table.html) project.
In that study, the selected model ranked first among all variants of the `microsoft/deberta-v3-base`
architecture on an evaluation across 36 different datasets.
description_implications:
Considerations when selecting a pretrained model version include number of parameters (how long the model
will take to fine-tune / perform inference), general model performance on various benchmarks, and
specific model performance on the task you wish to fine-tune it on.
expected_impact: 2
related_parameters:
- use_pretrained, trainable, pretrained_kwargs
ui_display_name: Pretrained model
RoBERTa:
type:
short_description: BERT-based model that has higher accuracy and is easier to parallelize due to larger mini-batches.
long_description:
The `roberta` encoder loads a pretrained [RoBERTa](https://arxiv.org/abs/1907.11692) (default `roberta-base`) model
using the Hugging Face transformers package. RoBERTa is a replication of BERT pretraining that may match or exceed
the performance of BERT. It builds on BERT and modifies key hyperparameters, removing the
next-sentence pretraining objective and training with much larger mini-batches and learning
rates.
literature_references:
- https://arxiv.org/abs/1907.11692
compute_tier: 2
bos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Beginning-of-Sentence Token Id
eos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
expected_impact: 1
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
pad_token_id:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not Displayed
SequenceEmbed:
type:
short_description: Maps each element of the sequence to an embedding.
long_description:
The embed encoder simply maps each integer in the sequence to an embedding, creating a `b x s x h`
tensor where `b` is the batch size, `s` is the length of the sequence and `h` is the embedding
size. The tensor is reduced along the `s` dimension to obtain a single vector of size `h` for each
element of the batch.
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
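# Worked illustration of the two rules of thumb quoted above. The vocabulary
# of 10,000 unique elements is a hypothetical number chosen for easy
# arithmetic, not a recommendation:
#   fourth root of vocab_size:  10000 ** 0.25      = 10
#   1.6 * sqrt(vocab_size):     1.6 * sqrt(10000)  = 1.6 * 100 = 160
# The two heuristics can differ by an order of magnitude, so it is still
# worth trying smaller and larger sizes empirically.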
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too big to fit comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory, and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but it reduces pressure on GPU
memory.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
SequencePassthrough:
type:
short_description: Transforms sequence values to floats, then reduces to obtain a vector for each element.
long_description:
The passthrough encoder simply transforms each input value into a float value and adds a
dimension to the input tensor, creating a b x s x 1 tensor where b is the batch size and s is
the length of the sequence. The tensor is reduced along the s dimension to obtain a single
vector of size h for each element of the batch.
encoding_size:
default_value_reasoning:
The default `reduce_output` method does not use this
parameter, so by default this parameter is not set.
description_implications:
This parameter must be equal to the size of the
input. Otherwise, an error will occur.
example_value:
- 128
expected_impact: 1
related_parameters:
- reduce_output
suggested_values_reasoning: NONE
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
reduce_output:
ui_display_name: null
expected_impact: 1
SetSparseEncoder:
type:
short_description: Maps raw values to sparse integer lists, then maps to dense/sparse embeddings, then reduces to final vector.
long_description:
The Embed encoder takes the raw binary values coming from the input placeholders and transforms
them into sparse integer lists. These are then mapped to either dense or sparse embeddings (one-hot
encodings), and finally reduced on the sequence dimension and returned as an aggregated
embedding vector. Inputs are of size b, while outputs are of size b x h, where b is the batch size
and h is the dimensionality of the embeddings.
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation function has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of
cases.
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too big to fit comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory, and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but it reduces pressure on GPU
memory.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control over
each individual layer in the stack.
ui_display_name: Fully Connected Layers
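# Sketch of the two ways of defining the stack described above. The
# surrounding `encoder:` key and the layer sizes are assumptions made for
# illustration only.
# Shorthand, two layers with identical properties:
#   encoder:
#     num_fc_layers: 2
#     output_size: 64
# Explicit per-layer control via fc_layers (values taken from example_value):
#   encoder:
#     fc_layers:
#       - output_size: 128
#         dropout: 0.1
#       - output_size: 64
#         norm: layer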
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability.
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways in which the parameters
specified here could influence performance. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
representation:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
Stacked2DCNN:
type:
short_description: Stack of 2D convolutional layers followed by an optional stack of fully connected layers.
long_description:
Stack of 2D convolutional layers with optional normalization, dropout, and down-sampling
pooling layers, followed by an optional stack of fully connected layers.
compute_tier: 1
conv_activation:
expected_impact: 1
ui_display_name: Convolutional Activation
conv_bias:
expected_impact: 1
ui_display_name: Convolutional Bias
conv_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "conv_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Convolutional Dropout
conv_norm:
expected_impact: 2
ui_display_name: Convolutional Normalization
conv_norm_params:
expected_impact: 1
ui_display_name: Convolutional Normalization Parameters
dilation:
expected_impact: 1
ui_display_name: Dilation
fc_activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
example_value:
- relu
expected_impact: 1
literature_references:
- https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html
related_parameters:
- activation, activation_function, conv_activation, recurrent_activation
suggested_values: relu, alternatively leakyRelu or elu
suggested_values_reasoning:
The default value will work well in the majority
of the cases
ui_display_name: FC Activation
fc_bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
fc_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "conv_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: FC Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a big enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
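# Illustrative note (not part of the schema): a minimal sketch of the two ways
# to define the stack described above, assuming these keys sit inside an
# encoder section of a Ludwig config. All values are hypothetical.
#
#   # identical layers, built from the individual parameters
#   num_fc_layers: 2
#   output_size: 128
#
#   # per-layer control, one dict per layer
#   fc_layers:
#     - output_size: 128
#       dropout: 0.1
#     - output_size: 64
#       norm: layer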
fc_norm:
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate. See Torch's documentation on batch normalization for `batch`, or
Torch's documentation on layer normalization for `layer`.
expected_impact: 2
related_parameters:
- fc_norm_params
suggested_values: batch
ui_display_name: Fully Connected Normalization
fc_norm_params:
description_implications:
The parameters specified here can influence performance
in a variety of ways. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
expected_impact: 2
related_parameters:
- fc_norm
suggested_values: Depends on the type of `norm` set.
ui_display_name: Fully Connected Normalization Parameters
fc_use_bias:
expected_impact: 1
ui_display_name: FC Use Bias
fc_weights_initializer:
expected_impact: 1
ui_display_name: FC Weights Initializer
groups:
expected_impact: 1
ui_display_name: Groups
height:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
kernel_size:
expected_impact: 1
ui_display_name: Kernel Size
num_channels:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
ui_display_name: NOT DISPLAYED
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
out_channels:
expected_impact: 2
ui_display_name: Number of Output Channels
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
padding:
default_value_reasoning:
When padding is set to 'valid', as in the default
case, no padding is added. The goal of the default is to pass the raw
image through unchanged.
description_implications:
By increasing the amount of padding, you can increase
the accuracy of the image analysis in certain circumstances.
example_value:
- "'same'"
expected_impact: 1
literature_references:
- https://www.geeksforgeeks.org/cnn-introduction-to-padding/
related_parameters:
- "padding_mode,
resize method"
suggested_values:
"'same' padding if images are of different dimensions. \n\
Specific [h, w] entries can be valuable on a per dataset basis."
suggested_values_reasoning:
If your images already have padding, there is
no need to add padding, so the default is fine. If your images come in
different dimensions, then 'same' padding can help pad the images to standardized
dimensions. For certain images, adding padding to the edges can help the
CNN process the images better which can improve model performance. This
depends on the images, however.
ui_display_name: Padding
padding_mode:
expected_impact: 1
ui_display_name: Padding Mode
pool_dilation:
expected_impact: 1
ui_display_name: Pool Dilation
pool_kernel_size:
expected_impact: 1
ui_display_name: Pool Kernel Size
pool_padding:
expected_impact: 1
ui_display_name: Pool Padding
pool_stride:
expected_impact: 1
ui_display_name: Pool Stride
stride:
expected_impact: 1
ui_display_name: Stride
width:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
StackedCNN:
type:
short_description: Maps inputs to embeddings then passes them through a stack of 1d convolutional layers.
long_description:
The Stacked CNN works by first mapping the input integer sequence b x s (where b is the batch
size and s is the length of the sequence) into a sequence of embeddings, then it passes the
embedding through a stack of 1d convolutional layers with different filter sizes (by default 6
layers with filter size 7, 7, 3, 3, 3 and 3), followed by an optional final pool and by a
flatten operation. This single flattened vector is then passed through a stack of fully connected
layers and returned as a b x h tensor where h is the output size of the last fully connected
layer.
compute_tier: 1
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dilation_rate:
default_value_reasoning:
The standard discrete convolution is the same as
a 1-dilated convolution.
description_implications:
Higher dilation rates increase the effective size
of the convolutional filter. Dilated convolution may improve performance
if the data is very correlated locally and also contains long-term dependencies.
example_value:
- 2
expected_impact: 1
other_information: Dilated convolution is also known as atrous convolution.
related_parameters:
- filter_size
suggested_values: 1-3
suggested_values_reasoning:
The dilation rate is a factor which increases
the spacing between elements of the convolutional filter
ui_display_name: Dilation Rate
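# Illustrative note (not part of the schema): assuming the usual definition of
# dilation, a filter of size k with dilation rate d spans
#   k_eff = k + (k - 1) * (d - 1)
# input positions. For example, k = 3 with d = 2 covers 5 positions, and
# k = 7 with d = 3 covers 19, without adding any weights.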
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
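# Worked example of the 1.6 * sqrt(vocab_size) rule of thumb above
# (vocabulary sizes are hypothetical, chosen only for illustration):
#   vocab_size = 10,000  ->  1.6 * sqrt(10000) = 1.6 * 100 = 160
#   vocab_size = 400     ->  1.6 * sqrt(400)   = 1.6 * 20  = 32
# Treat these as starting points to tune empirically rather than fixed targets.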
embeddings_on_cpu:
default_value_reasoning:
By default embeddings matrices are stored on GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default embeddings matrices are stored on GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be really big and unwieldy to have in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can lead to healthier GPU memory
resource usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a big enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
The parameters specified here can influence performance
in a variety of ways. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_filters:
ui_display_name: null
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
padding:
ui_display_name: null
pool_function:
ui_display_name: null
expected_impact: 1
pool_padding:
ui_display_name: null
expected_impact: 1
pool_size:
ui_display_name: null
expected_impact: 1
pool_strides:
ui_display_name: null
expected_impact: 1
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
should_embed:
internal_only: true
ui_display_name: Not displayed
strides:
default_value_reasoning:
In general, it makes sense to have a smaller stride
that fits the input. Taking a simple 2D image as the input, two pixels
next to each other are strongly correlated while pixels that are further
apart will have a comparatively weaker correlation. Consequently, a higher
stride may cause significant information loss.
description_implications:
Changing the stride of a convolutional layer is
one form of downsampling (another being pooling). In the case of a large
stride, significant amounts of information are thrown away as the filter
convolves over its input. This should usually be avoided but may be desirable
in cases in which the user has some deep knowledge of the filter or of
the rest of the model architecture that makes a higher level of compression
in the output feature map of this layer acceptable.
example_value:
- 1
expected_impact: 2
literature_references:
- "[d2l.ai blog post](http://d2l.ai/chapter_convolutional-neural-networks/padding-and-strides.html)
[machinelearningmastery blogpost](https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/)
[crossvalidated discussion](https://stats.stackexchange.com/questions/296027/choosing-filter-size-strides-etc-in-a-cnn)"
related_parameters:
- pool_strides, default_strides, default_pool_strides, block_strides
suggested_values: 1-2
suggested_values_reasoning:
In general, points that are closer to each other
in the input feature space will be more strongly correlated to each other,
so it is a good idea to select a stride that captures these neighboring
relationships.
ui_display_name: Stride
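# Illustrative note (not part of the schema): assuming a 1d convolution with no
# padding and dilation 1, an input of length L_in, filter size k and stride s
# gives an output of length
#   L_out = floor((L_in - k) / s) + 1
# For example, with L_in = 100 and k = 3: s = 1 gives 98 positions, while
# s = 2 gives 49, roughly halving the feature map.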
use_bias:
ui_display_name: null
expected_impact: 1
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. For example, purely random initialization
risks non-reproducibility (and possibly worse training performance), while
constant initialization might significantly increase the time needed for
model convergence. Generally, choosing one of the probabilistic approaches
strikes a balance between the two extremes, and the literature kicked off
by the landmark [*Glorot and
Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
StackedCNNRNN:
type:
short_description: Maps inputs to embeddings, passes them through convolutional layer stack, then recurrent layer stack.
long_description:
The cnnrnn encoder works by first mapping the input integer sequence b x s (where b is the batch
size and s is the length of the sequence) into a sequence of embeddings, then it passes the
embedding through a stack of convolutional layers (by default 2), that is followed by a stack of
recurrent layers (by default 1), followed by a reduce operation that by default only returns the
last output, but can perform other reduce functions.
compute_tier: 1
activation:
ui_display_name: null
expected_impact: 2
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
bidirectional:
ui_display_name: null
expected_impact: 0
cell_type:
ui_display_name: null
expected_impact: 3
conv_activation:
ui_display_name: null
expected_impact: 1
conv_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "conv_dropout,
dropout,
recurrent_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Convolutional Dropout
dilation_rate:
default_value_reasoning:
The standard discrete convolution is the same as
a 1-dilated convolution.
description_implications:
Higher dilation rates increase the effective size
of the convolutional filter. Dilated convolution may improve performance
if the data is very correlated locally and also contains long-term dependencies.
example_value:
- 2
expected_impact: 1
other_information: Dilated convolution is also known as atrous convolution.
related_parameters:
- filter_size
suggested_values: 1-3
suggested_values_reasoning:
The dilation rate is a factor which increases
the spacing between elements of the convolutional filter
ui_display_name: Dilation Rate
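# Illustrative worked example (an editorial sketch, not part of the original
# metadata). Assuming the standard definition of dilated convolution, the
# effective filter size can be computed as
#   effective_size = filter_size + (filter_size - 1) * (dilation_rate - 1)
# For a hypothetical filter_size of 3:
#   dilation_rate 1 -> effective_size 3 (standard convolution)
#   dilation_rate 2 -> effective_size 5
#   dilation_rate 3 -> effective_size 7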
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "conv_dropout,
dropout,
recurrent_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
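# Illustrative worked example (an editorial sketch, not part of the original
# metadata). For a hypothetical vocabulary of 10,000 unique elements, the two
# rules of thumb quoted above give quite different starting points:
#   fourth root rule:        10000 ** 0.25     = 10
#   1.6 * square root rule:  1.6 * sqrt(10000) = 1.6 * 100 = 160
# The floor mentioned in the quoted passage is not applied in this arithmetic.
# Either result is only a starting point for the trial-and-error search
# described above.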
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access. However, when
the vocabulary size is very large, the full embedding matrix may be too
large to fit comfortably in GPU memory. This parameter forces the placement
of the embedding matrix in regular memory, and the CPU is used to access
it. This may slow down training due to additional data transfer between
CPU and GPU memory, but it reduces GPU memory usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
example_value:
- relu
expected_impact: 1
literature_references:
- https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html
related_parameters:
- activation, activation_function, conv_activation, recurrent_activation
suggested_values: relu, alternatively leakyRelu or elu
suggested_values_reasoning:
The default value will work well in the majority
of the cases
ui_display_name: FC Activation
fc_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "conv_dropout,
dropout,
recurrent_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: FC Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need fine-grained control over
each individual layer in the stack.
ui_display_name: Fully Connected Layers
filter_size:
ui_display_name: null
expected_impact: 2
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
The parameters specified here can influence performance
in a variety of ways. Broadly speaking, different values
allow for different levels of smoothness to be observed
in the learning curves. Since the parameters to set depend on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_filters:
ui_display_name: null
num_rec_layers:
ui_display_name: null
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
padding:
ui_display_name: null
pool_function:
ui_display_name: null
expected_impact: 1
pool_padding:
ui_display_name: null
expected_impact: 1
pool_size:
ui_display_name: null
expected_impact: 1
pool_strides:
ui_display_name: null
expected_impact: 1
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
recurrent_activation:
default_value_reasoning: sigmoid is commonly used.
expected_impact: 1
other_information:
This parameter does not appear to be used anywhere in the
code base. It is passed down but not used in the actual RNN forward
functions.
suggested_values: sigmoid, ReLu, tanh
ui_display_name: null
recurrent_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "conv_dropout,
dropout,
recurrent_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Recurrent Dropout
recurrent_initializer:
ui_display_name: null
expected_impact: 1
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
should_embed:
internal_only: true
ui_display_name: Not displayed
state_size:
ui_display_name: null
expected_impact: 3
strides:
default_value_reasoning:
In general, it makes sense to have a smaller stride
that fits the input. Taking a simple 2D image as the input, two pixels
next to each other are strongly correlated, while pixels that are further
apart will have a comparatively weaker correlation. Consequently, a higher
stride may cause significant information loss.
description_implications:
Changing the stride of a convolutional layer is
one form of downsampling (another being pooling). With a large stride,
a significant amount of information is thrown away as the filter
convolves over its input. This should usually be avoided, but may be desirable
when the user has enough knowledge of the filter or of the rest of the
model architecture to be comfortable with a higher level of compression
in the output feature map of this layer.
example_value:
- 1
expected_impact: 2
literature_references:
- "[d2l.ai blog post](http://d2l.ai/chapter_convolutional-neural-networks/padding-and-strides.html)
[machinelearningmastery blogpost](https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/)
[crossvalidated discussion](https://stats.stackexchange.com/questions/296027/choosing-filter-size-strides-etc-in-a-cnn)"
related_parameters:
- pool_strides, default_strides, default_pool_strides, block_strides
suggested_values: 1-2
suggested_values_reasoning:
In general, points that are closer to each other
in the input feature space will be more strongly correlated with each other,
so it is a good idea to select a stride that captures these neighboring
relationships.
ui_display_name: Stride
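# Illustrative worked example (an editorial sketch, not part of the original
# metadata). Assuming a 1D convolution with no padding and a dilation rate
# of 1, the output length is
#   output_length = floor((input_length - filter_size) / stride) + 1
# For a hypothetical input_length of 100 and filter_size of 3:
#   stride 1 -> floor(97 / 1) + 1 = 98
#   stride 2 -> floor(97 / 2) + 1 = 49
#   stride 4 -> floor(97 / 4) + 1 = 25
# Each doubling of the stride roughly halves the output feature map, which is
# the information loss described above. Other padding settings change the
# exact numbers.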
unit_forget_bias:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly worse
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
StackedParallelCNN:
type:
short_description: Combination of Parallel CNN and Stacked CNN encoders utilizing a stack of parallel convolutional layers.
long_description:
The Stacked Parallel CNN encoder is a combination of the Parallel CNN and the Stacked CNN
encoders, where each layer of the stack is composed of parallel convolutional layers. It works by
first mapping the input integer sequence b x s (where b is the batch size and s is the length of
the sequence) into a sequence of embeddings. It then passes the embeddings through a stack of
several parallel 1D convolutional layers with different filter sizes, followed by an optional
final pool and a flatten operation. The single flattened vector is then passed through a
stack of fully connected layers and returned as a b x h tensor, where h is the output size of the
last fully connected layer.
compute_tier: 1
activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
expected_impact: 2
suggested_values:
The default value will work well in the majority of the
cases
ui_display_name: Activation
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as this allows for faster access. However, when
the vocabulary size is very large, the full embedding matrix may be too
large to fit comfortably in GPU memory. This parameter forces the placement
of the embedding matrix in regular memory, and the CPU is used to access
it. This may slow down training due to additional data transfer between
CPU and GPU memory, but it reduces GPU memory usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need fine-grained control over
each individual layer in the stack.
ui_display_name: Fully Connected Layers
filter_size:
ui_display_name: null
expected_impact: 2
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
The parameters specified here can influence performance
in a variety of ways. Broadly speaking, different values
allow for different levels of smoothness to be observed
in the learning curves. Since the parameters to set depend on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_filters:
ui_display_name: null
num_stacked_layers:
description_implications:
While superseded by `stacked_layers`, this can directly
change the depth of the current stack of parallel convolutional layers.
example_value:
- 1
expected_impact: 1
related_parameters:
- stacked_layers
ui_display_name: Number of Stacked Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
pool_function:
ui_display_name: null
expected_impact: 1
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
should_embed:
internal_only: true
ui_display_name: Not displayed
stacked_layers:
ui_display_name: null
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
StackedRNN:
type:
short_description: Utilizes a stack of recurrent layers followed by a reduce operation.
long_description:
The rnn encoder works by first mapping the input integer sequence b x s (where b is the batch
size and s is the length of the sequence) into a sequence of embeddings, then it passes the
embedding through a stack of recurrent layers (by default 1 layer), followed by a reduce
operation that by default only returns the last output, but can perform other reduce functions.
compute_tier: 1
activation:
ui_display_name: null
expected_impact: 2
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
bidirectional:
ui_display_name: null
expected_impact: 0
cell_type:
ui_display_name: null
expected_impact: 3
dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "dropout,
recurrent_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
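# Worked example of the two rules of thumb quoted above (illustrative numbers
# only, assuming a hypothetical vocabulary of 10,000 unique tokens):
#   fourth-root rule:  10000 ** 0.25      = 10
#   1.6 * sqrt rule:   1.6 * sqrt(10000)  = 1.6 * 100 = 160
# The two heuristics differ by more than an order of magnitude here, so treat
# either value as a starting point and compare a few sizes empirically.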
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can reduce GPU memory
usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
example_value:
- relu
expected_impact: 1
literature_references:
- https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html
related_parameters:
- activation, activation_function, conv_activation, recurrent_activation
suggested_values: relu, alternatively leakyRelu or elu
suggested_values_reasoning:
The default value will work well in the majority
of the cases
ui_display_name: FC Activation
fc_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout, recurrent_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: FC Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
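# A minimal sketch of the two ways to define the FC stack described above,
# assuming the `rnn` encoder on a hypothetical text feature named `review`
# (the feature name and all values are illustrative, not taken from this file):
#
#   # Option 1: identical layers built from the individual parameters
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: rnn
#         num_fc_layers: 2
#         output_size: 128
#
#   # Option 2: per-layer control with fc_layers
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: rnn
#         fc_layers:
#           - output_size: 128
#             dropout: 0.1
#           - output_size: 64
#             norm: layer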
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways a certain set of parameters
specified could influence performance here. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few, or without any additional
fully connected layers and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data. For
many data types, one layer is sufficient.
description_implications:
Increasing the number of layers may improve model
performance for longer sequences or more complex tasks.
example_value:
- 1
expected_impact: 3
suggested_values: 1-3
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Recurrent Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
recurrent_activation:
default_value_reasoning: sigmoid is commonly used
expected_impact: 1
other_information:
I don't think that this parameter is used anywhere in the
code base. It's being passed down but not used in the actual RNN forwarding
functions.
suggested_values: sigmoid, ReLu, tanh
ui_display_name: null
recurrent_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "dropout,
recurrent_dropout,
fc_dropout"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Recurrent Dropout
recurrent_initializer:
ui_display_name: null
expected_impact: 1
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
should_embed:
internal_only: true
ui_display_name: Not displayed
state_size:
ui_display_name: null
expected_impact: 3
unit_forget_bias:
ui_display_name: null
expected_impact: 1
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
StackedTransformer:
type:
short_description: Stack of transformer blocks with optional stack of fully connected layers.
long_description:
The transformer encoder implements a stack of transformer blocks, replicating the architecture
introduced in the Attention Is All You Need paper, and adds an optional stack of fully connected
layers at the end.
literature_references:
- https://arxiv.org/abs/1706.03762
compute_tier: 2
bias_initializer:
default_value_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights.
description_implications:
It's rare to see any performance gains from choosing
a different bias initialization. Some practitioners like to use a small
constant value such as 0.01 for all biases to ensure that all ReLU units
are activated in the beginning and have some effect on the gradient. However,
it's still an open question as to whether this provides consistent improvement.
expected_impact: 1
literature_references:
- https://cs231n.github.io/neural-networks-2/
related_parameters:
- weights_initializer
suggested_values: zeros
suggested_values_reasoning:
It is possible and common to initialize the biases
to be zero, since the asymmetry breaking is provided by the small random
numbers in the weights. For ReLU non-linearities, some people like to
use a small constant value such as 0.01 for all biases because this ensures
that all ReLU units fire in the beginning and therefore obtain and propagate
some gradient. However, it is not clear if this provides a consistent
improvement (in fact some results seem to indicate that this performs
worse) and it is more common to simply use 0 bias initialization.
ui_display_name: Bias Initializer
dropout:
default_value_reasoning: Taken from published literature (https://arxiv.org/abs/1908.07442).
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- fc_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Dropout
embedding_size:
default_value_reasoning: Not too big, not too small.
description_implications:
'An embedding is a relatively low-dimensional space
that is used to translate high-dimensional vectors like words, which can
have a large vocabulary size. Ideally, after an embedding is trained, it
captures some of the semantics of the input by placing semantically similar
inputs close together in the embedding space.
In most cases, the embedding size is chosen empirically, by trial and
error. From https://www.amazon.com/dp/1098115783, "one rule of thumb is
to use the fourth root of the total number of unique categorical elements
while another is that the embedding dimension should be approximately
1.6 times the square root of the number of unique elements in the category,
and no less than 600."
Increasing the embedding size may cause the model to train more slowly,
but the higher dimensionality can also improve overall quality.'
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
suggested_values: 1.6 * sqrt(vocab_size)
suggested_values_reasoning:
Rule of thumb suggested by a deep learning textbook.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Embedding Size
embeddings_on_cpu:
default_value_reasoning:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access.
description_implications:
By default, embedding matrices are stored in GPU
memory if a GPU is used, as it allows for faster access. However, in some
cases when the vocabulary size is very large, the full embedding matrix
may be too large to hold comfortably in GPU memory. This parameter forces
the placement of the embedding matrix in regular memory and the CPU is
used to access it. This may slow down training due to additional data
transfer between CPU and GPU memory, but can reduce GPU memory
usage.
expected_impact: 1
suggested_values:
- false
suggested_values_reasoning:
If GPU memory is not a constraint, having embeddings
stored and accessed within the GPU is faster.
ui_display_name: Embeddings on CPU
embeddings_trainable:
ui_display_name: null
expected_impact: 1
fc_activation:
default_value_reasoning:
The Rectified Linear Units (ReLU) function is the
standard activation function used for adding non-linearity. It is simple,
fast, and empirically works well (https://arxiv.org/abs/1803.08375).
description_implications:
Changing the activation functions has an impact
on the computational load of the model and might require further hyperparameter
tuning.
example_value:
- relu
expected_impact: 1
literature_references:
- https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html
related_parameters:
- activation, activation_function, conv_activation, recurrent_activation
suggested_values: relu, alternatively leakyRelu or elu
suggested_values_reasoning:
The default value will work well in the majority
of the cases
ui_display_name: FC Activation
fc_dropout:
default_value_reasoning:
Dropout can cause training to become less stable.
Consider starting with a dropout-free baseline, and add dropout gradually
in subsequent experiments.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: FC Dropout
fc_layers:
default_value_reasoning:
By default the stack is built by using num_fc_layers,
output_size, use_bias, weights_initializer, bias_initializer, norm, norm_params,
activation, dropout. When a list of dictionaries is provided, the stack
is built following the parameters of each dict for building each layer.
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large enough amount of data is provided,
but also makes the model more computationally expensive and potentially
more prone to overfitting.
example_value:
- dropout: 0.1
output_size: 128
- norm: layer
output_size: 64
expected_impact: 1
related_parameters:
- output_size
- use_bias
- weights_initializer
- bias_initializer
- norm
- norm_params
- activation
- dropout
suggested_values_reasoning:
It is easier to define a stack of fully connected
layers by just specifying num_fc_layers, output_size and the other individual
parameters. It will create a stack of layers with identical properties.
Use this parameter only if you need a fine-grained level of control of
each individual layer in the stack.
ui_display_name: Fully Connected Layers
hidden_size:
default_value_reasoning: Taken from literature (https://arxiv.org/abs/1706.03762)
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 2
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes and the positional embedding matrix are
computed accurately.
internal_only: true
ui_display_name: null
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer bells
and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher learning
rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to layers
deep in the network when weights are updated. For example, batch normalization
standardizes the inputs to a layer for each mini-batch. Try out different
normalizations to see if that helps with training stability
ui_display_name: Normalization Type
norm_params:
default_value_reasoning:
The default parameters that come with Torch's implementation
of these normalization types are a trusted starting point.
description_implications:
There are a variety of ways a certain set of parameters
specified could influence performance here. Broadly speaking, the different
values passed in here allow for different levels of smoothness to be observed
in the learning curves. Since setting these parameters depends on the type
of `norm` set, see [BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
for more information on the parameters to set for batch normalization,
and see [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
for more information on the parameters to set for layer normalization.
example_value:
- affine: false
momentum: 0.2
num_features: 100
expected_impact: 1
literature_references:
- "For BatchNorm2d: https://arxiv.org/abs/1502.03167
For LayerNorm: https://arxiv.org/abs/1607.06450"
related_parameters:
- "`norm`"
suggested_values: Depends on the type of `norm` set.
suggested_values_reasoning: "NO"
ui_display_name: Normalization Parameters
num_fc_layers:
default_value_reasoning:
The encoder already has learnable parameters. Sometimes
the default is 1 for modules where the FC stack is used for shape management,
or the only source of learnable parameters.
description_implications:
Increasing num_fc_layers will increase the capacity
of the model. The model will be slower to train, and there's a higher
risk of overfitting.
example_value:
- 1
expected_impact: 1
other_information:
Not all modules that have fc_layers also have an accompanying
num_fc_layers parameter. Where both are present, fc_layers takes precedence
over num_fc_layers. Specifying num_fc_layers alone uses fully connected
layers that are configured by the defaults in FCStack.
related_parameters:
- fc_layers
suggested_values: 0-1
suggested_values_reasoning:
The full model likely contains many learnable
parameters. Consider starting with very few or no additional
fully connected layers, and add them if you observe evidence of limited
model capacity. Sometimes the default is 1 for modules where the FC stack
is used for shape management, or is the only source of learnable parameters.
ui_display_name: Number of Fully Connected Layers
num_heads:
ui_display_name: null
num_layers:
default_value_reasoning:
The ideal number of layers depends on the data. For
many data types, one layer is sufficient.
description_implications:
"The ideal number of transformer layers depends
on the length and complexity of input sequences, as well as the task.
For more complex tasks, a higher number of transformer layers may be
useful. However, too many layers will increase memory and slow training
while providing diminishing returns of model performance."
example_value:
- 1
expected_impact: 3
suggested_values: 1 - 12
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Transformer Layers
output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 3
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Output Size
pretrained_embeddings:
ui_display_name: null
expected_impact: 0
reduce_output:
ui_display_name: null
expected_impact: 1
representation:
ui_display_name: null
expected_impact: 1
should_embed:
internal_only: true
ui_display_name: Not displayed
transformer_output_size:
default_value_reasoning: A modest value, not too small, not too large.
description_implications:
If there are fully connected layers in this module,
increasing the output size of each fully connected layer will increase
the capacity of the model. However, the model may be slower to train,
and there's a higher risk of overfitting. If it seems like the model could
use even more capacity, consider increasing the number of fully connected
layers, or explore other architectures.
expected_impact: 2
other_information:
If num_fc_layers=0 and fc_layers=None, and there are no
fully connected layers defined on the module, then this parameter may
have no effect on the module's final output shape.
related_parameters:
- num_fc_layers, fc_layers
suggested_values: 10 - 1024
suggested_values_reasoning:
Increasing the output size increases the capacity
of the model. If this seems to have a positive effect, then it could be
worth increasing the number of layers, or trying a different architecture
with a larger capacity.
ui_display_name: Transformer Output Size
use_bias:
default_value_reasoning:
"Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to use bias terms.
Batch Normalization, however, adds a trainable shift parameter which is
added to the activation. When Batch Normalization is used in a layer,
bias terms are redundant and may be removed."
description_implications:
Bias terms may improve model accuracy, and don't
have much impact in terms of memory or training speed. For most models
it is reasonable to leave this parameter set to True.
example_value:
- true
expected_impact: 1
other_information:
If fc_layers is not specified, or use_bias is not specified
for individual layers, the value of use_bias will be used as the default
for all layers.
related_parameters:
- bias_initializer, fc_layers
suggested_values:
- true
ui_display_name: Use Bias
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
weights_initializer:
default_value_reasoning: Taken from [this paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
description_implications:
The method you choose to initialize layer weights
during training can have a big impact on performance as well as the reproducibility
of your final model between runs. As an example, if you were to randomly
initialize weights you would risk non-reproducibility (and possibly general
training performance), but sticking with constant values for initialization
might significantly increase the time needed for model convergence. Generally,
choosing one of the probabilistic approaches strikes a balance between
the two extremes, and the literature kicked off by the landmark [*Glorot
and Bengio* paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
provides a few good options. See this nice discussion from [Weights and
Biases](https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.)
for more information.
expected_impact: 1
literature_references:
- "Weights and Biases blog post: https://wandb.ai/site/articles/the-effects-of-weight-initialization-on-neural-nets#:~:text=Studies%20have%20shown%20that%20initializing,net%20train%20better%20and%20faster.
Glorot and Bengio paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf"
suggested_values: xavier_uniform
suggested_values_reasoning:
Changing the weights initialization scheme is
something to consider if a model is having trouble with convergence, or
otherwise it is something to experiment with after other factors are considered.
The default choice (`xavier_uniform`) is a suitable starting point for
most tasks.
ui_display_name: Layer Weights Initializer
T5:
type:
short_description: Text-to-text approach transformer with good transfer performance on multiple tasks.
long_description:
The `t5` encoder loads a pretrained [T5](https://arxiv.org/pdf/1910.10683.pdf) (default `t5-small`) model using the
Hugging Face transformers package. T5 (Text-to-Text Transfer Transformer) is pre-trained on a huge text dataset crawled
from the web and shows good transfer performance on multiple tasks.
compute_tier: 2
d_ff:
default_value_reasoning: Default value matches the pre-trained encoder.
description_implications:
If using a pre-trained encoder, this parameter will
be automatically derived from the pre-trained model.
expected_impact: 1
ui_display_name: Dimensionality of Feed-Forward Layer
d_kv:
ui_display_name: null
d_model:
ui_display_name: null
dropout_rate:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: dropout_rate
feed_forward_proj:
ui_display_name: null
initializer_factor:
ui_display_name: null
layer_norm_eps:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
num_decoder_layers:
ui_display_name: null
num_heads:
ui_display_name: null
num_layers:
default_value_reasoning:
The default value matches the number of layers in
the default pretrained encoder.
description_implications:
"The ideal number of transformer layers depends
on the length and complexity of input sequences, as well as the task.
If using a pre-trained model, this parameter will be automatically derived
from the pre-trained model."
example_value:
- 6
expected_impact: 2
related_parameters:
- pretrained_model_or_path
suggested_values: 1 - 12
suggested_values_reasoning:
Increasing the number of layers may improve encoder
performance. However, more layers will increase training time and may
cause overfitting. Small numbers of layers usually work best.
ui_display_name: Number of Transformer Layers
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
relative_attention_num_buckets:
ui_display_name: null
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
TransformerXL:
type:
short_description: Transformer architecture that introduces the notion of recurrence to the deep self-attention network.
long_description:
The `transformer_xl` encoder loads a pretrained [Transformer-XL](https://arxiv.org/abs/1901.02860)
(default `transfo-xl-wt103`) model using the Hugging Face transformers package. It adds a novel positional encoding scheme
that improves understanding and generation of long-form text up to thousands of tokens. Transformer-XL is a causal (uni-directional)
transformer with relative positioning (sinusoidal) embeddings which can reuse previously
computed hidden-states to attend to longer context (memory). This model also uses adaptive
softmax inputs and outputs (tied).
compute_tier: 2
adaptive:
default_value_reasoning: Huggingface default.
description_implications:
Adaptive softmax is a speedup technique for computing
probability distributions over words. For text with large vocabulary,
adaptive softmax improves training speed.
expected_impact: 1
related_parameters:
- vocab_size
ui_display_name: Adaptive Softmax
attn_type:
ui_display_name: null
clamp_len:
ui_display_name: null
cutoffs:
ui_display_name: null
d_embed:
ui_display_name: null
d_head:
ui_display_name: null
d_inner:
ui_display_name: null
d_model:
ui_display_name: null
div_val:
ui_display_name: null
dropatt:
ui_display_name: null
dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: dropout
eos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: End-of-Sequence Token Id
init:
ui_display_name: null
init_range:
ui_display_name: null
init_std:
ui_display_name: null
layer_norm_epsilon:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
mem_len:
ui_display_name: null
n_head:
ui_display_name: null
n_layer:
ui_display_name: null
pre_lnorm:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
proj_init_std:
ui_display_name: null
proj_share_all_but_first:
ui_display_name: null
reduce_output:
ui_display_name: null
expected_impact: 1
same_length:
ui_display_name: null
sample_softmax:
ui_display_name: null
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
untie_r:
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
TVAlexNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVBaseEncoder:
model_cache_dir:
ui_display_name: Model Cache Directory
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: Saved Weights in Checkpoint
trainable:
default_value_reasoning: By default, model components are trainable.
description_implications:
The tradeoff when using `trainable` is between speed
and flexibility. If False, fewer weights are subject to change and the
model will therefore train faster. However, the representations output
by this component are fixed for each input.
expected_impact: 3
literature_references:
- "https://www.ibm.com/cloud/learn/overfitting
http://d2l.ai/chapter_computer-vision/fine-tuning.html"
related_parameters:
- use_pretrained, pretrained_model, saved_weights_in_checkpoint
suggested_values:
- false
suggested_values_reasoning:
Freezing the weights (i.e. `trainable = False`)
is only worth trying if you are loading in pretrained weights. In that
case, check to see if your model is overfitting. If so, freezing the weights
(and therefore reducing model complexity) may be beneficial.
ui_display_name: Trainable
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
TVConvNeXtEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVDenseNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVEfficientNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVGoogLeNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVInceptionV3Encoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVMaxVitEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVMNASNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVMobileNetV2Encoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVMobileNetV3Encoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVRegNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVResNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVResNeXtEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVShuffleNetV2Encoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVSqueezeNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVSwinTransformerEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVViTEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVVGGEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
TVWideResNetEncoder:
model_variant:
ui_display_name: Model Variant
type:
ui_display_name: Type
ViT:
type:
short_description: ViT encoder divides images into patches, performs a linear transformation, and then applies a transformer.
long_description:
ViT, short for Vision Transformer, divides the image into equal-sized patches, uses a linear
transformation to encode each flattened patch, then applies a deep transformer architecture to
the sequence of encoded patches.
compute_tier: 2
attention_probs_dropout_prob:
default_value_reasoning: Taken from literature (https://arxiv.org/abs/2010.11929).
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "hidden_dropout_prob,
attention_probs_dropout_prob"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Attention Dropout
gradient_checkpointing:
ui_display_name: null
height:
internal_only: true
ui_display_name: null
hidden_act:
default_value_reasoning: Taken from huggingface.
description_implications:
Changing this activation function will only affect
the feed-forward layers of the transformer.
example_value:
- relu
expected_impact: 2
literature_references:
- "[Huggingface docs for ViT config](https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTConfig.hidden_act)
[Relevant AI Stack Exchange discussion](https://ai.stackexchange.com/questions/30341/why-does-a-transformer-not-use-an-activation-function-following-the-multi-head-a)"
suggested_values: gelu
suggested_values_reasoning: Taken from huggingface defaults.
ui_display_name: Hidden Layer Activation
hidden_dropout_prob:
default_value_reasoning: Taken from literature (https://arxiv.org/abs/2010.11929).
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 3
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- "hidden_dropout_prob,
attention_probs_dropout_prob"
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: Hidden Dropout
hidden_size:
default_value_reasoning: Huggingface default.
description_implications:
Increasing the hidden size makes the model larger
and slower to train, but increases the model's capacity to capture more complexity.
It also increases the chance of overfitting.
expected_impact: 2
suggested_values: 10 - 2048
suggested_values_reasoning:
Increasing the hidden size makes sense if the
model is underfitting. It's useful to train both smaller and larger models
to see how model capacity affects performance. This should only be explored
after the architecture of the model has been settled.
ui_display_name: Hidden Size
initializer_range:
description_implications:
There is an ideal value for this variable that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
intermediate_size:
ui_display_name: null
layer_norm_eps:
ui_display_name: null
num_attention_heads:
ui_display_name: null
num_channels:
ui_display_name: null
num_hidden_layers:
ui_display_name: null
patch_size:
default_value_reasoning: Taken from ViT paper.
description_implications:
"The implications of the image patch size for this\
\ layer depend on other factors, such as the true resolution of the incoming\
\ image dataset. If the patch size is kept consistent but a higher resolution\
\ image is used as input, then the resulting chunked sequence of tokens\
\ will be longer than it would have been if the input resolution was lower.\
\ \n\nThe ViT paper notes that decreasing the patch size in this way led\
\ to robust improvements without introducing other parameters."
expected_impact: 2
literature_references:
- "[Huggingface docs](https://huggingface.co/docs/transformers/model_doc/vit)
[ViT paper](https://arxiv.org/abs/2010.11929)"
suggested_values:
- 16
- 32
suggested_values_reasoning:
16 and 32 are the values used in the original
ViT paper.
ui_display_name: Patch Size
pretrained_model:
default_value_reasoning:
The default model is the canonical model for this
model architecture, and is therefore a good starting point for most use
cases.
description_implications:
"There are two factors to consider when choosing\
\ a pre-trained model: (1) size, and (2) task similarity. \n\nThe larger\
\ the model, the more subtle its comprehension of inputs can become. However,\
\ larger models are also more compute and memory-intensive to train.\n\
\nModels pretrained on highly-related source tasks are more likely to\
\ be successful on the target task. Consider searching the HuggingFace\
\ model repository for models trained on similar tasks."
expected_impact: 3
literature_references:
- https://arxiv.org/abs/2010.11929
related_parameters:
- use_pretrained, trainable, pretrained_kwargs
suggested_values: google/vit-large-patch16-224
suggested_values_reasoning:
"If you would like better performance and are
not compute/memory-constrained, increasing model capacity can potentially
provide a richer representation than the default. The suggested value
upsizes the model while maintaining the same model architecture.
Models trained on internet-scale datasets typically generalize well. Consider
deviating from the default only if the images in the dataset originate
from another domain (e.g. medical images, geospatial data)."
ui_display_name: Pretrained model name
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
default_value_reasoning: By default, model components are trainable.
description_implications:
The tradeoff when using `trainable` is between speed
and flexibility. If False, fewer weights are subject to change and the
model will therefore train faster. However, the representations output
by this component are fixed for each input.
expected_impact: 3
literature_references:
- "https://www.ibm.com/cloud/learn/overfitting
http://d2l.ai/chapter_computer-vision/fine-tuning.html"
related_parameters:
- use_pretrained, pretrained_model, saved_weights_in_checkpoint
suggested_values:
- false
suggested_values_reasoning:
Freezing the weights (i.e. `trainable = False`)
is only worth trying if you are loading in pretrained weights. In that
case, check to see if your model is overfitting. If so, freezing the weights
(and therefore reducing model complexity) may be beneficial.
ui_display_name: Trainable
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable, pretrained_model_name, pretrained_model_name_or_path, pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
width:
internal_only: true
ui_display_name: null
XLM:
type:
short_description: XLM is pre-trained by cross-lingual language modeling.
long_description:
The `xlm` encoder loads a pretrained [XLM](https://arxiv.org/abs/1901.07291) (default `xlm-mlm-en-2048`) model using the
Hugging Face transformers package. Pre-trained by cross-lingual language modeling.
compute_tier: 2
asm:
ui_display_name: null
attention_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: attention_dropout
bos_index:
ui_display_name: null
bos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Beginning-of-Sentence Token Id
causal:
ui_display_name: null
dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- attention_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: dropout
emb_dim:
ui_display_name: null
embed_init_std:
ui_display_name: null
end_n_top:
ui_display_name: null
eos_index:
ui_display_name: null
gelu_activation:
ui_display_name: null
expected_impact: 1
init_std:
ui_display_name: null
is_encoder:
ui_display_name: null
lang_id:
ui_display_name: null
layer_norm_eps:
ui_display_name: null
mask_index:
ui_display_name: null
mask_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Mask Token ID
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 2
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
n_heads:
ui_display_name: null
n_langs:
default_value_reasoning: Default value used in pre-trained HF encoder.
expected_impact: 1
ui_display_name: Number of Languages
n_layers:
ui_display_name: null
pad_index:
ui_display_name: null
pad_token_id:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
sinusoidal_embeddings:
ui_display_name: null
start_n_top:
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
unk_index:
ui_display_name: null
use_lang_emb:
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable
- pretrained_model_name
- pretrained_model_name_or_path
- pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
XLMRoBERTa:
type:
short_description: XLM-RoBERTa is a large multi-lingual language model trained on 2.5TB of filtered CommonCrawl data.
long_description:
The `xlmroberta` encoder loads a pretrained [XLM-RoBERTa](https://arxiv.org/abs/1911.02116)
(default `jplu/tf-xlm-roberta-base`) model using the Hugging Face transformers package. XLM-RoBERTa is a multi-language
model similar to BERT, trained on 100 languages. XLM-RoBERTa is based on Facebook’s RoBERTa model
released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered
CommonCrawl data.
compute_tier: 2
add_pooling_layer:
ui_display_name: null
bos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Beginning-of-Sentence Token Id
eos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: End-of-Sentence Token Id
max_position_embeddings:
default_value_reasoning: Taken from huggingface.
description_implications:
The size of the position embeddings table. This typically coincides with the
maximum sequence length this model might ever be used with. Typically set this
to something large just in case (e.g. 512, 1024, 2048).
expected_impact: 1
suggested_values: 512
suggested_values_reasoning:
Out of the box value based on published literature.
Try models with smaller or larger embedding sizes to observe relative
impact.
ui_display_name: Max Position Embeddings
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
pad_token_id:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
type_vocab_size:
ui_display_name: null
expected_impact: 1
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable
- pretrained_model_name
- pretrained_model_name_or_path
- pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
XLNet:
type:
short_description: XLNet is a transformer that outperforms BERT on a variety of benchmarks.
long_description:
The `xlnet` encoder loads a pretrained [XLNet](https://arxiv.org/abs/1906.08237) (default `xlnet-base-cased`) model
using the Hugging Face transformers package. XLNet is an extension of the Transformer-XL model pre-trained using
an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood
over all permutations of the input sequence factorization order. XLNet outperforms BERT on a
variety of benchmarks.
compute_tier: 2
attn_type:
ui_display_name: null
bi_data:
ui_display_name: null
bos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Beginning-of-Sentence Token Id
clamp_len:
ui_display_name: null
d_inner:
ui_display_name: null
d_model:
ui_display_name: null
dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 2
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- summary_last_dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: dropout
end_n_top:
ui_display_name: null
eos_token_id:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: End-of-Sequence Token Id
ff_activation:
ui_display_name: null
expected_impact: 1
initializer_range:
description_implications:
There is an ideal value for this variable, one that keeps
the outputs of these matrices from vanishing or exploding.
example_value:
- 0.02
expected_impact: 1
other_information: Must be greater than 0
related_parameters:
- weights_initializer
suggested_values: 0.01-0.05
suggested_values_reasoning:
Large values will likely lead to very large outputs.
Small values will lead to vanishing outputs.
ui_display_name: null
layer_norm_eps:
ui_display_name: null
max_sequence_length:
default_value_reasoning:
Sets the maximum sequence length of the expected
inputs, so input/output shapes are computed accurately.
internal_only: true
ui_display_name: null
mem_len:
ui_display_name: null
n_head:
ui_display_name: null
n_layer:
ui_display_name: null
pad_token_id:
ui_display_name: null
pretrained_kwargs:
ui_display_name: null
pretrained_model_name_or_path:
ui_display_name: null
expected_impact: 2
reduce_output:
ui_display_name: null
expected_impact: 1
reuse_len:
ui_display_name: null
same_length:
ui_display_name: null
saved_weights_in_checkpoint:
default_value_reasoning:
The weights of the encoder are not necessarily saved
in the checkpoint. The user has to save them first.
description_implications:
The memory footprint for some of these encoders
can be large.
internal_only: true
related_parameters:
- skip_save_model
suggested_values:
- false
suggested_values_reasoning:
Some of these encoders are large, so it might
be better to load them as needed, especially if (1) they're not used frequently
or (2) the user doesn't have a lot of storage.
ui_display_name: null
start_n_top:
ui_display_name: null
summary_activation:
default_value_reasoning: Default value used in pre-trained HF encoder.
ui_display_name: Summary Activation Function
expected_impact: 1
summary_last_dropout:
default_value_reasoning: Huggingface default.
description_implications:
"Dropout is a computationally cheap regularization\
\ method where during training, some neurons are randomly ignored or \u201C\
dropped out\u201D. Increasing dropout has the effect of making the training\
\ process more noisy and lowering overall network capacity, but it can\
\ be an effective regularization method to reduce overfitting and improve\
\ generalization."
example_value:
- 0.2
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
related_parameters:
- dropout
suggested_values: 0.05 - 0.8
suggested_values_reasoning:
Tuning dropout is really something to be done
when all of the big choices about architecture have been settled. Consider
starting with 0.5 and adjusting the dropout depending on observed model
performance.
ui_display_name: summary_last_dropout
summary_type:
ui_display_name: null
summary_use_proj:
ui_display_name: null
trainable:
expected_impact: 3
ui_display_name: null
untie_r:
ui_display_name: null
use_mems_eval:
ui_display_name: null
use_mems_train:
ui_display_name: null
use_pretrained:
default_value_reasoning:
By default, the model is initialized as a pretrained
model.
description_implications:
Pretrained models have typically already learned
features that are difficult to learn from scratch. They are particularly
beneficial when training on small amounts of data.
expected_impact: 3
literature_references:
- https://machinelearningmastery.com/transfer-learning-for-deep-learning/
related_parameters:
- trainable
- pretrained_model_name
- pretrained_model_name_or_path
- pretrained_kwargs
suggested_values:
- false
suggested_values_reasoning:
If you have a large amount of data and/or you
have data that differs from the typical distribution, then it might be
worth training the model from scratch.
ui_display_name: Use Pretrained
vocab:
default_value_reasoning:
Computed and passed along internally according to
preprocessing settings.
example_value:
- a
- b
- c
internal_only: true
ui_display_name: Not Displayed
vocab_size:
internal_only: true
ui_display_name: Not displayed
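# Illustrative sketch, not part of the metadata: assuming a text input
# feature hypothetically named `review`, the XLNet parameters documented
# above might appear in a Ludwig config roughly as follows. The values are
# examples only (0.2 is the example_value listed above), not recommendations.
#
#   input_features:
#     - name: review
#       type: text
#       encoder:
#         type: xlnet
#         use_pretrained: true
#         trainable: false
#         dropout: 0.2
#         summary_last_dropout: 0.2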
conv_params:
num_conv_layers:
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large amount of data is provided, but
also makes the model more computationally expensive and potentially more
prone to overfitting.
expected_impact: 3
related_parameters:
- conv_layers
ui_display_name: Number of Convolutional Layers
conv_layers:
description_implications:
The more layers that are specified the deeper and
higher capacity the model will be. This makes it possible to potentially
achieve better performance when a large amount of data is provided, but
also makes the model more computationally expensive and potentially more
prone to overfitting.
expected_impact: 1
related_parameters:
- num_conv_layers
ui_display_name: Convolutional Layers
pool_function:
default_value_reasoning:
"Within a given sliding window (e.g. a \"patch\"\
\ of a 3-channel image), the maximum value for each channel is kept. All\
\ other values in the patch are discarded. Repeat this step for every\
\ patch and you have a more compact representation of the image. \n\n\
Intuitively, each patch encodes the features from a particular part of\
\ an image, and it is more informative to look at the most prominent features\
\ of an image than the average of all of them."
description_implications:
Both average and max pooling can achieve strong
performance.
expected_impact: 1
literature_references:
- https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html
- https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/
suggested_values: Default
suggested_values_reasoning: "No"
ui_display_name: Pooling function
pool_size:
ui_display_name: null
expected_impact: 1
num_filters:
ui_display_name: null
filter_size:
ui_display_name: null
expected_impact: 2
UNetEncoder:
type:
short_description: The UNet encoder convolutional and max pool layers
long_description:
Stacks of two 2D convolutional layers with optional normalization
and relu activation, followed by a max pool layer in all but the
final level of the encoder.
compute_tier: 1
conv_norm:
expected_impact: 2
ui_display_name: Convolutional Normalization
height:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
num_channels:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
width:
default_value_reasoning:
Computed internally, automatically, based on image
data preprocessing.
internal_only: true
ui_display_name: NOT DISPLAYED
TimmEncoder:
model_name:
ui_display_name: Model Name
use_pretrained:
ui_display_name: Use Pretrained
saved_weights_in_checkpoint:
internal_only: true
ui_display_name: Saved Weights in Checkpoint
trainable:
ui_display_name: Trainable
TimmCAFormerEncoder:
model_name:
ui_display_name: Model Name
TimmConvFormerEncoder:
model_name:
ui_display_name: Model Name
TimmPoolFormerEncoder:
model_name:
ui_display_name: Model Name
================================================
FILE: ludwig/schema/metadata/configs/features.yaml
================================================
audio:
preprocessing:
audio_file_length_limit_in_s:
ui_display_name: null
expected_impact: 2
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
ui_display_name: Fill Value
expected_impact: 2
in_memory:
ui_display_name: null
expected_impact: 1
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
expected_impact: 3
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
norm:
default_value_reasoning:
While batch normalization and layer normalization
usually lead to improvements, it can be useful to start with fewer
bells and whistles.
description_implications:
Normalization helps stabilize the learning process
and can have a regularizing effect that can help with generalization.
It's often suggested that with normalization, you can use a higher
learning rate.
example_value:
- batch
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
related_parameters:
- norm_params
suggested_values: '"batch" or "layer"'
suggested_values_reasoning:
Normalization tries to solve "internal covariate
shift" that comes from the changing distributions of the inputs to
layers deep in the network when weights are updated. For example,
batch normalization standardizes the inputs to a layer for each mini-batch.
Try out different normalizations to see if that helps with training
stability
ui_display_name: Normalization Type
num_fft_points:
ui_display_name: null
expected_impact: 1
num_filter_bands:
literature_references:
- "https://medium.com/analytics-vidhya/simplifying-audio-data-fft-stft-mfcc-for-machine-learning-and-deep-learning-443a2f962e0e"
related_parameters:
- window_length_in_s
- type
- window_shift_in_s
ui_display_name: Type
expected_impact: 1
padding_value:
ui_display_name: null
expected_impact: 1
type:
default_value_reasoning:
The default type fbank is set based on values
that we have tested and determined to be a good starting point for
audio feature preprocessing. This is not to say that it is the best
way to process every audio feature; it is just a good starting place
that performs well in general.
description_implications:
The type you select here
determines how your audio feature is preprocessed and transformed
into trainable data for the model.
example_value:
- stft
expected_impact: 3
literature_references:
- "https://medium.com/analytics-vidhya/simplifying-audio-data-fft-stft-mfcc-for-machine-learning-and-deep-learning-443a2f962e0e"
other_information:
Audio feature preprocessing depends heavily on the
type of audio data you are dealing with, so choose the preprocessing
type that matches your data.
related_parameters:
- audio_file_length_limit_in_s
- norm
- padding_value
- in_memory
ui_display_name: Type
window_length_in_s:
literature_references:
- "https://medium.com/analytics-vidhya/simplifying-audio-data-fft-stft-mfcc-for-machine-learning-and-deep-learning-443a2f962e0e"
related_parameters:
- window_shift_in_s
- type
- num_filter_bands
ui_display_name: Window Length in Seconds
expected_impact: 2
window_shift_in_s:
literature_references:
- "https://medium.com/analytics-vidhya/simplifying-audio-data-fft-stft-mfcc-for-machine-learning-and-deep-learning-443a2f962e0e"
related_parameters:
- window_length_in_s
- type
- num_filter_bands
ui_display_name: Window Shift in Seconds
expected_impact: 2
window_type:
ui_display_name: null
expected_impact: 2
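# Illustrative sketch, not part of the metadata: assuming an audio input
# feature hypothetically named `clip`, the related preprocessing parameters
# above (type, window_length_in_s, window_shift_in_s, num_filter_bands)
# might be set together in a Ludwig config as follows. The numeric values
# are placeholders chosen for illustration, not tested recommendations.
#
#   input_features:
#     - name: clip
#       type: audio
#       preprocessing:
#         type: fbank
#         window_length_in_s: 0.04
#         window_shift_in_s: 0.02
#         num_filter_bands: 80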
bag:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
ui_display_name: Fill Value
expected_impact: 2
lowercase:
ui_display_name: null
expected_impact: 2
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
expected_impact: 3
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
most_common:
default_value_reasoning:
If there are more than 10000 unique categories
in the data, it is likely that they will follow a long-tailed distribution
and the least common ones may not provide a lot of information
description_implications:
A smaller number will reduce the vocabulary,
making the embedding matrix smaller and reducing the memory footprint,
but will also collapse more tokens into the rare one, so the model
may perform worse when rare tokens appear in the data
example_value:
- 10000
expected_impact: 2
other_information: Specifying a vocab_file overrides this parameter
related_parameters:
- vocab_file
- pretrained_embeddings
suggested_values:
A value that covers at least 95% of the tokens in the
data
suggested_values_reasoning:
Depending on the data distribution and how
important rare tokens are, a vocabulary covering 90%, 95% or 99% of the tokens
will leave out only very rare tokens that should not influence performance
substantially
ui_display_name: Most common (vocabulary size)
tokenizer:
ui_display_name: null
expected_impact: 3
binary:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fallback_true_label:
description_implications:
Modeling performance should not be affected,
but the semantics of some binary metrics, such as "false positives"
and "false negatives", may change if the true label is pinned to
the other value.
expected_impact: 2
ui_display_name: Fallback True Label
fill_value:
expected_impact: 2
ui_display_name: Fill Value
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
calibration:
expected_impact: 3
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
threshold:
expected_impact: 3
category:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
lowercase:
ui_display_name: null
expected_impact: 2
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
most_common:
default_value_reasoning:
If there are more than 10000 unique categories
in the data, it is likely that they will follow a long-tailed distribution
and the least common ones may not provide a lot of information
description_implications:
A smaller number will reduce the vocabulary,
making the embedding matrix smaller and reducing the memory footprint,
but will also collapse more tokens into the rare one, so the model
may perform worse when rare tokens appear in the data
example_value:
- 10000
expected_impact: 2
other_information: Specifying a vocab_file overrides this parameter
related_parameters:
- vocab_file
- pretrained_embeddings
suggested_values:
A value that covers at least 95% of the tokens in the
data
suggested_values_reasoning:
Depending on the data distribution and how
important rare tokens are, a vocabulary covering 90%, 95% or 99% of the tokens
will leave out only very rare tokens that should not influence performance
substantially
ui_display_name: Most common (vocabulary size)
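# Illustrative sketch, not part of the metadata: assuming a category input
# feature hypothetically named `product_id`, the vocabulary cap described
# above might be set in a Ludwig config as follows. 10000 is the
# example_value listed above; per the description, categories outside the
# most common ones are collapsed into the rare token.
#
#   input_features:
#     - name: product_id
#       type: category
#       preprocessing:
#         most_common: 10000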
calibration:
expected_impact: 3
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
top_k:
expected_impact: 3
date:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
datetime_format:
default_value_reasoning:
Ludwig will try to infer the date format automatically,
but a specific format can be provided. The date string spec is the
same as the one described in Python's datetime module.
description_implications:
If Ludwig has trouble parsing dates, it could
be useful to specify an explicit format for parsing date
feature values. This could also serve as a form of normalization,
for example, if not all datetimes have the same granularity (some
have days, some have times), then a common format (e.g. %d %m %Y)
serves as a truncator.
example_value:
- "%d %b %Y"
expected_impact: 2
suggested_values_reasoning: Have Ludwig figure out the date format automatically.
ui_display_name: Datetime format
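# Illustrative sketch, not part of the metadata: assuming a date input
# feature hypothetically named `signup_date`, an explicit format might be
# supplied in a Ludwig config as follows. The format string is the
# example_value listed above; under Python's datetime format codes it
# matches strings such as "25 Dec 2023" (day, abbreviated month, 4-digit year).
#
#   input_features:
#     - name: signup_date
#       type: date
#       preprocessing:
#         datetime_format: "%d %b %Y"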
fill_value:
expected_impact: 2
ui_display_name: Fill Value
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
h3:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
image:
# TODO: review metadata generated by Copilot
augmentation:
auto_augmentation_method:
default_value_reasoning: Trivial augment is computationally more efficient than the other methods.
description_implications:
Type of auto-augmentation method to apply to a batch of images to improve model generalization
example_value:
"trivial_augment"
expected_impact: 1
ui_display_name: Auto Augmentation Method
max_brightness:
default_value_reasoning: The default value of 3.0.
description_implications:
The maximum factor by which the brightness of
the image will be randomly changed.
example_value:
- 3.9
expected_impact: 1
ui_display_name: Maximum Brightness
min_brightness:
default_value_reasoning: The default value of 0.1.
description_implications:
The minimum brightness factor to apply to the
image.
example_value:
- 0.5
expected_impact: 1
ui_display_name: Minimum Brightness
max_contrast:
default_value_reasoning: The default value of 3.0
description_implications:
The maximum factor by which the contrast of
the image will be randomly changed.
example_value:
- 3.0
expected_impact: 1
ui_display_name: Maximum Contrast
min_contrast:
default_value_reasoning: The default value of 0.1.
description_implications:
The minimum contrast factor to apply to the
image.
example_value:
- 0.1
expected_impact: 1
ui_display_name: Minimum Contrast
kernel_size:
default_value_reasoning: The default value is 3.
description_implications: The kernel size is the size of the filter
matrix. A larger kernel size will result in a blurrier image, while a
smaller kernel size will result in less blurring.
example_value:
- 3
expected_impact: 2
suggested_values:
- 3
- 5
- 7
suggested_values_reasoning:
The default value is 3, which is a common
value for image processing
ui_display_name: Kernel Size
rotation_degree:
default_value_reasoning: The default value of 15 means that the
image will be randomly rotated between -15 and +15 degrees.
description_implications: The degree of rotation to apply to the image.
expected_impact: 1
ui_display_name: Rotation Degree
type:
description_implications: The type of augmentation to perform on the
image.
expected_impact: 1
ui_display_name: Type
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
height:
ui_display_name: null
expected_impact: 2
in_memory:
ui_display_name: null
expected_impact: 1
infer_image_dimensions:
ui_display_name: null
expected_impact: 1
infer_image_max_height:
ui_display_name: null
expected_impact: 1
infer_image_max_width:
ui_display_name: null
expected_impact: 1
infer_image_num_channels:
ui_display_name: null
expected_impact: 1
infer_image_sample_size:
ui_display_name: null
expected_impact: 1
infer_image_num_classes:
ui_display_name: null
expected_impact: 1
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
num_channels:
ui_display_name: null
expected_impact: 2
num_classes:
ui_display_name: null
expected_impact: 2
num_processes:
ui_display_name: null
expected_impact: 2
resize_method:
default_value_reasoning:
Interpolation may stretch or squish the image,
but it does not remove content or change the statistical distribution
of image values so it is more appropriate for most tasks.
description_implications:
"interpolation will not change the content of
the image, but it will change the aspect ratio.
crop_or_pad will preserve the aspect ratio of the image, but may remove
some content (in the case of cropping)."
expected_impact: 1
related_parameters:
- height, width
ui_display_name: Resize Method
standardize_image:
ui_display_name: null
expected_impact: 1
width:
ui_display_name: null
expected_impact: 2
requires_equal_dimensions:
ui_display_name: null
expected_impact: 1
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
number:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
computed_outlier_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
outlier_strategy:
default_value_reasoning:
Outlier definitions and how to handle them are very task-specific, so we leave
this feature disabled by default and ask the user to choose the strategy that works best for them.
description_implications:
Determines how outliers will be handled in the dataset. In most cases replacing outliers with the
column mean (`fill_with_mean`) will be sufficient, but in others the outliers may be damaging enough
to merit dropping the entire row of data (`drop_row`). In some cases, the best way to handle outliers
is to leave them in the data, which is the behavior when this parameter is left as `null`.
related_parameters:
- outlier_threshold
suggested_values: fill_with_mean
ui_display_name: Outlier Strategy
expected_impact: 3
outlier_threshold:
default_value_reasoning:
The definition of an outlier is often dataset and task dependent, but 2 or 3 standard deviations from
the mean is a common heuristic.
description_implications:
"Determines the threshold past which a number will be considered an outlier in the dataset. The 3-sigma
rule in statistics tells us that when data is normally distributed, 95% of the data will lie within 2
standard deviations of the mean, and greater than 99% of the data will lie within 3 standard deviations
of the mean (see: 68–95–99.7 rule). As such anything farther away than that is highly likely to be an
outlier, and may distort the learning process by disproportionately affecting the model."
related_parameters:
- outlier_strategy
literature_references:
- https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
suggested_values: 2 - 3
ui_display_name: Outlier Threshold
expected_impact: 2
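# Illustrative sketch (not part of the original metadata): how outlier_strategy and
# outlier_threshold might appear together in a number feature's preprocessing section.
# The feature name `price` is hypothetical, and the threshold of 3.0 simply follows the
# 3-standard-deviation heuristic described above.
#
# input_features:
#   - name: price
#     type: number
#     preprocessing:
#       outlier_strategy: fill_with_mean
#       outlier_threshold: 3.0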
normalization:
default_value_reasoning:
Z-score normalization helps improve the training stability and convergence
of neural networks by rescaling the numeric input features to have a mean
of 0 and a standard deviation of 1, so that features with very different
ranges contribute on a comparable scale.
description_implications:
The goal of normalization is to transform features
to be on a similar scale. Normalization can be a form of feature smoothing
that improves the performance and training stability of the model.
Normalizations may result in different effects on the semantics of
your number features. The best normalization technique is one that
empirically works well, so try new ideas if you think they'll work
well on your feature distribution.
expected_impact: 3
literature_references:
- https://developers.google.com/machine-learning/data-prep/transform/normalization
suggested_values: zscore
suggested_values_reasoning:
"Z-score is a variation of scaling that represents\
\ the number of standard deviations away from the mean. You would\
\ use z-score to ensure your feature distributions have mean = 0 and\
\ std = 1. It\u2019s useful when there are a few outliers, but not\
\ so extreme that you need clipping."
ui_display_name: Normalization
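# Worked example (illustrative, assuming the usual definition of the z-score): each value x
# is mapped to
#   z = (x - mean) / std
# For a hypothetical column with mean 50 and standard deviation 10, the value 70 becomes
#   (70 - 50) / 10 = 2.0
# i.e. two standard deviations above the mean.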
clip:
expected_impact: 2
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
sequence:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
lowercase:
ui_display_name: null
expected_impact: 2
sequence_length:
default_value_reasoning:
The default value is `None`, which means that the sequence length will be inferred from the dataset.
This may save you compute resources on datasets with short sequence samples.
description_implications:
A larger sequence length keeps more information
from the data, but also makes it more computationally expensive (more
memory and longer training time). A smaller sequence length keeps
less information from the data, but also makes it less computationally
expensive (less memory and shorter training time).
expected_impact: 3
related_parameters:
- max_sequence_length
suggested_values:
If tying the weights of multiple sequence encoders together,
this parameter may need to be set to ensure that all sequence features have the same sequence length.
ui_display_name: Sequence Length
max_sequence_length:
default_value_reasoning:
The default value is 256. Every sequence will
be truncated to this length.
description_implications:
A larger sequence length keeps more information
from the data, but also makes it more computationally expensive (more
memory and longer training time). A smaller sequence length keeps
less information from the data, but also makes it less computationally
expensive (less memory and shorter training time).
expected_impact: 3
related_parameters:
- vocab_size, embedding_size
suggested_values:
Use the lowest value that covers most of your input
data. Only increase the value if crucial parts of the input data are
truncated.
ui_display_name: Maximum Sequence Length
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
most_common:
default_value_reasoning:
If there are more than 10000 unique categories
in the data, it is likely that they will follow a long-tailed distribution
and the least common ones may not provide a lot of information
description_implications:
A smaller number will reduce the vocabulary,
making the embedding matrix smaller and reducing the memory footprint,
but will also collapse more tokens into the unknown token, so the model
may perform worse when rare tokens appear in the data
example_value:
- 10000
expected_impact: 2
other_information: Specifying a vocab_file overrides this parameter
related_parameters:
- vocab_file, pretrained_embeddings
suggested_values:
A value that covers at least 95% of the tokens in the
data
suggested_values_reasoning:
Depending on the data distribution and how
important rare tokens are, 90%, 95% or 99% of the number of tokens
will leave out only very rare tokens that should not influence performance
substantially
ui_display_name: Most common (vocabulary size)
ngram_size:
default_value_reasoning: Size of the n-gram when using the `ngram` tokenizer.
example_value:
- 3
ui_display_name: n-gram size
expected_impact: 2
padding:
ui_display_name: null
expected_impact: 1
padding_symbol:
ui_display_name: null
expected_impact: 1
tokenizer:
ui_display_name: null
expected_impact: 3
unknown_symbol:
ui_display_name: null
expected_impact: 1
vocab_file:
default_value_reasoning:
The vocabulary can be parsed automatically from
the incoming input features.
description_implications:
It can be useful to specify your own vocabulary
list if the vocabulary is very large, there's no out-of-the-box tokenizer
that fits your data, or if there are several uncommon or infrequently
occurring tokens that you want to guarantee are part of the vocabulary,
rather than treated as unknown.
expected_impact: 0
ui_display_name: Vocab File
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
set:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
lowercase:
ui_display_name: null
expected_impact: 2
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
expected_impact: 3
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
most_common:
default_value_reasoning:
If there are more than 10000 unique categories
in the data, it is likely that they will follow a long-tailed distribution
and the least common ones may not provide a lot of information
description_implications:
A smaller number will reduce the vocabulary,
making the embedding matrix smaller and reducing the memory footprint,
but will also collapse more tokens into the unknown token, so the model
may perform worse when rare tokens appear in the data
example_value:
- 10000
expected_impact: 2
other_information: Specifying a vocab_file overrides this parameter
related_parameters:
- vocab_file, pretrained_embeddings
suggested_values:
A value that covers at least 95% of the tokens in the
data
suggested_values_reasoning:
Depending on the data distribution and how
important rare tokens are, 90%, 95% or 99% of the number of tokens
will leave out only very rare tokens that should not influence performance
substantially
ui_display_name: Most common (vocabulary size)
tokenizer:
ui_display_name: null
expected_impact: 3
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
threshold:
expected_impact: 3
text:
preprocessing:
computed_fill_value:
example_value:
- Depends on dtype
internal_only: true
related_parameters:
- missing_value_strategy, fill_value
ui_display_name: DOCSTRING ONLY
fill_value:
expected_impact: 2
ui_display_name: Fill Value
lowercase:
default_value_reasoning:
Reading the text in lowercase enables the model
to treat capitalized and lowercase words as the same, effectively
increasing the number of data points per word.
description_implications:
If you set lowercase to False, then capitalized
words are seen as completely separate entities from lowercase words.
example_value:
- true
expected_impact: 2
related_parameters:
- vocab_size
suggested_values: "TRUE"
suggested_values_reasoning:
If there is a strong reason to treat capitalized
words and lowercased words differently, then set this to False. Otherwise,
it is preferable to bucket the words and make the model case-insensitive.
ui_display_name: Convert to lowercase
sequence_length:
default_value_reasoning:
The default value is `None`, which means that the sequence length will be inferred from the dataset.
This may save you compute resources on datasets with short text samples.
description_implications:
A larger sequence length keeps more information
from the data, but also makes it more computationally expensive (more
memory and longer training time). A smaller sequence length keeps
less information from the data, but also makes it less computationally
expensive (less memory and shorter training time).
expected_impact: 3
related_parameters:
- max_sequence_length
suggested_values:
If tying the weights of multiple text encoders together,
this parameter may need to be set to ensure that all text features have the same sequence length.
ui_display_name: Sequence Length
max_sequence_length:
default_value_reasoning:
The default value is 256. Every sequence will
be truncated to this length.
description_implications:
A larger sequence length keeps more information
from the data, but also makes it more computationally expensive (more
memory and longer training time). A smaller sequence length keeps
less information from the data, but also makes it less computationally
expensive (less memory and shorter training time).
expected_impact: 3
related_parameters:
- vocab_size, embedding_size
suggested_values:
Use the lowest value that covers most of your input
data. Only increase the value if crucial parts of the input data are
truncated.
ui_display_name: Maximum Sequence Length
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
most_common:
default_value_reasoning:
If there are more than 10000 unique categories
in the data, it is likely that they will follow a long-tailed distribution
and the least common ones may not provide a lot of information
description_implications:
A smaller number will reduce the vocabulary,
making the embedding matrix smaller and reducing the memory footprint,
but will also collapse more tokens into the unknown token, so the model
may perform worse when rare tokens appear in the data
example_value:
- 10000
expected_impact: 2
other_information: Specifying a vocab_file overrides this parameter
related_parameters:
- vocab_file, pretrained_embeddings
suggested_values:
A value that covers at least 95% of the tokens in the
data
suggested_values_reasoning:
Depending on the data distribution and how
important rare tokens are, 90%, 95% or 99% of the number of tokens
will leave out only very rare tokens that should not influence performance
substantially
ui_display_name: Most common (vocabulary size)
ngram_size:
default_value_reasoning: Size of the n-gram when using the `ngram` tokenizer.
example_value:
- 3
ui_display_name: n-gram size
expected_impact: 2
padding:
default_value_reasoning:
We usually want to add padding to the end of
a text sequence to fill in any remaining space, as opposed to the beginning,
so we set the default to right.
description_implications:
If you pad to the left, the encoded vector will
have leading padding tokens as opposed to trailing padding tokens.
This could matter based on the type of text input you are expecting.
expected_impact: 1
related_parameters:
- "padding_symbol,
max_sequence_length"
suggested_values: "'right'"
suggested_values_reasoning:
Right padding is the usual way to add padding
to a text sequence.
ui_display_name: Padding
padding_symbol:
ui_display_name: null
expected_impact: 1
pretrained_model_name_or_path:
internal_only: true
ui_display_name: null
expected_impact: 0
tokenizer:
default_value_reasoning:
'The default tokenizer is `space_punct`, an abbreviation
of "Space punctuation". This tokenizer creates sub-words by dividing
the text on whitespace and punctuation characters. For example: The
text `''hello world!isn''t this great?''` would be transformed to
`[''hello'', ''world'', ''!'', ''isn'', "''", ''t'', ''this'', ''great'',
''?'']`. This is the default value because it is a fast tokenizer
that works reasonably well.'
description_implications:
Choosing a tokenizer can be difficult. The primary
thing to check is that the tokenizer you have selected is compatible
with the language(s) in your text data. This means either selecting
a tokenizer that is language-specific (e.g. `french_tokenize` if working
with French text) or general enough that its tokenizations are language-agnostic
(e.g. `space_punct`).
example_value:
- space_punct
expected_impact: 3
literature_references:
- https://huggingface.co/course/chapter2/4?fw=pt
related_parameters:
- vocab_file, pretrained_model_name_or_path
suggested_values: sentencepiece
suggested_values_reasoning:
"SentencePiece is a tokenizer developed by
Google which utilizes Byte-Pair Encoding (BPE), which strikes a good
balance between character-level and word-level tokenization (more
info on BPE here: https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10
). This tokenizer is language-agnostic and more sophisticated than
the default."
ui_display_name: Tokenizer
unknown_symbol:
ui_display_name: null
expected_impact: 1
vocab_file:
default_value_reasoning:
The vocabulary can be parsed automatically from
the incoming input features.
description_implications:
It can be useful to specify your own vocabulary
list if the vocabulary is very large, there's no out-of-the-box tokenizer
that fits your data, or if there are several uncommon or infrequently
occurring tokens that you want to guarantee are part of the vocabulary,
rather than treated as unknown.
expected_impact: 0
ui_display_name: Vocab File
class_similarities:
expected_impact: 1
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
timeseries:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
expected_impact: 3
padding:
ui_display_name: null
expected_impact: 1
padding_value:
ui_display_name: null
expected_impact: 1
timeseries_length_limit:
ui_display_name: null
expected_impact: 2
tokenizer:
ui_display_name: null
expected_impact: 3
vector:
preprocessing:
computed_fill_value:
internal_only: true
ui_display_name: null
fill_value:
expected_impact: 2
ui_display_name: Fill Value
missing_value_strategy:
default_value_reasoning:
The default `fill_with_const` replaces missing
values with the value specified by `fill_value`.
description_implications:
Determines how missing values will be handled
in the dataset. Not all strategies are valid for all datatypes. For
example, `fill_with_mean` is applicable to continuous numerical data.
Note that choosing to drop rows with missing values could result in
losing information, especially if there is a high proportion of missing
values in the dataset.
expected_impact: 3
related_parameters:
- fill_value
ui_display_name: Missing Value Strategy
vector_size:
ui_display_name: null
expected_impact: 3
dependencies:
expected_impact: 1
reduce_dependencies:
expected_impact: 1
reduce_input:
expected_impact: 1
softmax:
expected_impact: 3
vector_size:
expected_impact: 3
================================================
FILE: ludwig/schema/metadata/configs/llm.yaml
================================================
base_model:
_anyOf:
preset:
ui_display_name: Preset
expected_impact: 3
custom:
ui_display_name: Custom
expected_impact: 3
_meta:
ui_display_name: Model Name
expected_impact: 3
ui_component_type: radio_string_combined
short_description: This can be one of the presets or a fully qualified name of a pretrained model from the HuggingFace Hub
generation:
temperature:
ui_display_name: Temperature
default_value_reasoning:
Increasing the temperature will allow the model to generate more diverse sequences,
but will also increase the likelihood of generating nonsense. As such, we recommend setting this value to
something closer to 0 for classification tasks, and something closer to 1 for text generation tasks where the
goal is to generate novel text.
expected_impact: 3
max_new_tokens:
ui_display_name: Max New Tokens
default_value_reasoning:
Increasing the maximum number of new tokens will allow the model
to generate longer sequences, but because inference time scales linearly with the sequence length,
longer sequences will be much slower to generate. For classification tasks, it's generally better to
use a smaller number of new tokens, while for text generation tasks, it's generally better to use a larger
number of new tokens.
expected_impact: 3
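# Illustrative sketch (values are assumptions, not recommendations from this file): a
# generation section for a classification-style task, following the guidance above of a
# temperature close to 0 and a small number of new tokens.
#
# generation:
#   temperature: 0.1
#   max_new_tokens: 8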
num_beams:
ui_display_name: Number of Beams
default_value_reasoning:
Increasing the number of beams lets the model explore more candidate sequences at each step,
which can improve the quality of the output but will also increase inference time. Some backends
(like DeepSpeed) also do not support beam search. As such, we recommend leaving this as 1 in most
cases, unless you're finding the quality of the generated sequences to be lacking.
expected_impact: 2
top_k:
ui_display_name: Top K
expected_impact: 2
top_p:
ui_display_name: Top P
expected_impact: 2
max_length:
ui_display_name: Max Length
expected_impact: 2
min_length:
ui_display_name: Min Length
expected_impact: 2
min_new_tokens:
ui_display_name: Min New Tokens
expected_impact: 2
do_sample:
ui_display_name: Do Sample
expected_impact: 2
use_cache:
ui_display_name: Use Cache
expected_impact: 2
prompt_lookup_num_tokens:
ui_display_name: Prompt Lookup Num Tokens
expected_impact: 2
prompt:
retrieval:
type:
ui_display_name: Type
expected_impact: 3
index_name:
ui_display_name: Index Name
expected_impact: 2
model_name:
ui_display_name: Model Name
expected_impact: 2
k:
ui_display_name: Top K
expected_impact: 2
task:
ui_display_name: Task
ui_component_type: textarea
expected_impact: 3
template:
ui_display_name: Template
ui_component_type: textarea
expected_impact: 3
adapter:
_oneOf:
allOf:
ui_display_name: Perform parameter efficient fine-tuning
expected_impact: 3
none:
ui_display_name: Disabled
expected_impact: 3
_meta:
expected_impact: 3
ui_component_type: radio_string_combined
lora:
type:
long_description: |
LoRA is a simple, yet effective, method for parameter-efficient fine-tuning of pretrained language models.
It works by adding a small number of trainable low-rank parameters to the model, which are used to adapt the
pretrained parameters to the downstream task while the original weights stay frozen. This allows the model to
be fine-tuned with far less memory and compute than full fine-tuning.
r:
ui_display_name: R
expected_impact: 3
alpha:
ui_display_name: Alpha
expected_impact: 1
dropout:
ui_display_name: Dropout
expected_impact: 2
target_modules:
ui_display_name: Target Modules
expected_impact: 2
use_rslora:
ui_display_name: Enable RSLora
expected_impact: 2
use_dora:
ui_display_name: Enable DoRa
expected_impact: 2
adalora:
type:
long_description: |
AdaLoRA is an extension of LoRA that adaptively allocates the budget of trainable parameters across weight
matrices according to their importance for the downstream task. As in LoRA, only a small number of trainable
parameters are added to the model while the pretrained weights stay frozen, so the model can be fine-tuned
with far less memory and compute than full fine-tuning.
prompt_learning:
num_virtual_tokens:
ui_display_name: Num Virtual Tokens
expected_impact: 3
prompt_tuning:
prompt_tuning_init:
ui_display_name: Prompt Tuning Init
expected_impact: 2
prompt_tuning_init_text:
ui_display_name: Prompt Tuning Init Text
expected_impact: 2
adaption_prompt:
type:
long_description: |
Adaption Prompt is taken from the paper
[LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention](https://arxiv.org/pdf/2303.16199.pdf).
It adds a set of learnable adaption prompts and prepends them to the word tokens at higher transformer layers.
Then, a zero-initialized attention mechanism with zero gating is introduced, which adaptively injects
new instructional cues into LLaMA, while effectively preserving its pre-trained knowledge. According to
the paper, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned
7B parameters.
adapter_len:
ui_display_name: Adapter Length
expected_impact: 3
adapter_layers:
ui_display_name: Adapter Layers
expected_impact: 3
ia3:
type:
long_description: |
[Infused Adapter by Inhibiting and Amplifying Inner Activations](https://arxiv.org/pdf/2205.05638.pdf), or IA3,
is a method that adds three learned vectors `l_k`, `l_v`, and `l_ff`, to rescale the keys and values of the self-attention and encoder-decoder attention layers, and the intermediate activation of the position-wise feed-forward network respectively. These learned vectors are the only trainable parameters during fine-tuning, and thus the original weights remain frozen. Dealing with learned vectors (as opposed to learned low-rank updates to a weight matrix like LoRA) keeps the number of trainable parameters much smaller.
target_modules:
ui_display_name: Target Modules
expected_impact: 3
feedforward_modules:
ui_display_name: Feedforward Modules
expected_impact: 3
fan_in_fan_out:
ui_display_name: Fan In Fan Out
expected_impact: 3
modules_to_save:
ui_display_name: Modules to Save
expected_impact: 3
init_ia3_weights:
ui_display_name: Init IA3 Weights
expected_impact: 3
quantization:
_oneOf:
object:
ui_display_name: Quantization
expected_impact: 3
none:
ui_display_name: No Quantization
expected_impact: 3
_meta:
expected_impact: 3
ui_component_type: radio_string_combined
bits:
ui_display_name: Bits per parameter
expected_impact: 3
================================================
FILE: ludwig/schema/metadata/configs/loss.yaml
================================================
MSELoss:
weight:
expected_impact: 2
MAELoss:
weight:
expected_impact: 2
RMSELoss:
weight:
expected_impact: 2
RMSPELoss:
weight:
expected_impact: 2
BWCEWLoss:
positive_class_weight:
expected_impact: 3
robust_lambda:
expected_impact: 2
confidence_penalty:
expected_impact: 2
weight:
expected_impact: 2
SoftmaxCrossEntropyLoss:
class_weights:
expected_impact: 3
robust_lambda:
expected_impact: 2
confidence_penalty:
expected_impact: 2
class_similarities:
expected_impact: 2
class_similarities_temperature:
expected_impact: 2
weight:
expected_impact: 2
SequenceSoftmaxCrossEntropyLoss:
class_weights:
expected_impact: 3
robust_lambda:
expected_impact: 2
confidence_penalty:
expected_impact: 2
class_similarities:
expected_impact: 2
class_similarities_temperature:
expected_impact: 2
weight:
expected_impact: 2
unique:
expected_impact: 2
SigmoidCrossEntropyLoss:
class_weights:
expected_impact: 3
weight:
expected_impact: 2
================================================
FILE: ludwig/schema/metadata/configs/optimizers.yaml
================================================
gradient_clipping:
default_value_reasoning:
A conservative cap on the maximum gradient size to apply
over a single training step.
description_implications:
Gradient clipping is a technique to prevent exploding
gradients in very deep networks. A tighter clipping threshold can help with
the stability of the training loss curve, but it can also make training less efficient,
as the size of the weight update at each training step is capped.
expected_impact: 1
suggested_values_reasoning:
It's usually sensible to have some conservative notion
of gradient clipping to make modeling robust to a particularly bad or noisy
batch of examples.
ui_display_name: Gradient Clipping
momentum:
expected_impact: 1
weight_decay:
expected_impact: 1
dampening:
expected_impact: 1
nesterov:
expected_impact: 1
max_iter:
expected_impact: 1
max_eval:
expected_impact: 1
tolerance_grad:
expected_impact: 1
tolerance_change:
expected_impact: 1
history_size:
expected_impact: 1
line_search_fn:
expected_impact: 1
betas:
expected_impact: 1
amsgrad:
expected_impact: 1
rho:
expected_impact: 1
initial_accumulator_value:
expected_impact: 1
lr_decay:
expected_impact: 1
learning_rate_power:
expected_impact: 1
l1_regularization_strength:
expected_impact: 1
l2_regularization_strength:
expected_impact: 1
momentum_decay:
expected_impact: 1
alpha:
expected_impact: 1
eps:
expected_impact: 1
centered:
expected_impact: 1
================================================
FILE: ludwig/schema/metadata/configs/preprocessing.yaml
================================================
force_split:
default_value_reasoning:
We do not expect most datasets to have an explicit "split"
column in the data. Used mostly internally by Ludwig datasets.
expected_impact: 3
related_parameters:
- split_probabilities, stratify
ui_display_name: Force Split
oversample_minority:
default_value_reasoning:
We do not want to randomly oversample by default since
this is a strategy to deal with imbalanced datasets, but can cause issues
if not implemented correctly.
description_implications:
The closer the value you choose is to 1, the closer
you will be to having a balanced class ratio (i.e. 1:1 positive to negative
class). However, this can lead to overfitting when oversampling
is used too liberally. As a rule of thumb, starting with a very
conservative amount of oversampling and increasing it in small increments is probably the
best way to improve your model without overfitting.
example_value:
- 0.5
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
other_information:
This parameter is one of many strategies to combat issues with
class imbalance, though it is not a cure-all. Oversampling too much can cause
overfitting, which can adversely affect your model, so use it with caution.
suggested_values: Depends on imbalance ratio and dataset size
ui_display_name: Oversample Minority
sample_ratio:
default_value_reasoning:
The default value is 1.0 because we do not want to shrink
the dataset by default. In the rare occurrences when you do want to downsample
the entire dataset, this parameter is available, but it is not enabled
by default, hence a default value of 1.0
description_implications:
Decreases the amount of data you are inputting into
the model. Could be useful if you have more data than you need and you are
concerned with computational costs.
example_value:
- 0.8
expected_impact: 2
suggested_values: Depends on data size
ui_display_name: Sample Ratio
sample_size:
default_value_reasoning:
The default value is None because we do not want to shrink
the dataset by default, and we do not know the size of an arbitrary dataset.
By setting the default to None, we fall back on the sample_ratio to determine
the size of the dataset.
description_implications:
Decreases the amount of data you are inputting into
the model. Could be useful if you have more data than you need and you are
concerned with computational costs. More useful than sample_ratio if you
know the exact number of samples you want to train on instead of knowing the proportion.
example_value:
- 1000
expected_impact: 2
suggested_values: Depends on data size
ui_display_name: Sample Size
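# Worked example (illustrative, using the example values above on a hypothetical dataset of
# 10,000 rows):
#   sample_ratio: 0.8   keeps roughly 0.8 * 10000 = 8000 rows
#   sample_size: 1000   keeps 1000 rows, whatever the size of the full dataset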
column:
expected_impact: 3
ui_display_name: Split Column
ui_component_type: column_selector
split_probabilities:
default_value_reasoning:
Most of the dataset should be used for training, with
some portion held out for validation and testing.
description_implications:
"In machine learning, data splitting is typically done
to detect and avoid overfitting. Overfitting occurs when a machine learning model fits
its training data too well and fails to reliably fit additional data.
The training set is the portion of data used to train the model. The model
should observe and learn from the training set, optimizing any of its parameters.
The dev set is a set of examples used to tune learning process parameters.
It is also called the cross-validation or model validation set. This set of
data has the goal of ranking the model's accuracy and can help with model
selection.
The testing set is the portion of data used to test the final model. It is
not seen during training or tuning, so it acts as an
evaluation of the final model and algorithm."
expected_impact: 3
literature_references:
- "https://www.techtarget.com/searchenterpriseai/definition/data-splitting#:~:text=Data%20splitting%20is%20when%20data,creating%20models%20based%20on%20data. "
other_information: "Split data into train, validation, and test.
By default, Ludwig looks for a column named split (case-sensitive) which is
expected to consist of 3 possible values that correspond to different datasets:
0: train
1: validation
2: test
If the data does not contain the split column, then data is randomly split
based on splitting percentages, defined by split_probabilities.
If force_split is true, the split column in the dataset is ignored and
the dataset is randomly split based on splitting percentages, defined by split_probabilities."
related_parameters:
- force_split
- stratify
suggested_values:
- 0.8
- 0.1
- 0.1
suggested_values_reasoning:
For larger datasets, it can be beneficial to use more
data for training, since the test and validation sets are still plenty big
for getting a good sense of model generalization.
ui_display_name: Split Probabilities
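# Illustrative sketch only, not taken from the Ludwig documentation: how a user
# config might request a forced random 80/10/10 split using the parameter names
# described above. The `preprocessing` nesting is an assumption.
#
# preprocessing:
#   force_split: true
#   split_probabilities: [0.8, 0.1, 0.1]
#
# With 1000 rows, this would yield roughly 800 train, 100 validation, and
# 100 test examples; exact counts vary because the split is random.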
stratify:
default_value_reasoning:
The default is set to None since we do not want to stratify
unless specifically told to do so. There are a variety of reasons for this,
but one example is that our data set may not even have a categorical feature
to stratify on.
description_implications: Depends on dataset
example_value:
- Category_Feature_A
expected_impact: 3
literature_references:
- https://medium.com/analytics-vidhya/stratified-sampling-in-machine-learning-f5112b5b9cfe
related_parameters:
- force_split
- split_probabilities
suggested_values: Depends on dataset
ui_display_name: Stratify
undersample_majority:
default_value_reasoning:
We do not want to randomly undersample by default since
this is a strategy to deal with imbalanced datasets, but can cause issues
if not implemented correctly.
description_implications:
The closer the value you choose gets to 1, the closer
you will be to a balanced class ratio (i.e. 1:1 positive to negative
class). However, this can lead to problems of data loss when undersampling
is used too liberally. As a rule of thumb, starting undersampling with a very
conservative approach and increasing in small increments is probably the
best way to improve your model without experiencing catastrophic data loss
effects.
example_value:
- 0.5
expected_impact: 2
literature_references:
- https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
other_information:
This parameter is one of many strategies to combat issues with
class imbalance, though it is not a cure-all. Undersampling too much can cause
loss of data, which can adversely affect your model, so use with caution.
suggested_values: Depends on imbalance ratio and dataset size
ui_display_name: Undersample Majority
cache_encoder_embeddings:
default_value_reasoning:
Caching encoder embeddings means preprocessed data is not reusable across other model architectures, so
you may not want to enable it even when it is possible.
expected_impact: 1
ui_display_name: Cache Encoder Embeddings
global_max_sequence_length:
expected_impact: 2
ui_display_name: Global Max Sequence Length
description_implications:
Specifically for LLMs. This is the maximum number of tokens going into the model's forward pass during training. Sequences will be truncated to this length after merging the tokens from the input with tokens from the target. If not set, the total length of the merged input and target token sequences will be used.
example_value:
- 512
================================================
FILE: ludwig/schema/metadata/configs/trainer.yaml
================================================
ecd:
effective_batch_size:
commonly_used: true
expected_impact: 2
related_parameters:
- batch_size
suggested_values: auto
ui_display_name: Effective Batch Size
batch_size:
commonly_used: true
default_value_reasoning: Not too big, not too small.
description_implications:
There's conflicting evidence about what batch size to
use. Using a higher batch size will achieve the highest throughput and training
efficiency. However, there's also evidence that depending on other hyperparameters,
a smaller batch size may produce a higher quality model.
Batch size and learning rate are strongly intertwined,
so a commonly adopted strategy to set them is to find the largest batch size
that allows the training process not to run out of memory,
and then find the best learning rate that makes the training converge
with that batch size.
expected_impact: 3
related_parameters:
- eval_batch_size
- learning_rate
suggested_values: auto
suggested_values_reasoning:
Auto batch size will determine the largest batch size that allows
the training process not to run out of memory.
Alternatively, try at least a few different batch sizes to get a
sense of whether and how batch size affects model performance.
ui_display_name: Batch Size
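# Illustrative sketch only: the strategy described above (pick the largest batch
# size that fits in memory, then tune the learning rate for it) might look like
# this in a user config. The `trainer` nesting and the concrete learning rate
# are assumptions chosen for illustration.
#
# trainer:
#   batch_size: auto      # let auto-tuning find the largest size that fits
#   learning_rate: 0.001  # then search around this value for convergence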
bucketing_field:
expected_impact: 1
other_information:
When not null, when creating batches, instead of shuffling
randomly, the length along the last dimension of the matrix of the specified
input feature (i.e. the length of a sequence or text)
is used for bucketing examples and then randomly shuffled examples
from the same bin are sampled. Padding is trimmed to the longest example in
the batch. The specified feature should be either a sequence or text feature
and the encoder encoding it has to be rnn. When used, bucketing improves speed
of rnn encoding up to 1.5x, depending on the length distribution of the inputs.
ui_display_name: Bucketing Field
checkpoints_per_epoch:
default_value_reasoning:
Per-epoch behavior, which scales according to the dataset
size.
description_implications:
"Epoch-based evaluation (using the default: 0) is an
appropriate fit for small datasets that fit in memory and
train quickly. Commonly available tabular datasets fit in this category.
However, this is a poor fit for unstructured datasets, which tend to be much
larger, and train more slowly due to larger models.
It's important to set up evaluation such that you do not wait several hours
before getting a single evaluation result. In general, it is not necessary
for models to train over the entirety of a dataset, nor evaluate over the
entirety of a test set, to produce useful monitoring metrics and signals to
indicate model health.
It is also more engaging and more valuable to ensure a frequent pulse of evaluation
metrics, even if they are partial."
expected_impact: 2
related_parameters:
- train_steps
- steps_per_checkpoint
suggested_values: 2 - 10, for larger datasets
suggested_values_reasoning:
Running evaluation too frequently can be wasteful
while running evaluation not frequently enough can be prohibitively uninformative.
In many large-scale training runs, evaluation is often configured to run on
a sub-epoch time scale, or every few thousand steps.
ui_display_name: Checkpoints per epoch
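# Worked example (hypothetical numbers, for illustration only): with 1,000,000
# training examples and batch_size 128, one epoch is about 7,813 steps.
# Setting checkpoints_per_epoch to 4 would then trigger evaluation roughly
# every 1,950 steps, instead of once every 7,813 steps with the default of 0.
#
# trainer:
#   batch_size: 128
#   checkpoints_per_epoch: 4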
layers_to_freeze_regex:
default_value_reasoning:
By default no layers will be frozen when fine-tuning a pretrained model.
description_implications:
Freezing specific layers can improve a pretrained model's performance in a number
of ways. At a basic level, freezing early layers can prevent overfitting by retaining
more general features (beneficial for small datasets). It can also reduce computational
resource use and lower overall training time due to fewer gradient calculations.
expected_impact: 1
early_stop:
default_value_reasoning:
Deep learning models are prone to overfitting. It's generally
a good policy to set up some early stopping criteria as it's not useful to
have a model train after it's maximized what it can learn. 5 consecutive rounds
of evaluation where there hasn't been any improvement on the validation set
(including chance) is a reasonable policy to start with.
description_implications:
Decreasing this value is a more aggressive policy. Decreasing
early stopping makes model training less forgiving, as the model has less
runway to demonstrate consecutive metric improvements before the training
run is quit. This can be efficient for pruning bad models earlier, but since
the training process is inherently non-deterministic and noisy, sometimes
improvements happen very gradually over a long period of time.
Extending this value leads to longer training times,
but potentially also better final performance.
expected_impact: 3
related_parameters:
- epochs
- train_steps
suggested_values: 5 - 10
suggested_values_reasoning:
There's potentially a lot of randomness in how models
train, but so many consecutive rounds of no improvement is usually a good
indicator that the model converged or overfitted.
ui_display_name: Early Stop
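# Illustrative sketch only: pairing a high epoch ceiling with an early stopping
# policy, as the reasoning above suggests. The `trainer` nesting is an
# assumption; the values come from the suggested ranges in this file.
#
# trainer:
#   epochs: 100    # high ceiling, rarely reached
#   early_stop: 5  # stop after 5 consecutive evaluations without improvement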
epochs:
default_value_reasoning:
A very high training length ceiling. Models will almost
always hit early stopping criteria before hitting a 100-epoch ceiling.
description_implications:
Decreasing this will shorten the overall runway for
training the model.
expected_impact: 3
related_parameters:
- train_steps
suggested_values: 100
suggested_values_reasoning:
Usually it's sensible to leave this very high and
rely on a solid early stopping policy to dictate when the model should stop
training. Some models and hyperparameter configurations require many epochs
through the dataset to converge while others converge before a single epoch
through the data.
ui_display_name: Epochs
eval_batch_size:
default_value_reasoning: Use the same batch size used for training.
description_implications:
By increasing the `eval_batch_size` past the `batch_size`
parameter set value, you allow for more parallelism in the batch evaluation
step and speed up evaluation. For example, if you have to evaluate the model
on a test set of size 1000, it is faster to evaluate two times with two batches
of size 500 as opposed to ten times with ten batches of 100.
Setting this parameter higher without exceeding memory limits
will speed up the model training process overall.
example_value:
- 512
expected_impact: 1
other_information:
Only set eval_batch_size to a level that you can fit
in memory.
related_parameters:
- batch_size
suggested_values:
- 256
- 512
- 1024
suggested_values_reasoning:
By observing memory consumption on training jobs,
you can get a sense of how much extra memory is available for increasing this
value. A good rule of thumb can be experimentally doubling the eval batch
size if you do not have insight into memory usage.
ui_display_name: Evaluation Batch Size
evaluate_training_set:
default_value_reasoning:
It could be useful to monitor evaluation metrics on the
training set to understand convergence.
description_implications:
Running evaluation on the full training set, when your
training set is large, can be a huge computational cost. Turning off training
set evaluation will lead to significant gains in training throughput and efficiency.
For small datasets that train and evaluate quickly, the choice is trivial.
expected_impact: 1
suggested_values: false
suggested_values_reasoning:
Running full-scale evaluation on the full training
set doesn't usually provide any useful information over the validation dataset.
Even with this set to False, continuous training loss metrics are still computed,
so it will still be easy to spot signs of overfitting like when the training-validation
loss curves diverge.
ui_display_name: Evaluate Training Set
gradient_clipping:
default_value_reasoning:
A conservative cap on the maximum gradient size to apply
over a single training step.
description_implications:
Gradient clipping is a technique to prevent exploding
gradients in very deep networks. Increasing gradient clipping can help with
model training loss curve stability, but it can also make training slower
as weights may not be updated as fast.
expected_impact: 2
suggested_values_reasoning:
It's usually sensible to enable gradient clipping to make modeling robust
to particularly bad or noisy batches of examples.
ui_display_name: Gradient Clipping
increase_batch_size_eval_metric:
expected_impact: 1
ui_display_name: "Batch Size Increase: Evaluation Metric"
increase_batch_size_eval_split:
expected_impact: 1
ui_display_name: "Batch Size Increase: Evaluation Split"
increase_batch_size_on_plateau:
expected_impact: 1
ui_display_name: Batch Size Increase On Plateau
increase_batch_size_on_plateau_patience:
expected_impact: 1
ui_display_name: "Batch Size Increase On Plateau: Patience"
increase_batch_size_on_plateau_rate:
expected_impact: 1
ui_display_name: "Batch Size Increase On Plateau: Rate"
learning_rate:
commonly_used: true
default_value_reasoning: Middle of the road learning rate to start with.
description_implications:
The learning rate is a hyperparameter that controls
how much to change the model in response to the estimated error each time
the model weights are updated. Increasing the learning rate may decrease learning
curve stability but also increase learning speed and efficiency, leading to
faster model convergence. Decreasing the learning rate can help stabilize
learning curves at the cost of slower time to convergence.
expected_impact: 3
suggested_values: 0.00001 - 0.1 or auto
related_parameters:
- decay
suggested_values_reasoning:
Tabular models trained from scratch typically use
learning rates around 1e-3 while learning rates for pre-trained models should
be much smaller, typically around 1e-5, which is important to mitigate catastrophic
forgetting. To make the model more robust to any specific choice of learning
rate, consider enabling learning rate decay.
ui_display_name: Learning Rate
learning_rate_scaling:
default_value_reasoning:
Traditionally the learning rate is scaled linearly with
the number of workers to reflect the proportion by which the effective batch
size is increased.
description_implications:
Traditionally the learning rate is scaled linearly with
the number of workers to reflect the proportion by which the effective batch
size is increased. For very large batch sizes, a softer square-root scale
can sometimes lead to better model performance. If the learning rate is hand-tuned
for a given number of workers, setting this value to constant can be used
to disable scale-up.
expected_impact: 1
suggested_values: linear or sqrt
suggested_values_reasoning:
Traditionally the learning rate is scaled linearly
with the number of workers to reflect the proportion by which the effective
batch size is increased. For very large batch sizes, a softer square-root
scale can sometimes lead to better model performance. If the learning rate
is hand-tuned for a given number of workers, setting this value to constant
can be used to disable scale-up.
ui_display_name: Learning Rate Scaling
max_batch_size:
default_value_reasoning: Not typically required.
description_implications:
Value used to manually limit the batch sizes explored
by auto batch size tuning and batch size increasing on plateau.
example_value:
- 1024
expected_impact: 1
related_parameters:
- batch_size
- increase_batch_size_on_plateau
ui_display_name: Max Batch Size
optimizer:
default_value_reasoning:
First try Adam because it has been shown to return good
results without advanced fine-tuning.
description_implications:
"Choosing a good optimizer for your machine learning
project can be overwhelming. Popular deep learning libraries such as PyTorch
or TensorFlow offer a broad selection of different optimizers, each
with its own strengths and weaknesses. However, picking the wrong optimizer
can have a substantial negative impact on the performance of your machine
learning model [1][2]. This makes optimizers a critical design choice in
the process of building, testing, and deploying your machine learning model."
expected_impact: 3
literature_references:
- https://www.youtube.com/watch?v=mdKjMPmcWjY
suggested_values: adam, adamw
suggested_values_reasoning:
"As a rule of thumb: If you have the resources to
find a good learning rate schedule, SGD with momentum is a solid choice. If
you are in need of quick results without extensive hyperparameter tuning,
adaptive gradient methods like adam or adamw are good choices."
ui_display_name: Optimizer
regularization_lambda:
default_value_reasoning:
How to tune the overall impact of the regularization
term by multiplying its value by a scalar known as lambda (also called the
regularization rate).
description_implications:
"When choosing a lambda value, the goal is to strike
the right balance between simplicity and training-data fit:
If your lambda value is too high, your model will be simple, but you run the
risk of underfitting your data. Your model won't learn enough about the training
data to make useful predictions.
If your lambda value is too low, your model will be more complex, and you
run the risk of overfitting your data. Your model will learn too much about
the particularities of the training data, and won't be able to generalize
to new data. The ideal value of lambda produces a model that generalizes well
to new, previously unseen data. Unfortunately, that ideal value of lambda
is data-dependent, so you'll need to do some tuning. We recommend trying
a handful of values (0.001, 0.02, ... 0.4) gradually increasing the value until
training curves get worse"
expected_impact: 2
literature_references:
- "https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/lambda "
related_parameters:
- regularization_type
suggested_values: 0.1
suggested_values_reasoning:
"The most common type of regularization is L2, also
called weight decay, with values often on a logarithmic
scale between 0 and 0.1, such as 0.1, 0.001, 0.0001, etc."
ui_display_name: Regularization Lambda
regularization_type:
default_value_reasoning: L2 is a standard regularization to start with.
description_implications:
"L1 regularization penalizes the sum of absolute values
of the weights, whereas L2 regularization penalizes the sum of squares of
the weights.
The L1 regularization solution is sparse, meaning some weights will be zero,
others will be large.
The L2 regularization solution is non-sparse, most weights will be small.
L2 regularization does not perform feature
selection, since weights are only reduced to values near 0 instead of 0.
L1 regularization implicitly performs feature selection. L1 regularization is more
robust to outliers."
expected_impact: 3
literature_references:
- "https://neptune.ai/blog/fighting-overfitting-with-l1-or-l2-regularization#:~:text=The%20differences%20between%20L1%20and,regularization%20solution%20is%20non%2Dsparse. "
related_parameters:
- regularization_lambda
suggested_values: L2
ui_display_name: Regularization Type
should_shuffle:
default_value_reasoning:
In general, it's a good idea to mix up data on each batch
so that the neural network gets the broadest exposure to the dataset.
description_implications:
Turning off mini-batch shuffling can make training faster,
but it may lead to worse performance overall as shuffling helps mitigate overfitting.
expected_impact: 1
literature_references:
- "https://stats.stackexchange.com/questions/245502/why-should-we-shuffle-data-while-training-a-neural-network#:~:text=it%20helps%20the%20training%20converge,the%20order%20of%20the%20training "
suggested_values: true
suggested_values_reasoning:
One of the most powerful things about neural networks
is that they can be very complex functions, allowing one to learn very complex
relationships between your input and output data. These relationships can
include things you would never expect, such as the order in which data is
fed in per epoch. If the order of data within each epoch is the same, then
the model may use this as a way of reducing the training error, which is a
sort of overfitting.
ui_display_name: Should Shuffle
steps_per_checkpoint:
default_value_reasoning:
By default, we evaluate once per epoch, which scales
according to the dataset size.
description_implications:
"Epoch-based evaluation (using the default: 0) is an
appropriate fit for tabular datasets, which are small, fit in memory, and
train quickly.
However, this is a poor fit for unstructured datasets, which tend to be much
larger, and train more slowly due to larger models.
It's important to set up evaluation such that you do not wait several hours
before getting a single evaluation result. In general, it is not necessary
for models to train over the entirety of a dataset, nor evaluate over the
entirety of a test set, to produce useful monitoring metrics and signals to
indicate model health.
It is also more engaging and more valuable to ensure a frequent pulse of evaluation
metrics, even if they are partial."
expected_impact: 1
related_parameters:
- checkpoints_per_epoch
suggested_values: 1000-10000 for larger datasets
suggested_values_reasoning:
Running evaluation too frequently can be wasteful
while running evaluation not frequently enough can be prohibitively uninformative.
In many large-scale training runs, evaluation is often configured to run on
a sub-epoch time scale, or every few thousand steps.
ui_display_name: Steps Per Checkpoint
train_steps:
default_value_reasoning:
This defaults to `epochs`, which is a very high training
length ceiling. Models will almost always hit early stopping criteria before
reaching the absolute end of the training runway.
description_implications:
Decreasing this parameter will shorten the overall runway for
training the model.
expected_impact: 1
related_parameters:
- epochs
suggested_values: Leave unset, or 1000000, 1 for debugging
suggested_values_reasoning:
Usually it's sensible to leave the value of this parameter very high and
rely on a solid early stopping policy to dictate when the model should stop
training. Some models and hyperparameter configurations require many epochs
through the dataset to converge while others converge before a single epoch
through the data.
ui_display_name: Train Steps
eval_steps:
default_value_reasoning:
The default value is None because we do not want to lower the number of evaluation steps
by default, and we do not know the size of an arbitrary dataset.
By setting the default to None, we simply evaluate on the full evaluation set.
description_implications:
The smaller the value of this parameter, the less time evaluation will take.
expected_impact: 2
suggested_values: Depends on data size and prioritization of quality vs. speed
suggested_values_reasoning:
Normally, evaluation should use the entire evaluation set, and this is
recommended to achieve the highest quality evaluation. However, using
the full evaluation set can be slow, so the value of this parameter should
be set depending on which is more important for the task at hand -- quality
or speed.
ui_display_name: Evaluation Steps
use_mixed_precision:
default_value_reasoning:
Speed up training by using float16 parameters where it
makes sense.
description_implications:
Mixed precision training on GPU can dramatically speed up
training, with some risks to model convergence.
expected_impact: 3
literature_references:
- https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/
suggested_values: false
suggested_values_reasoning:
Suggested to enable this if training is taking too
long on GPU.
ui_display_name: Use Mixed Precision
compile:
default_value_reasoning:
Model compilation has been shown to significantly speed up training by upwards of 20%, but does impose
some delay to compile the model at the beginning of training. This feature is experimental for now,
but may become the default in future versions.
description_implications:
Model compilation on GPU, when used in conjunction with automatic mixed precision, can speed up training
by upwards of 20%.
expected_impact: 3
suggested_values: false
suggested_values_reasoning:
Suggested to enable this if training is taking too
long on GPU.
ui_display_name: Compile
gradient_accumulation_steps:
default_value_reasoning:
Gradient accumulation is something that should be enabled only once it has been observed that either GPU
utilization is low due to low bandwidth between distributed workers, or that there is too much variance
in the training process due to very low batch sizes.
description_implications:
Gradient accumulation is useful to (1) reduce network bandwidth overhead in multi-node distributed training
scenarios where bandwidth is the bottleneck, and (2) train with larger effective batch sizes when the max
batch size the GPU can accommodate is very small. The first scenario occurs when the interconnect between
nodes is slow, so performing gradient synchronization (allreduce) less frequently will speed up training.
The second scenario occurs in cases where the model being trained is very large (e.g., LLM) so training with
a larger batch size will help to smooth out the variance from training with a very small batch size.
expected_impact: 2
suggested_values: false
suggested_values_reasoning:
Suggested to enable this if training is proceeding very slowly in distributed training (and GPU
utilization is low), or the batch size is very small and the loss curves look very spiky.
ui_display_name: Gradient Accumulation Steps
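# Worked example (a sketch under stated assumptions, not a statement of the
# exact implementation): assuming the effective batch size is
#   batch_size * gradient_accumulation_steps * number_of_workers,
# a GPU that only fits batch_size 16 can still train with a larger effective
# batch by accumulating gradients before each optimizer step.
#
# trainer:
#   batch_size: 16
#   gradient_accumulation_steps: 8
#
# On 1 worker this gives an effective batch size of 16 * 8 = 128;
# on 4 workers it would be 16 * 8 * 4 = 512.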
enable_gradient_checkpointing:
expected_impact: 2
ui_display_name: Enable Gradient Checkpointing
default_value_reasoning:
Gradient checkpointing is a technique to reduce the memory footprint of the model by
trading compute for memory. This is useful when training very large models that run into out of memory
errors very quickly during training. It is particularly helpful when doing non-quantization based training
(adapter based or full fine-tuning). Gradient checkpointing works by recomputing the activations of the
model during the backward pass, rather than storing them in memory during the forward pass.
This is a tradeoff between compute and memory, as the activations need to be recomputed during
the backward pass, but the memory footprint is reduced. This is set to false by default because
it is not always beneficial to use gradient checkpointing, and it can sometimes slow down training.
validation_field:
default_value_reasoning:
Concrete evaluation metrics are usually better than loss,
the penalty for a bad prediction, which is only a proxy for prediction correctness.
description_implications:
This parameter affects 1) what the early stopping policy
looks at to determine when to early stop and 2) hyperparameter optimization
for determining the best trial.
expected_impact: 1
related_parameters:
- validation_metric
suggested_values: default behavior
ui_display_name: Validation Field
validation_metric:
description_implications:
This parameter affects 1) what the early stopping policy
looks at to determine when to early stop and 2) hyperparameter optimization
for determining the best trial.
expected_impact: 1
related_parameters:
- validation_field
suggested_values: default behavior
ui_display_name: Validation Metric
learning_rate_scheduler:
warmup_evaluations:
default_value_reasoning:
"Learning rate warmup is most commonly used when training with large batch sizes / distributed
training to avoid taking overly large steps at the beginning of training that might result in the
process getting stuck in a local optimum. Conventional wisdom when training with large batch sizes is
to use a larger learning rate (see: `learning_rate_scaling`) but gradually warm up to the larger learning
rate over a few epochs of training in the beginning.
Even when not training with large batch sizes, the randomness of how weights are initialized can result
in strange, noisy gradient updates during the beginning of your training run. As such, it's generally
recommended to use a small amount of warmup (e.g., 1 epoch / evaluation) even when the batch size is
relatively small."
description_implications:
Learning rate warmup sets a very low learning rate at the beginning of training and gradually
(linearly) increases to the base learning rate each step (batch) during training.
After your warmup steps you use your "regular" learning rate or learning rate scheduler.
expected_impact: 2
related_parameters:
- warmup_fraction
- learning_rate_scaling
literature_references:
- https://arxiv.org/abs/1711.00489
- https://datascience.stackexchange.com/questions/55991/in-the-context-of-deep-learning-what-is-training-warmup-steps
suggested_values: 0 - 5
suggested_values_reasoning:
You don't want to warm up for too long, as after the model is starting to hill climb, you want to use the
full weight of the learning rate to descend into good loss minima.
If you observe your loss curve converging very early into training, within the first few epochs, then
increasing learning rate warmup may help to mitigate this effect. Pretrained models can benefit from more
warmup to help offset the effects of catastrophic forgetting due to an overly high learning rate.
ui_display_name: Warmup Evaluations
warmup_fraction:
default_value_reasoning:
Similar to `warmup_evaluations` but expressed as a fraction of the total number of training steps, rather
than a certain number of evaluation phases.
description_implications: See `warmup_evaluations`.
expected_impact: 2
related_parameters:
- warmup_evaluations
- learning_rate_scaling
suggested_values: 0.05 - 0.2
suggested_values_reasoning:
You don't want to warm up for too long, as after the
model is starting to hill climb, you want to use the full weight of the learning
rate to descend into good loss minima.
ui_display_name: Warmup Fraction
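# Worked example (hypothetical numbers; assumes warmup rises linearly from 0 to
# the base learning rate, as described for `warmup_evaluations`): with a base
# learning rate of 0.001, 10,000 total training steps, and warmup_fraction 0.1,
# warmup covers the first 1,000 steps.
#
#   step    0: learning rate 0.0
#   step  500: learning rate 0.001 * 500 / 1000 = 0.0005
#   step 1000: learning rate 0.001 (warmup complete)
#
# trainer:
#   learning_rate: 0.001
#   learning_rate_scheduler:
#     warmup_fraction: 0.1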
decay:
description_implications:
"It\u2019s almost always a good idea to use a schedule.\
\ For most models, try the exponential decay schedule first.\n\nThe exponential\
\ schedule divides the learning rate by the same factor (%) every epoch. This\
\ means that the learning rate will decrease rapidly in the first few epochs,\
\ and spend more epochs with a lower value, but never reach exactly zero.\
\ As a rule of thumb, compared to training without a schedule, you can use\
\ a slightly higher maximum learning rate. Since the learning rate changes\
\ over time, the whole training is not so sensitive to the value picked."
expected_impact: 3
literature_references:
- "https://peltarion.com/knowledge-center/documentation/modeling-view/run-a-model/optimization-principles-(in-deep-learning)/learning-rate-schedule "
related_parameters:
- decay_rate
- decay_steps
- learning_rate
suggested_values: exponential
suggested_values_reasoning:
Exponential decay is a safe place to start, as it is a "softer" decrease in the learning
rate over time, as compared with linear, which is more steep after the initial drop. Linear decay is
most useful when the risk of catastrophic forgetting is very high (e.g., for fine-tuning pretrained
models). Cosine annealing is a type of learning rate schedule that has the effect of starting with a
large learning rate that is relatively rapidly decreased to a minimum value before being increased
rapidly again. The resetting of the learning rate acts like a simulated restart of the learning process.
If you observe your loss curves shooting up (even on the training set) in later epochs, increasing the
decay rate may help mitigate this effect.
ui_display_name: Decay
decay_rate:
default_value_reasoning:
4-5% decay each step is an empirically useful decay
rate to start with.
description_implications:
Increasing the decay rate will lower the learning rate
faster. This could make the model more robust to a bad (too high) initial
learning rate, but a decay rate that is too high could prohibit the model
from learning anything at all.
expected_impact: 2
literature_references:
- "https://peltarion.com/knowledge-center/documentation/modeling-view/run-a-model/optimization-principles-(in-deep-learning)/learning-rate-schedule "
related_parameters:
- decay_steps
- learning_rate
suggested_values: 0.9 - 0.96
suggested_values_reasoning:
Since this controls exponential decay, even a small
decay rate will still be strongly impactful.
ui_display_name: Decay Rate
decay_steps:
default_value_reasoning:
This default essentially enables the `learning_rate`
to decay by a factor of the `decay_rate` at 10000 training steps.
description_implications:
By increasing the value of decay steps, you are increasing
the number of training steps it takes to decay the learning rate by a factor
of `decay_rate`. In other words, the bigger this parameter, the slower the
learning rate decays.
example_value:
- 5000
expected_impact: 2
related_parameters:
- decay_rate
- learning_rate
suggested_values: 10000 +/- 500 at a time
suggested_values_reasoning:
The decay exponent is calculated as the
training step divided by `decay_steps`, plus one. The `decay_rate`
is raised to the power of this exponent, and the result is multiplied by the
learning rate. In other words, the learning rate decays by a further
factor of `decay_rate` each time the training step advances by `decay_steps`.
You can think of `decay_steps` as controlling how quickly the `decay_rate`
is applied.
ui_display_name: Decay Steps
staircase:
default_value_reasoning: Performs learning rate decay in a stepwise, discrete manner.
description_implications:
An excessively aggressive decay results in optimizers
never reaching the minima, whereas a slow decay leads to chaotic updates without
significant improvement. Discrete learning rate decay is another option for
tuning this balance.
expected_impact: 1
literature_references:
- https://neptune.ai/blog/how-to-choose-a-learning-rate-scheduler
suggested_values: false
suggested_values_reasoning:
We have not found strong evidence that discretely
decaying the learning rate is superior to doing so continuously in general,
but in specific tasks it might have a positive impact.
ui_display_name: Staircase
reduce_on_plateau:
expected_impact: 3
ui_display_name: Reduce On Plateau
reduce_on_plateau_patience:
expected_impact: 2
ui_display_name: Reduce On Plateau Patience
reduce_on_plateau_rate:
expected_impact: 2
ui_display_name: Reduce On Plateau Rate
reduce_eval_metric:
expected_impact: 1
ui_display_name: Reduce Eval Metric
reduce_eval_split:
expected_impact: 1
ui_display_name: Reduce Eval Split
t_0:
expected_impact: 1
ui_display_name: T_0
t_mult:
expected_impact: 1
ui_display_name: T_mult
eta_min:
expected_impact: 1
ui_display_name: Eta Min
gbm:
learning_rate:
commonly_used: true
default_value_reasoning: Middle of the road learning rate to start with.
description_implications:
The learning rate is a hyperparameter that controls
how much to change the model in response to the estimated error each time
the model weights are updated. Increasing the learning rate may decrease learning
curve stability but also increase learning speed and efficiency, leading to
faster model convergence. Decreasing the learning rate can help stabilize
learning curves at the cost of slower time to convergence.
expected_impact: 3
suggested_values: 0.00001 - 0.1 or auto
related_parameters:
- decay
suggested_values_reasoning:
Tabular models trained from scratch typically use
learning rates around 1e-3 while learning rates for pre-trained models should
be much smaller, typically around 1e-5, which is important to mitigate catastrophic
forgetting. To make the model more robust to any specific choice of learning
rate, consider enabling learning rate decay.
ui_display_name: Learning Rate
early_stop:
default_value_reasoning:
Deep learning models are prone to overfitting. It's generally
a good policy to set up some early stopping criteria as it's not useful to
have a model train after it's maximized what it can learn. 5 consecutive rounds
of evaluation where there hasn't been any improvement on the validation set
(including chance) is a reasonable policy to start with.
description_implications:
Decreasing this value is a more aggressive policy. A lower
early stopping value makes model training less forgiving, as the model has less
runway to demonstrate a metric improvement before the training
run is stopped. This can be efficient for pruning bad models earlier, but since
the training process is inherently non-deterministic and noisy, sometimes
improvements happen very gradually over a long period of time.
Increasing this value leads to longer training times,
but potentially also better final performance.
expected_impact: 3
related_parameters:
- epochs
- train_steps
suggested_values: 5 - 10
suggested_values_reasoning:
There's potentially a lot of randomness in how models
train, but this many consecutive rounds of no improvement is usually a good
indicator that the model has converged or overfit.
ui_display_name: Early Stop
eval_batch_size:
default_value_reasoning: Use the same batch size used for training.
description_implications:
By increasing the `eval_batch_size` past the `batch_size`
parameter set value, you allow for more parallelism in the batch evaluation
step and speed up evaluation. For example, if you have to evaluate the model
on a test set of size 1000, it is faster to evaluate two times with two batches
of size 500 as opposed to ten times with ten batches of 100.
Setting this parameter higher without exceeding memory limits
will speed up the model training process overall.
example_value:
- 512
expected_impact: 1
other_information:
Only set `eval_batch_size` to a value that you can fit
in memory.
related_parameters:
- batch_size
suggested_values:
- 256
- 512
- 1024
suggested_values_reasoning:
By observing memory consumption on training jobs,
you can get a sense of how much extra memory is available for increasing this
value. A good rule of thumb can be experimentally doubling the eval batch
size if you do not have insight into memory usage.
ui_display_name: Evaluation Batch Size
evaluate_training_set:
default_value_reasoning:
It could be useful to monitor evaluation metrics on the
training set to understand convergence.
description_implications:
Running evaluation on the full training set, when your
training set is large, can be a huge computational cost. Turning off training
set evaluation will lead to significant gains in training throughput and efficiency.
For small datasets that train and evaluate quickly, the choice is trivial.
expected_impact: 1
suggested_values: false
suggested_values_reasoning:
Running full-scale evaluation on the full training
set doesn't usually provide any useful information over the validation dataset.
Even with this set to False, continuous training loss metrics are still computed,
so it will still be easy to spot signs of overfitting like when the training-validation
loss curves diverge.
ui_display_name: Evaluate Training Set
validation_field:
default_value_reasoning:
Concrete evaluation metrics are usually better than loss,
the penalty for a bad prediction, which is only a proxy for prediction correctness.
description_implications:
This parameter affects 1) what the early stopping policy
looks at to determine when to early stop and 2) hyperparameter optimization
for determining the best trial.
expected_impact: 1
related_parameters:
- validation_metric
suggested_values: default behavior
ui_display_name: Validation Field
validation_metric:
description_implications:
This parameter affects 1) what the early stopping policy
looks at to determine when to early stop and 2) hyperparameter optimization
for determining the best trial.
expected_impact: 1
related_parameters:
- validation_field
suggested_values: default behavior
ui_display_name: Validation Metric
max_depth:
expected_impact: 3
drop_rate:
expected_impact: 2
tree_learner:
expected_impact: 2
boosting_type:
expected_impact: 3
boosting_rounds_per_checkpoint:
expected_impact: 2
num_boost_round:
expected_impact: 2
num_leaves:
expected_impact: 2
min_data_in_leaf:
expected_impact: 2
min_sum_hessian_in_leaf:
expected_impact: 1
bagging_fraction:
expected_impact: 3
pos_bagging_fraction:
expected_impact: 2
neg_bagging_fraction:
expected_impact: 2
bagging_freq:
expected_impact: 2
bagging_seed:
expected_impact: 2
feature_fraction:
expected_impact: 3
feature_fraction_bynode:
expected_impact: 2
feature_fraction_seed:
expected_impact: 2
extra_trees:
expected_impact: 3
extra_seed:
expected_impact: 2
max_delta_step:
expected_impact: 1
lambda_l1:
expected_impact: 3
lambda_l2:
expected_impact: 3
linear_lambda:
expected_impact: 2
min_gain_to_split:
expected_impact: 1
max_drop:
expected_impact: 2
skip_drop:
expected_impact: 2
xgboost_dart_mode:
expected_impact: 1
uniform_drop:
expected_impact: 2
drop_seed:
expected_impact: 2
top_rate:
expected_impact: 1
other_rate:
expected_impact: 1
min_data_per_group:
expected_impact: 1
max_cat_threshold:
expected_impact: 1
cat_l2:
expected_impact: 1
cat_smooth:
expected_impact: 1
max_cat_to_onehot:
expected_impact: 1
cegb_tradeoff:
expected_impact: 1
cegb_penalty_split:
expected_impact: 1
path_smooth:
expected_impact: 1
verbose:
expected_impact: 1
max_bin:
expected_impact: 1
feature_pre_filter:
expected_impact: 1
llm:
type:
commonly_used: true
default_value_reasoning:
It's useful to start with zero-shot or few-shot learning to see what the model
can do as a baseline before fine-tuning.
suggested_values: none or finetune
suggested_values_reasoning:
If you want to perform zero shot learning or few shot learning, you should set this to `none`.
If you want to perform fine-tuning, you should set this to `finetune`.
ui_display_name: Trainer Type
expected_impact: 3
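As a rough illustration of the `decay_rate`, `decay_steps`, and `staircase` entries earlier in this file, the sketch below follows the wording under `decay_steps`: the exponent is the training step divided by `decay_steps`, plus one, and `decay_rate` is raised to that power. The function name `decayed_learning_rate` is hypothetical, and rounding the quotient down when `staircase` is enabled is an assumption made for illustration; none of this is taken from Ludwig's scheduler code.

```python
import math


def decayed_learning_rate(base_lr, decay_rate, decay_steps, step, staircase=False):
    """Hypothetical helper illustrating exponential learning rate decay."""
    progress = step / decay_steps
    if staircase:
        # Assumption: decay in discrete jumps by rounding the quotient down.
        progress = math.floor(progress)
    exponent = progress + 1
    return base_lr * decay_rate**exponent


# With decay_rate=0.96 and decay_steps=10000, each additional 10000 steps
# multiplies the learning rate by a further factor of 0.96.
lr_at_0 = decayed_learning_rate(0.001, 0.96, 10000, step=0)
lr_at_10k = decayed_learning_rate(0.001, 0.96, 10000, step=10000)
ratio = lr_at_10k / lr_at_0  # approximately 0.96
```

Under these assumptions, a larger `decay_steps` stretches the same decay over more training steps, which matches the "bigger means slower" description above.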
================================================
FILE: ludwig/schema/metadata/feature_metadata.py
================================================
================================================
FILE: ludwig/schema/metadata/parameter_metadata.py
================================================
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any
from dataclasses_json import dataclass_json
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.misc_utils import memoized_method
@DeveloperAPI
class ExpectedImpact(int, Enum):
"""The expected impact of determining a "good" value for a specific parameter.
- HIGH: this parameter should almost always be included in a hyperopt run and can make or break a good model.
- MEDIUM: this parameter can sometimes make or break a good model.
- LOW: this parameter usually does not have a significant impact on model performance.
"""
UNKNOWN = 0
LOW = 1
MEDIUM = 2
HIGH = 3
@DeveloperAPI
class ComputeTier(int, Enum):
"""The compute tier defines the type of compute resources that a model typically needs to get good
throughput."""
CPU = 0
"""Model can train effectively on CPU hardware."""
GPU_LOW = 1
"""Model can train effectively on commodity GPU hardware, or inference optimized SKUs like NVIDIA T4."""
GPU_MEDIUM = 2
"""Model can train effectively on training-optimized GPU hardware like V100, A10G, or A5000."""
GPU_HIGH = 3
"""Model requires high-end GPUs like A100 or H100 to achieve good throughput."""
@DeveloperAPI
@dataclass_json()
@dataclass
class ParameterMetadata:
"""Contains descriptive information that pertains to a Ludwig configuration parameter."""
short_description: str = ""
"""Quick description generally for UI display."""
long_description: str = ""
"""In depth description generally for documentation purposes."""
ui_display_name: str | None = ""
"""How this parameter can be displayed in a human-readable form."""
default_value_reasoning: str | None = None
"""The reasoning behind the default value for this parameter."""
example_value: list[Any] | None = None
"""Examples of other values that can be used for this parameter."""
related_parameters: list[str] | None = None
"""List of related parameters that this parameter interacts with or depends on."""
other_information: str | None = None
"""Other information that is relevant for this parameter."""
description_implications: str | None = None
"""The intuition for how model performance would change if this parameter is changed."""
suggested_values: Any = None
"""What values would a machine learning expert suggest users try to help improve their model?
Should cover 95% (2-sigma) worth of use-cases.
"""
suggested_values_reasoning: str | None = None
"""The reasoning behind the suggested values, as well as model performance indicators or other intuition that
could help inform a user to make an educated decision about what values to experiment with for this
parameter."""
commonly_used: bool = False
"""True if this parameter could be frequently used, would have a high impact, and/or would be interesting for a
machine learning practitioner."""
expected_impact: ExpectedImpact = ExpectedImpact.UNKNOWN
"""The expected impact of determining a "good" value for this parameter."""
literature_references: list[str] | None = None
"""List of links, papers, and blog posts to learn more."""
internal_only: bool = False
"""True if this parameter is used strictly internally and should not be exposed to users."""
compute_tier: ComputeTier = ComputeTier.CPU
"""The compute tier defines the type of compute resources that a model typically needs to get good
throughput."""
ui_component_type: str | None = None
"""Override for HTML component type that should be used to render this field in UIs."""
@memoized_method(maxsize=1)
def to_json_dict(self) -> dict[str, Any]:
return json.loads(self.to_json())
@DeveloperAPI
def convert_metadata_to_json(pm: ParameterMetadata) -> dict[str, Any]:
"""Converts a ParameterMetadata object to a plain JSON-compatible dict.
NOTE: to_json() returns a JSON string, so the result is passed through
json.loads (see to_json_dict) to obtain a dict.
"""
if not pm:
return ParameterMetadata().to_json_dict()
return pm.to_json_dict()
# This is a quick way to flag schema parameters as internal only via the `parameter_metadata` argument
INTERNAL_ONLY = ParameterMetadata(internal_only=True)
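The round trip performed by `to_json_dict` above (serialize to a JSON string, then parse it back) can be sketched with the standard library alone. `ImpactSketch` and `ParamMetaSketch` are hypothetical stand-ins that mirror only a few of the fields, and `dataclasses.asdict` plus `json.dumps` are used here in place of `dataclasses_json`, so this is an approximation of the idea rather than the module's behavior.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class ImpactSketch(int, Enum):
    """Hypothetical stand-in for ExpectedImpact."""

    UNKNOWN = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ParamMetaSketch:
    """Hypothetical stand-in carrying three of the metadata fields."""

    ui_display_name: str = ""
    expected_impact: ImpactSketch = ImpactSketch.UNKNOWN
    internal_only: bool = False

    def to_json_dict(self) -> dict:
        # Round-trip through a JSON string so enum members become plain ints.
        return json.loads(json.dumps(asdict(self)))


decay_rate_meta = ParamMetaSketch(ui_display_name="Decay Rate", expected_impact=ImpactSketch.MEDIUM)
as_dict = decay_rate_meta.to_json_dict()
```

The resulting dict contains only JSON-native types, which is what makes it safe to embed in a generated schema.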
================================================
FILE: ludwig/schema/model_config.py
================================================
# TODO(travis) consider removing this in the future after deprecation period
from ludwig.schema.model_types.base import ModelConfig # noqa
================================================
FILE: ludwig/schema/model_types/__init__.py
================================================
import ludwig.schema.model_types.ecd # noqa
import ludwig.schema.model_types.llm # noqa
================================================
FILE: ludwig/schema/model_types/base.py
================================================
import copy
from abc import ABC
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.config_validation.checks import get_config_check_registry
from ludwig.config_validation.validation import check_schema
from ludwig.constants import (
BACKEND,
COLUMN,
DEPENDENCIES,
ENCODER,
INPUT_FEATURES,
MODEL_ECD,
NAME,
OUTPUT_FEATURES,
TIED,
)
from ludwig.error import ConfigValidationError
from ludwig.globals import LUDWIG_VERSION
from ludwig.schema import utils as schema_utils
from ludwig.schema.defaults.base import BaseDefaultsConfig
from ludwig.schema.features.base import BaseInputFeatureConfig, BaseOutputFeatureConfig, FeatureCollection
from ludwig.schema.hyperopt import HyperoptConfig
from ludwig.schema.model_types.utils import (
merge_fixed_preprocessing_params,
merge_with_defaults,
sanitize_and_filter_combiner_entities_,
set_derived_feature_columns_,
set_hyperopt_defaults_,
set_llm_parameters,
set_preprocessing_parameters,
set_tagger_decoder_parameters,
set_validation_parameters,
)
from ludwig.schema.preprocessing import PreprocessingConfig
from ludwig.schema.trainer import BaseTrainerConfig
from ludwig.schema.utils import ludwig_dataclass
from ludwig.types import ModelConfigDict
from ludwig.utils.backward_compatibility import upgrade_config_dict_to_latest_version
from ludwig.utils.data_utils import get_sanitized_feature_name, load_yaml
from ludwig.utils.registry import Registry
model_type_schema_registry = Registry()
@DeveloperAPI
@ludwig_dataclass
class ModelConfig(schema_utils.BaseMarshmallowConfig, ABC):
input_features: FeatureCollection[BaseInputFeatureConfig]
output_features: FeatureCollection[BaseOutputFeatureConfig]
model_type: str
trainer: BaseTrainerConfig
preprocessing: PreprocessingConfig
defaults: BaseDefaultsConfig
hyperopt: HyperoptConfig | None = None
backend: dict[str, Any] = schema_utils.Dict() # TODO(jeffkinnison): Add backend schema
ludwig_version: str = schema_utils.ProtectedString(LUDWIG_VERSION)
def __post_init__(self):
merge_fixed_preprocessing_params(self)
set_validation_parameters(self)
set_hyperopt_defaults_(self)
set_tagger_decoder_parameters(self)
sanitize_and_filter_combiner_entities_(self)
# Reconcile LLM parameters
set_llm_parameters(self)
# Reconcile conflicting preprocessing parameters
set_preprocessing_parameters(self)
# Derive proc_col for each feature from the feature's preprocessing parameters
# after all preprocessing parameters have been set
set_derived_feature_columns_(self)
# Auxiliary checks.
get_config_check_registry().check_config(self)
@staticmethod
def from_dict(config: ModelConfigDict) -> "ModelConfig":
config = copy.deepcopy(config)
config = upgrade_config_dict_to_latest_version(config)
# Use sanitized feature names.
# NOTE: This must be kept consistent with build_dataset()
for input_feature in config[INPUT_FEATURES]:
input_feature[NAME] = get_sanitized_feature_name(input_feature[NAME])
if COLUMN in input_feature and input_feature[COLUMN]:
input_feature[COLUMN] = get_sanitized_feature_name(input_feature[COLUMN])
for output_feature in config[OUTPUT_FEATURES]:
output_feature[NAME] = get_sanitized_feature_name(output_feature[NAME])
if COLUMN in output_feature and output_feature[COLUMN]:
output_feature[COLUMN] = get_sanitized_feature_name(output_feature[COLUMN])
# Sanitize tied feature names.
for input_feature in config[INPUT_FEATURES]:
if TIED in input_feature and input_feature[TIED]:
input_feature[TIED] = get_sanitized_feature_name(input_feature[TIED])
# Sanitize dependent feature names.
for output_feature in config[OUTPUT_FEATURES]:
if DEPENDENCIES in output_feature and output_feature[DEPENDENCIES]:
output_feature[DEPENDENCIES] = [
get_sanitized_feature_name(feature_name) for feature_name in output_feature[DEPENDENCIES]
]
config["model_type"] = config.get("model_type", MODEL_ECD)
model_type = config["model_type"]
if model_type not in model_type_schema_registry:
raise ConfigValidationError(
f"Invalid model type: '{model_type}', expected one of: {list(model_type_schema_registry.keys())}"
)
config = merge_with_defaults(config)
# TODO(travis): handle this with helper function
backend = config.get(BACKEND)
if isinstance(backend, str):
config[BACKEND] = {"type": backend}
# JSON schema validation. Note that this is desirable on top of `schema.load(config)` below because marshmallow
# deserialization permits additional properties while JSON schema validation, for schema (e.g. `trainer`) that
# have `additionalProperties=False`, does not.
#
# Illustrative example: test_validate_config_misc.py::test_validate_no_trainer_type
#
# TODO: Set `additionalProperties=False` for all Ludwig schema, and look into passing in `unknown='RAISE'` to
# marshmallow.load(), which raises an error for unknown fields during deserialization.
# https://marshmallow.readthedocs.io/en/stable/marshmallow.schema.html#marshmallow.schema.Schema.load
check_schema(config)
cls = model_type_schema_registry[model_type]
schema = cls.get_class_schema()()
try:
config_obj: ModelConfig = schema.load(config)
except ConfigValidationError:
raise
except ValueError as e:
raise ConfigValidationError(f"Config validation error raised during config deserialization: {e}") from e
# ValueError is already handled by the clause above, so only OSError can reach this handler.
except OSError as e:
raise ConfigValidationError(f"Config validation error raised during config post-init: {e}") from e
return config_obj
@staticmethod
def from_yaml(config_path: str) -> "ModelConfig":
return ModelConfig.from_dict(load_yaml(config_path))
def get_feature_names(self) -> set[str]:
"""Returns a set of all feature names."""
feature_names = set()
feature_names.update([f.column for f in self.input_features])
feature_names.update([f.column for f in self.output_features])
return feature_names
def get_feature_config(
self, feature_column_name: str
) -> BaseInputFeatureConfig | BaseOutputFeatureConfig | None:
"""Returns the input or output feature config whose column matches the given name, or None if none matches."""
for feature in self.input_features:
if feature.column == feature_column_name:
return feature
for feature in self.output_features:
if feature.column == feature_column_name:
return feature
@DeveloperAPI
def register_model_type(name: str):
def wrap(model_type_config: type[ModelConfig]) -> type[ModelConfig]:
model_type_schema_registry[name] = model_type_config
return model_type_config
return wrap
def _merge_encoder_cache_params(preprocessing_params: dict[str, Any], encoder_params: dict[str, Any]) -> dict[str, Any]:
if preprocessing_params.get("cache_encoder_embeddings"):
preprocessing_params[ENCODER] = encoder_params
return preprocessing_params
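The registration and lookup pattern used by `register_model_type` and `ModelConfig.from_dict` above can be sketched in isolation. In this minimal example a plain dict stands in for `model_type_schema_registry`, `ValueError` stands in for `ConfigValidationError`, and the names `registry`, `register`, `resolve`, and `EcdSketch` are hypothetical.

```python
from typing import Any

# Hypothetical stand-in for `model_type_schema_registry`; a plain dict is assumed.
registry: dict[str, type] = {}


def register(name: str):
    """Returns a class decorator that stores the decorated class under `name`."""

    def wrap(cls: type) -> type:
        registry[name] = cls
        return cls

    return wrap


@register("ecd")
class EcdSketch:
    """Placeholder class; a real config class would define schema fields."""


def resolve(config: dict[str, Any]) -> type:
    """Looks up the class for config['model_type'], defaulting to 'ecd'."""
    model_type = config.get("model_type", "ecd")
    if model_type not in registry:
        raise ValueError(f"Invalid model type: '{model_type}', expected one of: {list(registry.keys())}")
    return registry[model_type]


default_cls = resolve({})  # no model_type given, so the "ecd" entry is used
```

Because the decorator returns the class unchanged, it can be stacked with other class decorators without affecting them.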
================================================
FILE: ludwig/schema/model_types/ecd.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.combiners.base import BaseCombinerConfig
from ludwig.schema.combiners.utils import CombinerSelection
from ludwig.schema.defaults.ecd import ECDDefaultsConfig, ECDDefaultsField
from ludwig.schema.features.base import (
BaseInputFeatureConfig,
BaseOutputFeatureConfig,
ECDInputFeatureSelection,
ECDOutputFeatureSelection,
FeatureCollection,
)
from ludwig.schema.hyperopt import HyperoptConfig, HyperoptField
from ludwig.schema.model_types.base import ModelConfig, register_model_type
from ludwig.schema.preprocessing import PreprocessingConfig, PreprocessingField
from ludwig.schema.trainer import ECDTrainerConfig, ECDTrainerField
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_model_type(name="ecd")
@ludwig_dataclass
class ECDModelConfig(ModelConfig):
"""Parameters for ECD."""
model_type: str = schema_utils.ProtectedString("ecd")
input_features: FeatureCollection[BaseInputFeatureConfig] = ECDInputFeatureSelection().get_list_field()
output_features: FeatureCollection[BaseOutputFeatureConfig] = ECDOutputFeatureSelection().get_list_field()
combiner: BaseCombinerConfig = CombinerSelection().get_default_field()
trainer: ECDTrainerConfig = ECDTrainerField().get_default_field()
preprocessing: PreprocessingConfig = PreprocessingField().get_default_field()
defaults: ECDDefaultsConfig = ECDDefaultsField().get_default_field()
hyperopt: HyperoptConfig | None = HyperoptField().get_default_field()
================================================
FILE: ludwig/schema/model_types/llm.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.schema import utils as schema_utils
from ludwig.schema.defaults.llm import LLMDefaultsConfig, LLMDefaultsField
from ludwig.schema.features.base import (
BaseInputFeatureConfig,
BaseOutputFeatureConfig,
FeatureCollection,
LLMInputFeatureSelection,
LLMOutputFeatureSelection,
)
from ludwig.schema.hyperopt import HyperoptConfig, HyperoptField
from ludwig.schema.llms.base_model import BaseModelDataclassField
from ludwig.schema.llms.generation import LLMGenerationConfig, LLMGenerationConfigField
from ludwig.schema.llms.model_parameters import ModelParametersConfig, ModelParametersConfigField
from ludwig.schema.llms.peft import AdapterDataclassField, BaseAdapterConfig
from ludwig.schema.llms.prompt import PromptConfig, PromptConfigField
from ludwig.schema.llms.quantization import QuantizationConfig, QuantizationConfigField
from ludwig.schema.model_types.base import ModelConfig, register_model_type
from ludwig.schema.preprocessing import PreprocessingConfig, PreprocessingField
from ludwig.schema.trainer import LLMTrainerConfig, LLMTrainerDataclassField
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@register_model_type(name="llm")
@ludwig_dataclass
class LLMModelConfig(ModelConfig):
"""Parameters for LLM Model Type."""
model_type: str = schema_utils.ProtectedString("llm")
base_model: str = BaseModelDataclassField()
input_features: FeatureCollection[BaseInputFeatureConfig] = LLMInputFeatureSelection().get_list_field()
output_features: FeatureCollection[BaseOutputFeatureConfig] = LLMOutputFeatureSelection().get_list_field()
preprocessing: PreprocessingConfig = PreprocessingField().get_default_field()
defaults: LLMDefaultsConfig | None = LLMDefaultsField().get_default_field()
hyperopt: HyperoptConfig | None = HyperoptField().get_default_field()
prompt: PromptConfig = PromptConfigField().get_default_field()
# trainer: LLMTrainerConfig = LLMTrainerField().get_default_field()
trainer: LLMTrainerConfig = LLMTrainerDataclassField(
description="The trainer to use for the model",
)
generation: LLMGenerationConfig = LLMGenerationConfigField().get_default_field()
adapter: BaseAdapterConfig | None = AdapterDataclassField()
quantization: QuantizationConfig | None = QuantizationConfigField().get_default_field()
model_parameters: ModelParametersConfig | None = ModelParametersConfigField().get_default_field()
trust_remote_code: bool = schema_utils.Boolean(
default=False,
description=(
"Whether to trust and execute remote code from the HuggingFace model repository. "
"Required for some models (e.g. Phi-2, Qwen) that use custom architectures. "
"Only enable this for models you trust."
),
)
================================================
FILE: ludwig/schema/model_types/utils.py
================================================
import copy
import logging
import sys
import warnings
from collections.abc import Mapping
from typing import Any, TYPE_CHECKING
from transformers import AutoConfig
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
CATEGORY,
COMBINED,
DECODER,
DEFAULTS,
ENCODER,
GRID_SEARCH,
INPUT_FEATURES,
LOSS,
MODEL_ECD,
MODEL_LLM,
OUTPUT_FEATURES,
PARAMETERS,
PREPROCESSING,
SEQUENCE,
SPACE,
TEXT,
TYPE,
)
from ludwig.error import ConfigValidationError
from ludwig.features.feature_utils import compute_feature_hash
from ludwig.schema.features.utils import output_config_registry
from ludwig.schema.hyperopt.scheduler import BaseHyperbandSchedulerConfig
from ludwig.schema.llms.generation import LLMGenerationConfig
from ludwig.schema.trainer import ECDTrainerConfig
from ludwig.types import HyperoptConfigDict, ModelConfigDict
from ludwig.utils.data_utils import get_sanitized_feature_name
from ludwig.utils.llm_utils import get_context_len
if TYPE_CHECKING:
from ludwig.schema.model_types.base import ModelConfig
logger = logging.getLogger(__name__)
@DeveloperAPI
def merge_with_defaults(config_dict: ModelConfigDict) -> ModelConfigDict:
# Recursive merge of the features, except that if we find a dictionary containing
# an explicit "type" key, we ignore defaults if they don't match.
defaults = config_dict.get(DEFAULTS)
if not defaults:
return config_dict
config_dict = copy.deepcopy(config_dict)
_merge_features_(config_dict.get(INPUT_FEATURES, []), defaults, {DECODER, LOSS})
_merge_features_(config_dict.get(OUTPUT_FEATURES, []), defaults, {ENCODER, PREPROCESSING})
return config_dict
def _merge_features_(features: list[dict[str, Any]], defaults: dict[str, Any], exclude_keys: set[str]):
for feature in features:
ftype = feature.get(TYPE)
if not ftype:
continue
default_feature = defaults.get(ftype, {})
merged_feature = _merge_dict_with_types(default_feature, feature, exclude_keys)
# In-place replacement of the old feature with the new
feature.clear()
feature.update(merged_feature)
def _merge_dict_with_types(dct: dict[str, Any], merge_dct: dict[str, Any], exclude_keys: set[str]) -> dict[str, Any]:
dct = copy.deepcopy(dct)
dct = {k: v for k, v in dct.items() if k not in exclude_keys}
for k, v in merge_dct.items():
# TODO(travis): below type comparison is not perfect, as it doesn't consider the case where the default type
# is omitted while the encoder type is explicitly set to the default type, in which case they
# should resolve to equal, but will be considered different.
if (
k in dct
and isinstance(dct[k], dict)
and isinstance(v, Mapping)
and dct[k].get(TYPE) == v.get(TYPE, dct[k].get(TYPE))
):
dct[k] = _merge_dict_with_types(dct[k], v, exclude_keys)
else:
dct[k] = v
return dct
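The type-aware merge above can be illustrated with a small standalone example. `merge_with_types` is a hypothetical, simplified variant of `_merge_dict_with_types` that hard-codes the `"type"` key and omits `exclude_keys`; the encoder names and parameters are illustrative values only.

```python
import copy
from collections.abc import Mapping
from typing import Any


def merge_with_types(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Recursively merges `overrides` into a copy of `defaults`.

    Nested dicts are merged only when their "type" values agree (or the
    override omits "type"); otherwise the override replaces the default.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        default_value = merged.get(key)
        if (
            isinstance(default_value, dict)
            and isinstance(value, Mapping)
            and default_value.get("type") == value.get("type", default_value.get("type"))
        ):
            merged[key] = merge_with_types(default_value, value)
        else:
            merged[key] = value
    return merged


text_defaults = {"encoder": {"type": "parallel_cnn", "num_layers": 2}}

# Override omits "type", so the default encoder parameters are kept and extended.
same_type = merge_with_types(text_defaults, {"name": "review", "encoder": {"dropout": 0.1}})

# Override names a different encoder type, so the default encoder is discarded.
other_type = merge_with_types(text_defaults, {"name": "review", "encoder": {"type": "rnn"}})
```

Discarding mismatched defaults avoids carrying parameters that belong to one encoder type into a feature that selects another.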
@DeveloperAPI
def merge_fixed_preprocessing_params(config: "ModelConfig"):
"""Update preprocessing parameters if encoders require fixed preprocessing parameters."""
for feature in config.input_features:
feature.encoder.set_fixed_preprocessing_params(config.model_type, feature.preprocessing)
def set_validation_parameters(config: "ModelConfig"):
"""Sets validation-related parameters used for early stopping, determining the best hyperopt trial, etc."""
if not config.output_features:
return
# First set the validation field so we know what feature we're validating on
if not config.trainer.validation_field:
if config.trainer.validation_metric is None or config.trainer.validation_metric == LOSS:
# Loss is valid for all features.
config.trainer.validation_field = config.output_features[0].name
else:
# Determine the proper validation field for the user, like if the user specifies "accuracy" but forgets to
# change the validation field from "combined" to the name of the feature that produces accuracy metrics.
from ludwig.utils.metric_utils import get_feature_to_metric_names_map
feature_to_metric_names_map = get_feature_to_metric_names_map(config.output_features.to_list())
validation_field = None
for feature_name, metric_names in feature_to_metric_names_map.items():
if config.trainer.validation_metric in metric_names:
if validation_field is None:
validation_field = feature_name
else:
raise ConfigValidationError(
f"The validation_metric: '{config.trainer.validation_metric}' corresponds to multiple "
f"possible validation_fields, '{validation_field}' and '{feature_name}'. Please explicitly "
"specify the validation_field that should be used with the validation_metric "
f"'{config.trainer.validation_metric}'."
)
if validation_field is None:
raise ConfigValidationError(
"User-specified trainer.validation_metric is not valid for any output feature."
)
config.trainer.validation_field = validation_field
# If the field is combined, then make sure the metric is loss and then return
if config.trainer.validation_field == COMBINED:
# Only loss is supported for combined
if not config.trainer.validation_metric:
config.trainer.validation_metric = LOSS
elif config.trainer.validation_metric != LOSS:
raise ConfigValidationError(
f"Must set validation_metric=loss when validation_field=combined, "
f"found validation_metric={config.trainer.validation_metric}"
)
return
# Field is not combined, so use the default validation metric for the single feature
validation_features = [f for f in config.output_features if f.name == config.trainer.validation_field]
if len(validation_features) > 1:
raise ConfigValidationError(
f"Found more than one feature matching validation field: {config.trainer.validation_field}"
)
if len(validation_features) == 0:
raise ConfigValidationError(
f"No output feature found matching validation field: {config.trainer.validation_field}"
)
validation_feature = validation_features[0]
if not config.trainer.validation_metric:
# The user has not explicitly set any validation fields.
# Default to using the first output feature's default validation metric.
out_type = validation_feature.type
config.trainer.validation_metric = output_config_registry(config.model_type)[out_type].default_validation_metric
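The branch of `set_validation_parameters` that infers `validation_field` from a non-loss `validation_metric` amounts to a uniqueness check over a feature-to-metrics map. The sketch below shows that check on its own; `resolve_validation_field` and the sample map are hypothetical, and `ValueError` stands in for `ConfigValidationError`.

```python
def resolve_validation_field(feature_to_metrics: dict[str, set[str]], metric: str) -> str:
    """Returns the single feature that reports `metric`, or raises if none or several do."""
    matches = [name for name, metrics in feature_to_metrics.items() if metric in metrics]
    if not matches:
        raise ValueError(f"validation_metric '{metric}' is not valid for any output feature.")
    if len(matches) > 1:
        raise ValueError(
            f"validation_metric '{metric}' corresponds to multiple possible validation_fields: {matches}."
        )
    return matches[0]


# Illustrative map from output feature name to the metrics it reports.
metrics_by_feature = {
    "label": {"loss", "accuracy"},
    "price": {"loss", "mean_squared_error"},
}

field = resolve_validation_field(metrics_by_feature, "accuracy")  # only "label" reports accuracy
```

Note that the function above handles `loss` before reaching this check by picking the first output feature, so the sketch covers only the non-loss case.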
def set_derived_feature_columns_(config_obj: "ModelConfig"):
"""Assigns column and proc_column values to features that do not have them set.
Proc_column is set to a hash of the feature's preprocessing configuration.
"""
for feature in config_obj.input_features:
if feature.column is None:
feature.column = feature.name
if feature.proc_column is None:
feature.proc_column = compute_feature_hash(feature.to_dict())
for feature in config_obj.output_features:
if feature.column is None:
feature.column = feature.name
if feature.proc_column is None:
feature.proc_column = compute_feature_hash(feature.to_dict())
def sanitize_and_filter_combiner_entities_(config: "ModelConfig"):
if config.model_type != MODEL_ECD or config.combiner.type != "comparator":
return
input_feature_names = {input_feature.name for input_feature in config.input_features}
# Sanitize feature names.
config.combiner.entity_1 = [get_sanitized_feature_name(fname) for fname in config.combiner.entity_1]
config.combiner.entity_2 = [get_sanitized_feature_name(fname) for fname in config.combiner.entity_2]
entity_1_excluded = {fname for fname in config.combiner.entity_1 if fname not in input_feature_names}
if entity_1_excluded:
logger.warning(
f"Excluding `entity_1` features {entity_1_excluded} from the comparator combiner because they are not "
f"present in the `input_features`."
)
config.combiner.entity_1 = [fname for fname in config.combiner.entity_1 if fname not in entity_1_excluded]
entity_2_excluded = {fname for fname in config.combiner.entity_2 if fname not in input_feature_names}
if entity_2_excluded:
logger.warning(
f"Excluding `entity_2` features {entity_2_excluded} from the comparator combiner because they are not "
f"present in the `input_features`."
)
config.combiner.entity_2 = [fname for fname in config.combiner.entity_2 if fname not in entity_2_excluded]
def set_hyperopt_defaults_(config: "ModelConfig"):
"""This function was migrated from defaults.py with the intention of setting some hyperopt defaults while the
hyperopt section of the config object is not fully complete.
Returns:
None -> modifies trainer and hyperopt sections
"""
if not config.hyperopt:
return
# Set default num_samples based on search space if not set by user
if config.hyperopt.executor.num_samples is None:
_contains_grid_search_params = contains_grid_search_parameters(config.hyperopt.to_dict())
if _contains_grid_search_params:
logger.info(
"Setting hyperopt num_samples to 1 to prevent duplicate trials from being run. Duplicate trials are"
" created when there are hyperopt parameters that use the `grid_search` search space.",
)
config.hyperopt.executor.num_samples = 1
else:
logger.info("Setting hyperopt num_samples to 10.")
config.hyperopt.executor.num_samples = 10
scheduler = config.hyperopt.executor.scheduler
if scheduler.type == "fifo":
# FIFO scheduler has no constraints
return
# Disable early stopping when using a scheduler. We achieve this by setting the parameter
# to -1, which ensures the condition to apply early stopping is never met.
early_stop = config.trainer.early_stop
if early_stop is not None and early_stop != -1:
warnings.warn("Can't utilize `early_stop` while using a hyperopt scheduler. Setting early stop to -1.")
config.trainer.early_stop = -1
if isinstance(config.trainer, ECDTrainerConfig) and isinstance(scheduler, BaseHyperbandSchedulerConfig):
# TODO(travis): explore similar constraints for other model types that may not have epochs
max_t = scheduler.max_t
time_attr = scheduler.time_attr
epochs = config.trainer.epochs
if max_t is not None:
if time_attr == "time_total_s":
if epochs is None:
# Continue training until time limit hit
config.trainer.epochs = sys.maxsize
# else continue training until either time or trainer epochs limit hit
elif epochs is not None and epochs != max_t:
raise ValueError(
"Cannot set trainer `epochs` when using hyperopt scheduler w/different training_iteration `max_t`. "
"Unset one of these parameters in your config or make sure their values match."
)
else:
# Run trainer until scheduler epochs limit hit
config.trainer.epochs = max_t
elif epochs is not None:
scheduler.max_t = epochs # run scheduler until trainer epochs limit hit
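# Illustrative sketch (not part of Ludwig): the epochs/max_t reconciliation above, rewritten as a
# pure function so the branches are easier to follow. The name `reconcile_epochs_and_max_t` and
# the tuple return value are assumptions made for this example; the real code mutates the config.

```python
import sys
from typing import Optional, Tuple


def reconcile_epochs_and_max_t(
    epochs: Optional[int], max_t: Optional[int], time_attr: str
) -> Tuple[Optional[int], Optional[int]]:
    """Return the (epochs, max_t) pair after applying the scheduler constraints."""
    if max_t is not None:
        if time_attr == "time_total_s":
            # max_t is a time budget, so epochs only needs to be unbounded if unset.
            if epochs is None:
                epochs = sys.maxsize
        elif epochs is not None and epochs != max_t:
            # max_t counts training iterations and disagrees with the trainer.
            raise ValueError("trainer `epochs` and scheduler `max_t` must match")
        else:
            # Run the trainer until the scheduler's iteration limit is hit.
            epochs = max_t
    elif epochs is not None:
        # Run the scheduler until the trainer's epoch limit is hit.
        max_t = epochs
    return epochs, max_t
```

# For example, `reconcile_epochs_and_max_t(50, None, "training_iteration")` copies the trainer
# limit onto the scheduler, while mismatched iteration limits raise a `ValueError`.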
def set_preprocessing_parameters(config: "ModelConfig") -> None: # noqa: F821
"""Reconcile conflicting preprocessing parameters in place."""
_set_max_sequence_length(config)
def _set_max_sequence_length(config: "ModelConfig") -> None: # noqa: F821
"""Ensures that `max_sequence_length` is never less than `sequence_length`."""
types_with_sequence_length = [SEQUENCE, TEXT]
for input_feature in config.input_features:
if input_feature.type in types_with_sequence_length:
sequence_length = input_feature.preprocessing.sequence_length
max_sequence_length = input_feature.preprocessing.max_sequence_length
if sequence_length is not None and sequence_length > max_sequence_length:
warnings.warn(
"if `sequence_length` is not None, `max_sequence_length` must be greater than or equal "
"to `sequence_length`. Setting `max_sequence_length` to `sequence_length`."
)
input_feature.preprocessing.max_sequence_length = sequence_length
def set_tagger_decoder_parameters(config: "ModelConfig") -> None:
"""Overrides the reduce_input parameter for text and sequence output features when a tagger decoder is used.
This is done to ensure that the decoder correctly gets a 3D tensor as input.
Returns:
None -> modifies output_features
"""
for output_feature in config.output_features:
if output_feature.type in {TEXT, SEQUENCE} and output_feature.decoder.type == "tagger":
if output_feature.reduce_input is not None:
warnings.warn(
"reduce_input must be set to `None` when using a tagger decoder for your output feature. "
f"Setting reduce_input to `None` for `{output_feature.name}`."
)
output_feature.reduce_input = None
def set_llm_parameters(config: "ModelConfig") -> None:
if config.model_type != MODEL_LLM:
return
# Set preprocessing parameters for text features for LLM model type
_set_llm_tokenizers(config)
# Set max_new_tokens in generation config to the max sequence length of the output features
_set_generation_max_new_tokens(config)
# HACK(Arnav): Set Mixtral target modules when using LoRA
# GitHub issue: https://github.com/ludwig-ai/ludwig/issues/3853
# PEFT PR: https://github.com/huggingface/peft/pull/1376
_set_mixtral_target_modules(config)
# HACK(Arnav): Set Phi-2 target modules when using LoRA
# GitHub issue: https://github.com/ludwig-ai/ludwig/issues/3910
# PEFT PR: https://github.com/huggingface/peft/pull/1375
_set_phi2_target_modules(config)
# HACK(Arnav): Set Phi-3 target modules when using LoRA
_set_phi3_target_modules(config)
# HACK(Arnav): Set Gemma target modules when using LoRA
# GitHub issue: https://github.com/ludwig-ai/ludwig/issues/3937
# PEFT PR: https://github.com/huggingface/peft/pull/1499
_set_gemma_target_modules(config)
def _set_llm_tokenizers(config: "ModelConfig") -> None:
"""Sets the tokenizers for the LLM model to the pretrained model name or path. This ensures that they use the
correct shared vocabulary from the tokenizer.
This also ensures padding is correctly set to left padding to prevent the LLM from trying to continue the sequence
based on the right padding tokens, which might exist based on sequence length.
"""
pretrained_model_name_or_path = config.base_model
if not isinstance(pretrained_model_name_or_path, str) or pretrained_model_name_or_path is None:
raise ValueError("Must set `base_model` when using the LLM model.")
for input_feature in config.input_features:
if input_feature.type == TEXT:
input_feature.preprocessing.tokenizer = "hf_tokenizer"
input_feature.preprocessing.pretrained_model_name_or_path = pretrained_model_name_or_path
input_feature.preprocessing.padding = "left"
for output_feature in config.output_features:
if output_feature.type == TEXT:
# Add tokenizer parameters to preprocessing so it can be used during post processing
output_feature.preprocessing.tokenizer = "hf_tokenizer"
output_feature.preprocessing.pretrained_model_name_or_path = pretrained_model_name_or_path
output_feature.preprocessing.padding = "left"
# Add tokenizer parameters to decoder so it can be used during the forward pass
output_feature.decoder.pretrained_model_name_or_path = pretrained_model_name_or_path
output_feature.decoder.max_new_tokens = config.generation.max_new_tokens
elif output_feature.type == CATEGORY:
# Tokenizer parameters
output_feature.decoder.tokenizer = "hf_tokenizer"
output_feature.decoder.pretrained_model_name_or_path = pretrained_model_name_or_path
# Parameters for building decoder vocabulary
output_feature.decoder.fallback_label = output_feature.preprocessing.fallback_label
def _get_maximum_possible_sequence_length(config: "ModelConfig", default_max_sequence_length: int) -> int:
"""Returns the maximum possible sequence length for the LLM model based on the model config."""
max_possible_sequence_length = default_max_sequence_length
if config.output_features[0].preprocessing.max_sequence_length is not None:
# Note: We don't need to check for max between feature.preprocessing.max_sequence_length and
# defaults.text.preprocessing.max_sequence_length because the latter is only applied to input features.
max_possible_sequence_length = max(
default_max_sequence_length, config.output_features[0].preprocessing.max_sequence_length
)
elif config.preprocessing.global_max_sequence_length is not None:
# This is not perfect since it includes tokens from both input + output features, but this at least
# ensures that the maximum possible sequence length is used. It is very likely that the model learns
# to generate sequences shorter than this value.
max_possible_sequence_length = max(
max_possible_sequence_length, config.preprocessing.global_max_sequence_length
)
elif max_possible_sequence_length == default_max_sequence_length:
# It's possible that both max_sequence_length and global_max_sequence_length are not set, in which case
# we should fall back to the window size of the pretrained model. By this point, because of schema validation
# checks, we know that the base_model exists so we can safely grab the base model's config.
# TODO (Arnav): Figure out how to factor in rope scaling factor into this calculation.
model_config = AutoConfig.from_pretrained(config.base_model)
max_possible_sequence_length = get_context_len(model_config)
# Artificially leave a buffer of half the total model window size to trade off
# runtime while likely covering a majority of the max sequence length.
max_possible_sequence_length = max_possible_sequence_length // 2
return max_possible_sequence_length
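# Illustrative sketch (not part of Ludwig): the fallback order used above, with the pretrained
# model lookup replaced by a plain `context_len` argument so the example runs without downloading
# a model config. The function name and all parameter names are assumptions for this example.

```python
from typing import Optional


def pick_max_new_tokens(
    default_len: int,
    output_max_len: Optional[int],
    global_max_len: Optional[int],
    context_len: int,
) -> int:
    """Mirror the three-way fallback for the generation length."""
    if output_max_len is not None:
        # 1. The output feature's own max_sequence_length wins, floored at the default.
        return max(default_len, output_max_len)
    if global_max_len is not None:
        # 2. Otherwise use the global limit, which also counts input tokens.
        return max(default_len, global_max_len)
    # 3. Neither is set: use half of the base model's context window.
    return context_len // 2
```

# With a default of 32, an output limit of 128 yields 128, an output limit of 16 yields 32, and
# with no limits at all a 4096-token context window yields 2048.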
def _set_generation_max_new_tokens(config: "ModelConfig") -> None:
"""Sets the max_new_tokens parameter in the generation config to the max sequence length of the output
features.
This ensures that the generation config is set to the correct value for the LLM model type.
"""
_DEFAULT_MAX_SEQUENCE_LENGTH = LLMGenerationConfig().max_new_tokens
if config.generation.max_new_tokens != _DEFAULT_MAX_SEQUENCE_LENGTH:
# Max new tokens is explicitly set by user, so don't override
return
if config.output_features[0].type != TEXT:
# This is trickier to set for other output features, so don't override for now.
# TODO: Add better support for category output features
return
max_possible_sequence_length = _get_maximum_possible_sequence_length(config, _DEFAULT_MAX_SEQUENCE_LENGTH)
logger.info(
f"Setting generation max_new_tokens to {max_possible_sequence_length} to correspond with the max "
"sequence length assigned to the output feature or the global max sequence length. This will ensure that "
"the correct number of tokens are generated at inference time. To override this behavior, set "
"`generation.max_new_tokens` to a different value in your Ludwig config."
)
config.generation.max_new_tokens = max_possible_sequence_length
def _set_mixtral_target_modules(config: "ModelConfig") -> None:
"""If the base model is Mixtral 8x7B, LoRA is enabled and the target modules are not set, set the target modules
to q_proj and v_proj."""
if config.base_model not in {"mistralai/Mixtral-8x7B-v0.1", "mistralai/Mixtral-8x7B-Instruct-v0.1"}:
return
if not config.adapter:
return
if config.adapter.type != "lora" or config.adapter.target_modules:
return
target_modules = ["q_proj", "v_proj"]
logger.info(f"Setting adapter target modules to {target_modules} for Mixtral 8x7B base model with LoRA adapter.")
config.adapter.target_modules = target_modules
def _set_phi2_target_modules(config: "ModelConfig") -> None:
"""If the base model is Phi-1, Phi-1.5 or Phi-2, LoRA is enabled and the target modules are not set, set the
target modules to maximize performance."""
if config.base_model not in {
"microsoft/phi-1",
"microsoft/phi-1_5",
"microsoft/phi-2",
}:
return
if not config.adapter:
return
if config.adapter.type != "lora" or config.adapter.target_modules:
return
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
logger.info(f"Setting adapter target modules to {target_modules} for Phi-2 base model with LoRA adapter.")
config.adapter.target_modules = target_modules
def _set_phi3_target_modules(config: "ModelConfig") -> None:
"""If the base model is Phi-3 mini, LoRA is enabled and the target modules are not set, set the target modules
to the attention and MLP projection layers."""
if config.base_model not in {
"microsoft/Phi-3-mini-4k-instruct",
"microsoft/Phi-3-mini-128k-instruct",
}:
return
if not config.adapter:
return
if config.adapter.type != "lora" or config.adapter.target_modules:
return
target_modules = ["qkv_proj", "o_proj", "gate_up_proj", "down_proj"]
logger.info(f"Setting adapter target modules to {target_modules} for Phi-3 base model with LoRA adapter.")
config.adapter.target_modules = target_modules
def _set_gemma_target_modules(config: "ModelConfig") -> None:
"""If the base model is Gemma, LoRA is enabled and the target modules are not set, set the target modules to
maximize performance."""
if config.base_model not in {"google/gemma-2b", "google/gemma-2b-it", "google/gemma-7b", "google/gemma-7b-it"}:
return
if not config.adapter:
return
if config.adapter.type != "lora" or config.adapter.target_modules:
return
target_modules = ["q_proj", "v_proj"]
config.adapter.target_modules = target_modules
@DeveloperAPI
def contains_grid_search_parameters(hyperopt_config: HyperoptConfigDict) -> bool:
"""Returns True if any hyperopt parameter in the config is using the grid_search space."""
for _, param_info in hyperopt_config[PARAMETERS].items():
if param_info.get(SPACE, None) == GRID_SEARCH:
return True
return False
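# Illustrative sketch (not part of Ludwig): how the grid-search check drives the default
# `num_samples` chosen in `set_hyperopt_defaults_`. The constant values below are assumed to match
# Ludwig's PARAMETERS, SPACE and GRID_SEARCH constants, and `default_num_samples` is a
# hypothetical helper written only for this example.

```python
PARAMETERS = "parameters"  # assumed value of ludwig.constants.PARAMETERS
SPACE = "space"  # assumed value of ludwig.constants.SPACE
GRID_SEARCH = "grid_search"  # assumed value of ludwig.constants.GRID_SEARCH


def contains_grid_search_parameters(hyperopt_config: dict) -> bool:
    """Return True if any hyperopt parameter uses the grid_search space."""
    return any(info.get(SPACE) == GRID_SEARCH for info in hyperopt_config[PARAMETERS].values())


def default_num_samples(hyperopt_config: dict) -> int:
    """Grid search enumerates every combination, so one sample avoids duplicate trials."""
    return 1 if contains_grid_search_parameters(hyperopt_config) else 10


grid_config = {PARAMETERS: {"trainer.learning_rate": {SPACE: GRID_SEARCH, "values": [0.001, 0.01]}}}
random_config = {PARAMETERS: {"trainer.learning_rate": {SPACE: "loguniform", "lower": 1e-4, "upper": 1e-1}}}
```

# Here `default_num_samples(grid_config)` gives 1 and `default_num_samples(random_config)` gives 10.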
================================================
FILE: ludwig/schema/optimizers.py
================================================
from abc import ABC
from dataclasses import field
from typing import ClassVar
import torch
try:
import bitsandbytes as bnb
except Exception:
bnb = None
import ludwig.schema.utils as schema_utils
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema.metadata import OPTIMIZER_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json, ParameterMetadata
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.registry import Registry
optimizer_registry = Registry()
@DeveloperAPI
def register_optimizer(name: str):
def wrap(optimizer_config: BaseOptimizerConfig):
optimizer_registry[name] = (optimizer_config.optimizer_class, optimizer_config)
return optimizer_config
return wrap
@DeveloperAPI
def get_optimizer_cls(name: str):
"""Get the optimizer schema class from the optimizer schema class registry."""
return optimizer_registry[name][1]
@DeveloperAPI
@ludwig_dataclass
class BaseOptimizerConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Base class for optimizers. Not meant to be used directly.
The dataclass format prevents arbitrary properties from being set. Consequently, in child classes, all properties
from the corresponding `torch.optim.Optimizer` class are copied over: see each class for the attributes whose
defaults differ from the torch-specified defaults.
"""
optimizer_class: ClassVar[torch.optim.Optimizer | None] = None
"Class variable pointing to the corresponding `torch.optim.Optimizer` class."
type: str
"""Name corresponding to an optimizer in `ludwig.modules.optimization_modules.optimizer_registry`.
Technically mutable, but attempting to load a derived optimizer with `type` set to a mismatched value will result in
a `ValidationError`.
"""
@property
def is_paged(self) -> bool:
"""Returns True if the optimizer is a Paged optimizer."""
return False
@property
def is_8bit(self) -> bool:
"""Returns True if the optimizer is an 8-bit optimizer."""
return False
@DeveloperAPI
@register_optimizer(name="sgd")
@ludwig_dataclass
class SGDOptimizerConfig(BaseOptimizerConfig):
"""Parameters for stochastic gradient descent."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.SGD
"""Points to `torch.optim.SGD`."""
type: str = schema_utils.ProtectedString("sgd")
"""Must be 'sgd' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry` (default:
'sgd')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD :
momentum: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Momentum factor.",
parameter_metadata=OPTIMIZER_METADATA["momentum"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Weight decay ($L2$ penalty).",
parameter_metadata=OPTIMIZER_METADATA["weight_decay"],
)
dampening: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Dampening for momentum.",
parameter_metadata=OPTIMIZER_METADATA["dampening"],
)
nesterov: bool = schema_utils.Boolean(
default=False,
description="Enables Nesterov momentum.",
parameter_metadata=OPTIMIZER_METADATA["nesterov"],
)
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="sgd_8bit")
@ludwig_dataclass
class SGD8BitOptimizerConfig(SGDOptimizerConfig):
"""Parameters for 8-bit stochastic gradient descent (bitsandbytes)."""
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.SGD8bit
type: str = schema_utils.ProtectedString("sgd_8bit")
block_wise: bool = schema_utils.Boolean(
default=False,
description="Whether to use block wise update.",
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="lbfgs")
@ludwig_dataclass
class LBFGSOptimizerConfig(BaseOptimizerConfig):
"""Parameters for L-BFGS optimization."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.LBFGS
"""Points to `torch.optim.LBFGS`."""
type: str = schema_utils.ProtectedString("lbfgs")
"""Must be 'lbfgs' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry` (default:
'lbfgs')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#torch.optim.LBFGS
max_iter: int = schema_utils.Integer(
default=20,
description="Maximum number of iterations per optimization step.",
parameter_metadata=OPTIMIZER_METADATA["max_iter"],
)
max_eval: int = schema_utils.Integer(
default=None,
allow_none=True,
description="Maximum number of function evaluations per optimization step. Default: `max_iter` * 1.25.",
parameter_metadata=OPTIMIZER_METADATA["max_eval"],
)
tolerance_grad: float = schema_utils.NonNegativeFloat(
default=1e-07,
description="Termination tolerance on first order optimality.",
parameter_metadata=OPTIMIZER_METADATA["tolerance_grad"],
)
tolerance_change: float = schema_utils.NonNegativeFloat(
default=1e-09,
description="Termination tolerance on function value/parameter changes.",
parameter_metadata=OPTIMIZER_METADATA["tolerance_change"],
)
history_size: int = schema_utils.Integer(
default=100, description="Update history size.", parameter_metadata=OPTIMIZER_METADATA["history_size"]
)
line_search_fn: str = schema_utils.StringOptions(
["strong_wolfe"],
default=None,
allow_none=True,
description="Line search function to use.",
parameter_metadata=OPTIMIZER_METADATA["line_search_fn"],
)
@DeveloperAPI
@register_optimizer(name="adam")
@ludwig_dataclass
class AdamOptimizerConfig(BaseOptimizerConfig):
"""Parameters for adam optimization."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.Adam
"""Points to `torch.optim.Adam`."""
type: str = schema_utils.ProtectedString("adam")
"""Must be 'adam' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry`
(default: 'adam')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam :
betas: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(0.9, 0.999),
description="Coefficients used for computing running averages of gradient and its square.",
parameter_metadata=OPTIMIZER_METADATA["betas"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-08,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0, description="Weight decay (L2 penalty).", parameter_metadata=OPTIMIZER_METADATA["weight_decay"]
)
amsgrad: bool = schema_utils.Boolean(
default=False,
description="Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam "
"and Beyond'.",
parameter_metadata=OPTIMIZER_METADATA["amsgrad"],
)
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="adam_8bit")
@ludwig_dataclass
class Adam8BitOptimizerConfig(AdamOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.Adam8bit
type: str = schema_utils.ProtectedString("adam_8bit")
block_wise: bool = schema_utils.Boolean(
default=True,
description="Whether to use block wise update.",
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="paged_adam")
@ludwig_dataclass
class PagedAdamOptimizerConfig(Adam8BitOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.PagedAdam
type: str = schema_utils.ProtectedString("paged_adam")
@property
def is_paged(self) -> bool:
return True
@property
def is_8bit(self) -> bool:
return False
@DeveloperAPI
@register_optimizer(name="paged_adam_8bit")
@ludwig_dataclass
class PagedAdam8BitOptimizerConfig(PagedAdamOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.PagedAdam8bit
type: str = schema_utils.ProtectedString("paged_adam_8bit")
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="adamw")
@ludwig_dataclass
class AdamWOptimizerConfig(BaseOptimizerConfig):
"""Parameters for adamw optimization."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.AdamW
"""Points to `torch.optim.AdamW`."""
type: str = schema_utils.ProtectedString("adamw")
"""Must be 'adamw' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry`
(default: 'adamw')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW :
betas: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(0.9, 0.999),
description="Coefficients used for computing running averages of gradient and its square.",
parameter_metadata=OPTIMIZER_METADATA["betas"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-08,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0, description="Weight decay ($L2$ penalty).", parameter_metadata=OPTIMIZER_METADATA["weight_decay"]
)
amsgrad: bool = schema_utils.Boolean(
default=False,
description="Whether to use the AMSGrad variant of this algorithm from the paper 'On the Convergence of Adam "
"and Beyond'.",
parameter_metadata=OPTIMIZER_METADATA["amsgrad"],
)
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="adamw_8bit")
@ludwig_dataclass
class AdamW8BitOptimizerConfig(AdamWOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.AdamW8bit
type: str = schema_utils.ProtectedString("adamw_8bit")
block_wise: bool = schema_utils.Boolean(
default=True,
description="Whether to use block wise update.",
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="paged_adamw")
@ludwig_dataclass
class PagedAdamWOptimizerConfig(AdamW8BitOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.PagedAdamW
type: str = schema_utils.ProtectedString("paged_adamw")
@property
def is_paged(self) -> bool:
return True
@property
def is_8bit(self) -> bool:
return False
@DeveloperAPI
@register_optimizer(name="paged_adamw_8bit")
@ludwig_dataclass
class PagedAdamW8BitOptimizerConfig(PagedAdamWOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.PagedAdamW8bit
type: str = schema_utils.ProtectedString("paged_adamw_8bit")
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="adadelta")
@ludwig_dataclass
class AdadeltaOptimizerConfig(BaseOptimizerConfig):
"""Parameters for adadelta optimization."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.Adadelta
"""Points to `torch.optim.Adadelta`."""
type: str = schema_utils.ProtectedString("adadelta")
"""Must be 'adadelta' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry`
(default: 'adadelta')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html#torch.optim.Adadelta :
rho: float = schema_utils.FloatRange(
default=0.9,
min=0,
max=1,
description="Coefficient used for computing a running average of squared gradients.",
parameter_metadata=OPTIMIZER_METADATA["rho"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-06,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0, description="Weight decay ($L2$ penalty).", parameter_metadata=OPTIMIZER_METADATA["weight_decay"]
)
@DeveloperAPI
@register_optimizer(name="adagrad")
@ludwig_dataclass
class AdagradOptimizerConfig(BaseOptimizerConfig):
"""Parameters for adagrad optimization."""
# Example docstring
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.Adagrad
"""Points to `torch.optim.Adagrad`."""
type: str = schema_utils.ProtectedString("adagrad")
"""Must be 'adagrad' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry`
(default: 'adagrad')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html#torch.optim.Adagrad :
initial_accumulator_value: float = schema_utils.NonNegativeFloat(
default=0, description="", parameter_metadata=OPTIMIZER_METADATA["initial_accumulator_value"]
)
lr_decay: float = schema_utils.FloatRange(
default=0, description="Learning rate decay.", parameter_metadata=OPTIMIZER_METADATA["lr_decay"]
)
weight_decay: float = schema_utils.FloatRange(
default=0, description="Weight decay ($L2$ penalty).", parameter_metadata=OPTIMIZER_METADATA["weight_decay"]
)
eps: float = schema_utils.FloatRange(
default=1e-10,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="adagrad_8bit")
@ludwig_dataclass
class Adagrad8BitOptimizerConfig(AdagradOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.Adagrad8bit
type: str = schema_utils.ProtectedString("adagrad_8bit")
block_wise: bool = schema_utils.Boolean(
default=True,
description="Whether to use block wise update.",
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="adamax")
@ludwig_dataclass
class AdamaxOptimizerConfig(BaseOptimizerConfig):
"""Parameters for adamax optimization."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.Adamax
"""Points to `torch.optim.Adamax`."""
type: str = schema_utils.ProtectedString("adamax")
"""Must be 'adamax' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry`
(default: 'adamax')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.Adamax.html#torch.optim.Adamax :
betas: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(0.9, 0.999),
description="Coefficients used for computing running averages of gradient and its square.",
parameter_metadata=OPTIMIZER_METADATA["betas"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-08,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0, description="Weight decay ($L2$ penalty).", parameter_metadata=OPTIMIZER_METADATA["weight_decay"]
)
# NOTE: keep the ftrl optimizer out of the registry (nadam is registered below):
# @register_optimizer(name="ftrl")
@DeveloperAPI
@ludwig_dataclass
class FtrlOptimizerConfig(BaseOptimizerConfig):
# optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.Ftrl
type: str = schema_utils.ProtectedString("ftrl")
learning_rate_power: float = schema_utils.FloatRange(
default=-0.5, max=0, parameter_metadata=OPTIMIZER_METADATA["learning_rate_power"]
)
initial_accumulator_value: float = schema_utils.NonNegativeFloat(
default=0.1, parameter_metadata=OPTIMIZER_METADATA["initial_accumulator_value"]
)
l1_regularization_strength: float = schema_utils.NonNegativeFloat(
default=0.0, parameter_metadata=OPTIMIZER_METADATA["l1_regularization_strength"]
)
l2_regularization_strength: float = schema_utils.NonNegativeFloat(
default=0.0, parameter_metadata=OPTIMIZER_METADATA["l2_regularization_strength"]
)
@DeveloperAPI
@register_optimizer(name="nadam")
@ludwig_dataclass
class NadamOptimizerConfig(BaseOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.NAdam
"""Points to `torch.optim.NAdam`."""
type: str = schema_utils.ProtectedString("nadam")
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.NAdam.html#torch.optim.NAdam :
betas: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(0.9, 0.999),
description="Coefficients used for computing running averages of gradient and its square.",
parameter_metadata=OPTIMIZER_METADATA["betas"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-08,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0, description="Weight decay ($L2$ penalty).", parameter_metadata=OPTIMIZER_METADATA["weight_decay"]
)
momentum_decay: float = schema_utils.NonNegativeFloat(
default=4e-3, description="Momentum decay.", parameter_metadata=OPTIMIZER_METADATA["momentum_decay"]
)
@DeveloperAPI
@register_optimizer(name="rmsprop")
@ludwig_dataclass
class RMSPropOptimizerConfig(BaseOptimizerConfig):
"""Parameters for rmsprop optimization."""
optimizer_class: ClassVar[torch.optim.Optimizer] = torch.optim.RMSprop
"""Points to `torch.optim.RMSprop`."""
type: str = schema_utils.ProtectedString("rmsprop")
"""Must be 'rmsprop' - corresponds to name in `ludwig.modules.optimization_modules.optimizer_registry`
(default: 'rmsprop')"""
# Defaults taken from https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html#torch.optim.RMSprop :
momentum: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Momentum factor.",
parameter_metadata=OPTIMIZER_METADATA["momentum"],
)
alpha: float = schema_utils.NonNegativeFloat(
default=0.99,
description="Smoothing constant.",
parameter_metadata=OPTIMIZER_METADATA["alpha"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-08,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
centered: bool = schema_utils.Boolean(
default=False,
description="If True, computes the centered RMSProp, and the gradient is normalized by an estimation of its "
"variance.",
parameter_metadata=OPTIMIZER_METADATA["centered"],
)
weight_decay: float = schema_utils.NonNegativeFloat(default=0.0, description="Weight decay ($L2$ penalty).")
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="rmsprop_8bit")
@ludwig_dataclass
class RMSProp8BitOptimizerConfig(RMSPropOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.RMSprop8bit
type: str = schema_utils.ProtectedString("rmsprop_8bit")
block_wise: bool = schema_utils.Boolean(
default=True,
description="Whether to use block wise update.",
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
@property
def is_8bit(self) -> bool:
return True
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="lamb")
@ludwig_dataclass
class LAMBOptimizerConfig(BaseOptimizerConfig):
"""Layer-wise Adaptive Moments optimizer for Batch training.
Paper: https://arxiv.org/pdf/1904.00962.pdf
"""
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.LAMB
type: str = schema_utils.ProtectedString("lamb")
bias_correction: bool = schema_utils.Boolean(
default=True,
)
betas: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(0.9, 0.999),
description="Coefficients used for computing running averages of gradient and its square.",
parameter_metadata=OPTIMIZER_METADATA["betas"],
)
eps: float = schema_utils.NonNegativeFloat(
default=1e-08,
description="Term added to the denominator to improve numerical stability.",
parameter_metadata=OPTIMIZER_METADATA["eps"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Weight decay (L2 penalty).",
parameter_metadata=OPTIMIZER_METADATA["weight_decay"],
)
amsgrad: bool = schema_utils.Boolean(
default=False,
description=(
"Whether to use the AMSGrad variant of this algorithm from the paper "
"'On the Convergence of Adam and Beyond'."
),
parameter_metadata=OPTIMIZER_METADATA["amsgrad"],
)
adam_w_mode: bool = schema_utils.Boolean(
default=True,
description="Whether to use the AdamW mode of this algorithm from the paper "
"'Decoupled Weight Decay Regularization'.",
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
block_wise: bool = schema_utils.Boolean(
default=False,
description="Whether to use block wise update.",
)
max_unorm: float = schema_utils.FloatRange(
default=1.0,
min=0.0,
max=1.0,
)
@DeveloperAPI
@register_optimizer(name="lamb_8bit")
@ludwig_dataclass
class LAMB8BitOptimizerConfig(LAMBOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.LAMB8bit
type: str = schema_utils.ProtectedString("lamb_8bit")
@property
def is_8bit(self) -> bool:
return True
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="lars")
@ludwig_dataclass
class LARSOptimizerConfig(BaseOptimizerConfig):
"""Layerwise Adaptive Rate Scaling.
Paper: https://arxiv.org/pdf/1708.03888.pdf
"""
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.LARS
type: str = schema_utils.ProtectedString("lars")
# 0.9 taken from the original paper - momentum requires a non zero value
# https://arxiv.org/pdf/1708.03888v3.pdf
momentum: float = schema_utils.FloatRange(
default=0.9,
min=0.0,
max=1.0,
min_inclusive=False,
description="Momentum factor.",
parameter_metadata=OPTIMIZER_METADATA["momentum"],
)
dampening: float = schema_utils.FloatRange(
default=0.0,
min=0.0,
max=1.0,
description="Dampening for momentum.",
parameter_metadata=OPTIMIZER_METADATA["dampening"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Weight decay (L2 penalty).",
parameter_metadata=OPTIMIZER_METADATA["weight_decay"],
)
nesterov: bool = schema_utils.Boolean(
default=False,
description="Enables Nesterov momentum.",
parameter_metadata=OPTIMIZER_METADATA["nesterov"],
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
max_unorm: float = schema_utils.FloatRange(
default=1.0,
min=0.0,
max=1.0,
)
@DeveloperAPI
@register_optimizer(name="lars_8bit")
@ludwig_dataclass
class LARS8BitOptimizerConfig(LARSOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.LARS8bit
type: str = schema_utils.ProtectedString("lars_8bit")
@property
def is_8bit(self) -> bool:
return True
if bnb is not None:
@DeveloperAPI
@register_optimizer(name="lion")
@ludwig_dataclass
class LIONOptimizerConfig(BaseOptimizerConfig):
"""Evolved Sign Momentum.
Paper: https://arxiv.org/pdf/2302.06675.pdf
"""
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.Lion
type: str = schema_utils.ProtectedString("lion")
betas: tuple[float, float] = schema_utils.FloatRangeTupleDataclassField(
default=(0.9, 0.999),
description="Coefficients used for computing running averages of gradient and its square.",
parameter_metadata=OPTIMIZER_METADATA["betas"],
)
weight_decay: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Weight decay (L2 penalty).",
parameter_metadata=OPTIMIZER_METADATA["weight_decay"],
)
percentile_clipping: int = schema_utils.IntegerRange(
default=100,
min=0,
max=100,
description="Percentile clipping.",
)
block_wise: bool = schema_utils.Boolean(
default=True,
description="Whether to use block wise update.",
)
@DeveloperAPI
@register_optimizer(name="lion_8bit")
@ludwig_dataclass
class LION8BitOptimizerConfig(LIONOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.Lion8bit
type: str = schema_utils.ProtectedString("lion_8bit")
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="paged_lion")
@ludwig_dataclass
class PagedLionOptimizerConfig(LIONOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.PagedLion
type: str = schema_utils.ProtectedString("paged_lion")
@property
def is_paged(self) -> bool:
return True
@DeveloperAPI
@register_optimizer(name="paged_lion_8bit")
@ludwig_dataclass
class PagedLion8BitOptimizerConfig(PagedLionOptimizerConfig):
optimizer_class: ClassVar[torch.optim.Optimizer] = bnb.optim.PagedLion8bit
type: str = schema_utils.ProtectedString("paged_lion_8bit")
@property
def is_8bit(self) -> bool:
return True
@DeveloperAPI
def get_optimizer_conds():
"""Returns a JSON schema of conditionals to validate against optimizer types defined in
`ludwig.modules.optimization_modules.optimizer_registry`."""
conds = []
for optimizer in optimizer_registry:
optimizer_cls = optimizer_registry[optimizer][1]
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(optimizer_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
preproc_cond = schema_utils.create_cond(
{"type": optimizer},
other_props,
)
conds.append(preproc_cond)
return conds
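# Illustration (not part of Ludwig): `get_optimizer_conds` above builds one conditional per registered optimizer, so
# that the properties being validated depend on the selected `type`. The sketch below shows that pattern on a toy
# registry. The helper name `create_cond_sketch`, the toy registry, and the `if`/`then` layout are assumptions made
# for illustration; the actual output of `schema_utils.create_cond` is defined elsewhere and may differ.

```python
def create_cond_sketch(if_pred: dict, then_props: dict) -> dict:
    """Hypothetical stand-in for `schema_utils.create_cond` (assumed shape)."""
    return {
        # Match only when every key in `if_pred` has exactly the given value.
        "if": {"properties": {key: {"const": value} for key, value in if_pred.items()}},
        # When matched, validate the remaining properties against this schema.
        "then": {"properties": then_props},
    }


def build_conds(registry: dict) -> list:
    """Builds one conditional per registered type, as the loop above does."""
    return [create_cond_sketch({"type": name}, props) for name, props in registry.items()]


# Toy registry: type name -> JSON schema properties for that type (illustrative values only).
toy_registry = {
    "rmsprop": {"alpha": {"type": "number", "default": 0.99}},
    "lion": {"block_wise": {"type": "boolean", "default": True}},
}

conds = build_conds(toy_registry)
```

# The resulting list would be placed under `allOf`, so a config with `type: rmsprop` is only checked against the
# rmsprop properties.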
@DeveloperAPI
def OptimizerDataclassField(default="adam", description="", parameter_metadata: ParameterMetadata = None):
"""Custom dataclass field that when used inside of a dataclass will allow any optimizer in
`ludwig.modules.optimization_modules.optimizer_registry`.
Sets default optimizer to 'adam'.
:param default: Dict specifying an optimizer with a `type` field and its associated parameters. Will attempt to use
`type` to load optimizer from registry with given params. (default: {"type": "adam"}).
:return: Initialized dataclass field that converts untyped dicts with params to optimizer dataclass instances.
"""
class OptimizerSelection(schema_utils.TypeSelection):
"""Custom marshmallow field that deserializes a dict to a valid optimizer from
`ludwig.modules.optimization_modules.optimizer_registry` and creates a corresponding `oneOf` JSON schema
for external usage."""
def __init__(self):
super().__init__(
registry=optimizer_registry,
default_value=default,
description=description,
parameter_metadata=parameter_metadata,
)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return get_optimizer_cls(key)
def _jsonschema_type_mapping(self):
# Note that this uses the same conditional pattern as combiners:
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": list(optimizer_registry.keys()),
"default": default,
"description": "The type of optimizer to use during the learning process",
},
},
"title": "optimizer_options",
"allOf": get_optimizer_conds(),
"required": ["type"],
"description": description,
}
return OptimizerSelection().get_default_field()
@DeveloperAPI
@ludwig_dataclass
class GradientClippingConfig(schema_utils.BaseMarshmallowConfig):
"""Dataclass that holds gradient clipping parameters."""
clipglobalnorm: float | None = schema_utils.FloatRange(
default=0.5,
allow_none=True,
description="Maximum allowed norm of the gradients",
parameter_metadata=OPTIMIZER_METADATA["gradient_clipping"],
)
# TODO(travis): is this redundant with `clipglobalnorm`?
clipnorm: float | None = schema_utils.FloatRange(
default=None,
allow_none=True,
description="Maximum allowed norm of the gradients",
parameter_metadata=OPTIMIZER_METADATA["gradient_clipping"],
)
clipvalue: float | None = schema_utils.FloatRange(
default=None,
allow_none=True,
description="Maximum allowed value of the gradients",
parameter_metadata=OPTIMIZER_METADATA["gradient_clipping"],
)
@DeveloperAPI
def GradientClippingDataclassField(description: str, default: dict = {}):
"""Returns custom dataclass field for `ludwig.modules.optimization_modules.GradientClippingConfig`. Allows
`None` by default.
:param description: Description of the gradient dataclass field
:param default: dict that specifies clipping param values that will be loaded by its schema class (default: {}).
"""
allow_none = True
class GradientClippingMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field class for gradient clipping.
Deserializes a dict to a valid instance of `ludwig.modules.optimization_modules.GradientClippingConfig` and
creates a corresponding JSON schema for external usage.
"""
def _deserialize(self, value, attr, data, **kwargs):
if value is None:
return value
if isinstance(value, dict):
try:
return GradientClippingConfig.Schema().load(value)
except (TypeError, ConfigValidationError):
raise ConfigValidationError(
f"Invalid params for gradient clipping: {value}, see GradientClippingConfig class."
)
raise ConfigValidationError("Field should be None or dict")
def _jsonschema_type_mapping(self):
return {
"oneOf": [
{"type": "null", "title": "disabled", "description": "Disable gradient clipping."},
{
**schema_utils.unload_jsonschema_from_marshmallow_class(GradientClippingConfig),
"title": "enabled_options",
},
],
"title": "gradient_clipping_options",
"description": description,
}
if not isinstance(default, dict):
raise ConfigValidationError(f"Invalid default: `{default}`")
def load_default():
return GradientClippingConfig.Schema().load(default)
dump_default = GradientClippingConfig.Schema().dump(default)
return field(
metadata={
"marshmallow_field": GradientClippingMarshmallowField(
allow_none=allow_none,
load_default=load_default,
dump_default=dump_default,
metadata={
"description": description,
"parameter_metadata": convert_metadata_to_json(OPTIMIZER_METADATA["gradient_clipping"]),
},
)
},
default_factory=load_default,
)
================================================
FILE: ludwig/schema/preprocessing.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import RANDOM
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import PREPROCESSING_METADATA
from ludwig.schema.split import BaseSplitConfig, SplitDataclassField
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class PreprocessingConfig(schema_utils.BaseMarshmallowConfig):
"""Global preprocessing config is a dataclass that configures the parameters used for global preprocessing."""
sample_ratio: float = schema_utils.NonNegativeFloat(
default=1.0,
description="The ratio of the dataset to use. For instance, if 0.5, half of the dataset "
"provided will be used.",
parameter_metadata=PREPROCESSING_METADATA["sample_ratio"],
)
sample_size: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="The maximum number of samples from the dataset to use. Cannot be set if sample_ratio is set to be "
"< 1.0. If sample_ratio is set to 1.0, this will override the number of samples to use.",
parameter_metadata=PREPROCESSING_METADATA["sample_size"],
)
oversample_minority: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="If not None, the minority class will be oversampled to reach the specified ratio relative to "
"the majority class.",
parameter_metadata=PREPROCESSING_METADATA["oversample_minority"],
)
undersample_majority: float = schema_utils.NonNegativeFloat(
default=None,
allow_none=True,
description="If not None, the majority class will be undersampled to reach the specified ratio relative "
"to the minority class.",
parameter_metadata=PREPROCESSING_METADATA["undersample_majority"],
)
split: BaseSplitConfig = SplitDataclassField(
default=RANDOM,
)
global_max_sequence_length: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Specifically for LLMs. This is the maximum length of the input sequence going into the model's "
"forward pass during training. Sequences will be truncated to this length after merging inputs and targets. "
"If not set, the total length of the merged input and target token sequences will be used.",
parameter_metadata=PREPROCESSING_METADATA["global_max_sequence_length"],
)
@DeveloperAPI
class PreprocessingField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(PreprocessingConfig)
================================================
FILE: ludwig/schema/profiler.py
================================================
from dataclasses import field
import ludwig.schema.utils as schema_utils
from ludwig.api_annotations import DeveloperAPI
from ludwig.error import ConfigValidationError
from ludwig.schema.utils import ludwig_dataclass
@DeveloperAPI
@ludwig_dataclass
class ProfilerConfig(schema_utils.BaseMarshmallowConfig):
"""Dataclass that holds profiling parameters for torch profile scheduler.
The profiler will skip the first skip_first steps, then wait for wait steps, then do the warmup for the next warmup
steps, then do the active recording for the next active steps and then repeat the cycle starting with wait steps.
The optional number of cycles is specified with the repeat parameter, the zero value means that the cycles will
continue until the profiling is finished.
"""
wait: int = schema_utils.IntegerRange(
default=1,
min=0,
description="The number of steps to wait before profiling.",
)
warmup: int = schema_utils.IntegerRange(
default=1,
min=0,
description="The number of steps for profiler warmup after waiting finishes.",
)
active: int = schema_utils.IntegerRange(
default=3,
min=0,
description="The number of steps that are actively recorded. Values above 10 will dramatically slow down "
"tensorboard loading.",
)
repeat: int = schema_utils.IntegerRange(
default=5,
min=0,
description="The optional number of profiling cycles. Use 0 to profile the entire training run.",
)
skip_first: int = schema_utils.IntegerRange(
default=0,
min=0,
max=100,
description="The number of steps to skip in the beginning of training.",
)
@DeveloperAPI
def ProfilerDataclassField(description: str, default: dict = {}):
"""Returns custom dataclass field for `ludwig.schema.profiler.ProfilerConfig`. Allows `None` by default.
:param description: Description of the torch profiler field
:param default: dict that specifies profiler param values that will be loaded by its schema class (default: {}).
"""
allow_none = True
class ProfilingMarshmallowField(schema_utils.LudwigSchemaField):
"""Custom field class for the torch profiler.
Deserializes a dict to a valid instance of `ludwig.schema.profiler.ProfilerConfig` and
creates a corresponding JSON schema for external usage.
"""
def _deserialize(self, value, attr, data, **kwargs):
if value is None:
return value
if isinstance(value, dict):
try:
return ProfilerConfig.Schema().load(value)
except (TypeError, ConfigValidationError):
raise ConfigValidationError(
f"Invalid params for profiling config: {value}, see ProfilerConfig class."
)
raise ConfigValidationError("Field should be None or dict")
def _jsonschema_type_mapping(self):
return {
**schema_utils.unload_jsonschema_from_marshmallow_class(ProfilerConfig),
"title": "profiler_options",
"description": description,
}
if not isinstance(default, dict):
raise ConfigValidationError(f"Invalid default: `{default}`")
def load_default():
return ProfilerConfig.Schema().load(default)
dump_default = ProfilerConfig.Schema().dump(default)
return field(
metadata={
"marshmallow_field": ProfilingMarshmallowField(
allow_none=allow_none,
load_default=load_default,
dump_default=dump_default,
metadata={
"description": description,
"parameter_metadata": None,
},
)
},
default_factory=load_default,
)
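# Illustration (not part of Ludwig): the `ProfilerConfig` docstring describes a cycle of skip_first, wait, warmup and
# active steps that repeats `repeat` times. The sketch below maps a step index to its phase under that description.
# The function name `profiler_phase` and the phase labels are hypothetical; this is a simplified model of the
# schedule, not the torch profiler implementation.

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 1, active: int = 3, repeat: int = 5,
                   skip_first: int = 0) -> str:
    """Returns the phase label for a 0-based training step (defaults mirror ProfilerConfig)."""
    if step < skip_first:
        return "skip"
    step -= skip_first
    cycle_len = wait + warmup + active
    if cycle_len == 0:
        # Degenerate config: nothing to wait for, warm up, or record.
        return "done"
    # repeat == 0 means the cycles continue for the whole run.
    if repeat > 0 and step // cycle_len >= repeat:
        return "done"
    pos = step % cycle_len
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


# With the defaults, one cycle is 5 steps long and 5 cycles cover steps 0..24.
first_cycle = [profiler_phase(s) for s in range(6)]
```

# Under these assumptions, `active * repeat` steps are recorded in total (15 with the defaults).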
================================================
FILE: ludwig/schema/split.py
================================================
from dataclasses import Field
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import SPLIT, TYPE
from ludwig.schema import utils as schema_utils
from ludwig.schema.metadata import PREPROCESSING_METADATA
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.registry import Registry
split_config_registry = Registry()
DEFAULT_PROBABILITIES = [0.7, 0.1, 0.2]
@DeveloperAPI
def get_split_cls(name: str):
return split_config_registry[name]
@DeveloperAPI
@ludwig_dataclass
class BaseSplitConfig(schema_utils.BaseMarshmallowConfig):
"""This Dataclass is a base schema for the nested split config under preprocessing."""
type: str
"Name corresponding to the splitting type."
@DeveloperAPI
@split_config_registry.register("random")
@ludwig_dataclass
class RandomSplitConfig(BaseSplitConfig):
"""This Dataclass generates a schema for the random splitting config."""
type: str = schema_utils.ProtectedString(
"random",
description="Type of splitting to use during preprocessing.",
)
probabilities: list = schema_utils.List(
list_type=float,
default=DEFAULT_PROBABILITIES,
description="Probabilities for splitting data into train, validation, and test sets.",
parameter_metadata=PREPROCESSING_METADATA["split_probabilities"],
)
@DeveloperAPI
@split_config_registry.register("fixed")
@ludwig_dataclass
class FixedSplitConfig(BaseSplitConfig):
"""This Dataclass generates a schema for the fixed splitting config."""
type: str = schema_utils.ProtectedString(
"fixed",
description="Type of splitting to use during preprocessing.",
)
column: str = schema_utils.String(
default=SPLIT,
allow_none=False,
description="The column name to use for fixed splitting.",
parameter_metadata=PREPROCESSING_METADATA["column"],
)
@DeveloperAPI
@split_config_registry.register("stratify")
@ludwig_dataclass
class StratifySplitConfig(BaseSplitConfig):
"""This Dataclass generates a schema for the stratified splitting config."""
type: str = schema_utils.ProtectedString(
"stratify",
description="Type of splitting to use during preprocessing.",
)
column: str = schema_utils.String(
default=None,
allow_none=True,
description="The column name to base the stratified splitting on.",
parameter_metadata=PREPROCESSING_METADATA["column"],
)
probabilities: list = schema_utils.List(
list_type=float,
default=DEFAULT_PROBABILITIES,
description="Probabilities for splitting data into train, validation, and test sets.",
parameter_metadata=PREPROCESSING_METADATA["split_probabilities"],
)
@DeveloperAPI
@split_config_registry.register("datetime")
@ludwig_dataclass
class DateTimeSplitConfig(BaseSplitConfig):
"""This Dataclass generates a schema for the datetime splitting config."""
type: str = schema_utils.ProtectedString(
"datetime",
description="Type of splitting to use during preprocessing.",
)
column: str = schema_utils.String(
default=None,
allow_none=True,
description="The column name to perform datetime splitting on.",
parameter_metadata=PREPROCESSING_METADATA["column"],
)
probabilities: list = schema_utils.List(
list_type=float,
default=DEFAULT_PROBABILITIES,
description="Proportion of data to split into train, validation, and test sets.",
parameter_metadata=PREPROCESSING_METADATA["split_probabilities"],
)
@DeveloperAPI
@split_config_registry.register("hash")
@ludwig_dataclass
class HashSplitConfig(BaseSplitConfig):
"""This Dataclass generates a schema for the hash splitting config.
This is useful for deterministically splitting on a unique ID. Even when additional rows are added to the dataset in
the future, each ID will retain its original split assignment.
This approach does not guarantee that the split proportions will be assigned exactly, but the larger the dataset,
the more closely the assignment should match the given proportions.
This approach can be used on a column with duplicates, but it will further skew the assignments of rows to splits.
"""
type: str = schema_utils.ProtectedString(
"hash",
description="Type of splitting to use during preprocessing.",
)
column: str = schema_utils.String(
default=None,
allow_none=True,
description="The column name to perform hash splitting on.",
parameter_metadata=PREPROCESSING_METADATA["column"],
)
probabilities: list = schema_utils.List(
list_type=float,
default=DEFAULT_PROBABILITIES,
description="Proportion of data to split into train, validation, and test sets.",
parameter_metadata=PREPROCESSING_METADATA["split_probabilities"],
)
@DeveloperAPI
def get_split_conds():
"""Returns a JSON schema of conditionals to validate against split types defined in
`ludwig.schema.split.split_config_registry`."""
conds = []
for splitter in split_config_registry.data:
splitter_cls = split_config_registry.data[splitter]
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(splitter_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props, [TYPE])
splitter_cond = schema_utils.create_cond(
{"type": splitter},
other_props,
)
conds.append(splitter_cond)
return conds
@DeveloperAPI
def SplitDataclassField(default: str) -> Field:
"""Custom dataclass field that when used inside a dataclass will allow the user to specify a nested split
config.
Returns: Initialized dataclass field that converts an untyped dict with params to a split config.
"""
class SplitSelection(schema_utils.TypeSelection):
def __init__(self):
super().__init__(registry=split_config_registry.data, default_value=default)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return split_config_registry.data[key]
def _jsonschema_type_mapping(self):
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"description": "Type of splitting to use during preprocessing.",
"enum": list(split_config_registry.data.keys()),
"default": default,
},
},
"title": "split_options",
"allOf": get_split_conds(),
}
return SplitSelection().get_default_field()
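# Illustration (not part of Ludwig): the `HashSplitConfig` docstring says each ID keeps its split assignment even
# when rows are added later, and that proportions are only approximate. The sketch below shows one way to get that
# behavior. The function name `hash_split` and the choice of SHA-256 are assumptions for illustration; Ludwig's own
# splitter may hash differently.

```python
import hashlib


def hash_split(key: str, probabilities=(0.7, 0.1, 0.2)) -> int:
    """Maps a key to a split index (0=train, 1=validation, 2=test) using only the key itself."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the digest as a number in [0, 1).
    u = int(digest[:8], 16) / 2**32
    cumulative = 0.0
    for split_idx, probability in enumerate(probabilities):
        cumulative += probability
        if u < cumulative:
            return split_idx
    # Guard against floating point rounding when the probabilities sum to just under 1.
    return len(probabilities) - 1


# The assignment depends only on the ID, so it is stable across dataset versions.
counts = [0, 0, 0]
for i in range(1000):
    counts[hash_split(f"user-{i}")] += 1
```

# Duplicate IDs always land in the same split, which is why a column with many duplicates skews the row proportions.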
================================================
FILE: ludwig/schema/trainer.py
================================================
import re
from abc import ABC
import torch
from packaging.version import parse as parse_version
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUTO,
EFFECTIVE_BATCH_SIZE,
LOSS,
MAX_BATCH_SIZE,
MAX_POSSIBLE_BATCH_SIZE,
MODEL_ECD,
MODEL_LLM,
TRAINING,
)
from ludwig.error import ConfigValidationError
from ludwig.schema import utils as schema_utils
from ludwig.schema.lr_scheduler import LRSchedulerConfig, LRSchedulerDataclassField
from ludwig.schema.metadata import TRAINER_METADATA
from ludwig.schema.optimizers import (
BaseOptimizerConfig,
GradientClippingConfig,
GradientClippingDataclassField,
OptimizerDataclassField,
)
from ludwig.schema.profiler import ProfilerConfig, ProfilerDataclassField
from ludwig.schema.utils import ludwig_dataclass
from ludwig.utils.registry import Registry
_torch_200 = parse_version(torch.__version__) >= parse_version("2.0")
trainer_schema_registry = Registry()
_llm_trainer_schema_registry = Registry()
@DeveloperAPI
def register_trainer_schema(model_type: str):
def wrap(trainer_config: BaseTrainerConfig):
trainer_schema_registry[model_type] = trainer_config
return trainer_config
return wrap
@DeveloperAPI
def register_llm_trainer_schema(trainer_type: str):
def wrap(trainer_config: BaseTrainerConfig):
_llm_trainer_schema_registry[trainer_type] = trainer_config
return trainer_config
return wrap
@DeveloperAPI
def get_llm_trainer_cls(trainer_type: str):
"""Returns the LLM trainer config class registered with the given name."""
return _llm_trainer_schema_registry[trainer_type]
@DeveloperAPI
@ludwig_dataclass
class BaseTrainerConfig(schema_utils.BaseMarshmallowConfig, ABC):
"""Common trainer parameter values."""
validation_field: str = schema_utils.String(
default=None,
allow_none=True,
description="The field for which the `validation_metric` is used for validation-related mechanics like early "
"stopping, parameter change plateaus, as well as what hyperparameter optimization uses to determine the best "
"trial. If unset (default), the first output feature is used. If explicitly specified, neither "
"`validation_field` nor `validation_metric` are overwritten.",
)
validation_metric: str = schema_utils.String(
default=None,
allow_none=True,
description=(
"Metric from `validation_field` that is used. If validation_field is not explicitly specified, this is "
"overwritten to be the first output feature type's `default_validation_metric`, consistent with "
"validation_field. If the validation_metric is specified, then we will use the first output feature that "
"produces this metric as the `validation_field`."
),
)
early_stop: int = schema_utils.IntegerRange(
default=5,
min=-1,
description=(
"Number of consecutive rounds of evaluation without any improvement on the `validation_metric` that "
"triggers training to stop. Can be set to -1, which disables early stopping entirely."
),
)
skip_all_evaluation: bool = schema_utils.Boolean(
default=False,
description=(
"Whether to skip evaluation entirely. If you are training a model with a well-known configuration on a "
"well-known dataset and are confident about the expected results, you might skip all evaluation. Moreover, "
"evaluating a model, especially on large validation or test sets, can be time-consuming."
),
)
enable_profiling: bool = schema_utils.Boolean(
default=False,
description="Whether to enable profiling of the training process using torch.profiler.profile.",
)
profiler: ProfilerConfig | None = ProfilerDataclassField(
description="Parameter values for profiling config.",
default={},
)
def can_tune_batch_size(self) -> bool:
return True
@DeveloperAPI
@register_trainer_schema(MODEL_ECD)
@ludwig_dataclass
class ECDTrainerConfig(BaseTrainerConfig):
"""Dataclass that configures most of the hyperparameters used for ECD model training."""
def __post_init__(self):
if self.compile and not _torch_200:
raise ConfigValidationError(
"Trainer param `compile: true` requires PyTorch 2.0.0 or higher. Please upgrade PyTorch and try again."
)
if self.effective_batch_size != AUTO and self.max_batch_size < self.effective_batch_size:
raise ConfigValidationError(
f"`max_batch_size` ({self.max_batch_size}) must be greater than or equal to "
f"`effective_batch_size` ({self.effective_batch_size})."
)
if self.effective_batch_size != AUTO and self.batch_size != AUTO:
if self.effective_batch_size < self.batch_size:
raise ConfigValidationError(
f"`effective_batch_size` ({self.effective_batch_size}) "
f"must be greater than or equal to `batch_size` ({self.batch_size})."
)
if self.effective_batch_size % self.batch_size != 0:
raise ConfigValidationError(
f"`effective_batch_size` ({self.effective_batch_size}) "
f"must be divisible by `batch_size` ({self.batch_size})."
)
if self.effective_batch_size != AUTO and self.gradient_accumulation_steps != AUTO:
if self.effective_batch_size < self.gradient_accumulation_steps:
raise ConfigValidationError(
f"`effective_batch_size` ({self.effective_batch_size}) must be greater than or equal to "
f"`gradient_accumulation_steps` ({self.gradient_accumulation_steps})."
)
if self.effective_batch_size % self.gradient_accumulation_steps != 0:
raise ConfigValidationError(
f"`effective_batch_size` ({self.effective_batch_size}) must be divisible by "
f"`gradient_accumulation_steps` ({self.gradient_accumulation_steps})."
)
if self.layers_to_freeze_regex:
try:
re.compile(self.layers_to_freeze_regex)
except re.error:
raise ConfigValidationError(
f"`layers_to_freeze_regex` ({self.layers_to_freeze_regex}) must be a valid regular expression."
)
learning_rate: float | str = schema_utils.OneOfOptionsField(
default=0.001,
allow_none=False,
description=(
"Controls how much to change the model in response to the estimated error each time the model weights are "
"updated. If 'auto', the optimal learning rate is estimated by choosing the learning rate that produces "
"the smallest non-diverging gradient update."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate"],
field_options=[
schema_utils.FloatRange(default=0.001, allow_none=False, min=0, max=1),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
learning_rate_scheduler: LRSchedulerConfig = LRSchedulerDataclassField(
description="Parameter values for learning rate scheduler.",
default=None,
)
epochs: int = schema_utils.PositiveInteger(
default=100,
description="Number of epochs the algorithm is intended to be run over. Overridden if `train_steps` is set",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["epochs"],
)
checkpoints_per_epoch: int = schema_utils.NonNegativeInteger(
default=0,
description=(
"Number of checkpoints per epoch. For example, 2 -> checkpoints are written every half of an epoch. Note "
"that it is invalid to specify both non-zero `steps_per_checkpoint` and non-zero `checkpoints_per_epoch`."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["checkpoints_per_epoch"],
)
train_steps: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description=(
"Maximum number of training steps the algorithm is intended to be run over. Unset by default. "
"If set, will override `epochs` and if left unset then `epochs` is used to determine training length."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["train_steps"],
)
eval_steps: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="The number of steps to use for evaluation. If None, the entire evaluation set will be used.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["eval_steps"],
)
steps_per_checkpoint: int = schema_utils.NonNegativeInteger(
default=0,
description=(
"How often the model is checkpointed. Also dictates maximum evaluation frequency. If 0 the model is "
"checkpointed after every epoch."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["steps_per_checkpoint"],
)
effective_batch_size: int | str = schema_utils.OneOfOptionsField(
default=AUTO,
allow_none=False,
description=(
"The effective batch size is the total number of samples used to compute a single gradient update "
"to the model weights. This differs from `batch_size` by taking `gradient_accumulation_steps` and number "
"of training worker processes into account. In practice, "
"`effective_batch_size = batch_size * gradient_accumulation_steps * num_workers`. "
"If 'auto', the effective batch size is derived implicitly from `batch_size`, but if set explicitly, then "
"one of `batch_size` or `gradient_accumulation_steps` must be set to something other than 'auto', and "
"consequently will be set following the formula given above."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD][EFFECTIVE_BATCH_SIZE],
field_options=[
schema_utils.PositiveInteger(default=128, description="", allow_none=False),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
batch_size: int | str = schema_utils.OneOfOptionsField(
default=AUTO,
allow_none=False,
description=(
"The number of training examples utilized in one training step of the model. If 'auto', the "
"batch size that maximizes training throughput (samples / sec) will be used. For CPU training, the "
"tuned batch size is capped at 128 as throughput benefits of large batch sizes are less noticeable without "
"a GPU."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["batch_size"],
field_options=[
schema_utils.PositiveInteger(default=128, description="", allow_none=False),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
max_batch_size: int = schema_utils.PositiveInteger(
default=MAX_POSSIBLE_BATCH_SIZE,
allow_none=True,
description=(
"Auto batch size tuning and increasing batch size on plateau will be capped at this value. The default "
"value is 2^40."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD][MAX_BATCH_SIZE],
)
gradient_accumulation_steps: int | str = schema_utils.OneOfOptionsField(
default=AUTO,
allow_none=False,
description="Number of steps to accumulate gradients over before performing a weight update.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["gradient_accumulation_steps"],
field_options=[
schema_utils.PositiveInteger(default=1, description="", allow_none=False),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
early_stop: int = schema_utils.IntegerRange(
default=5,
min=-1,
description=(
"Number of consecutive rounds of evaluation without any improvement on the `validation_metric` that "
"triggers training to stop. Can be set to -1, which disables early stopping entirely."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["early_stop"],
)
eval_batch_size: None | int | str = schema_utils.OneOfOptionsField(
default=None,
allow_none=True,
description=(
"Size of batch to pass to the model for evaluation. If it is `0` or `None`, the same value of `batch_size` "
"is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough "
"memory is available. If 'auto', the biggest batch size (power of 2) that can fit in memory will be used."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["eval_batch_size"],
field_options=[
schema_utils.PositiveInteger(default=128, description="", allow_none=False),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
evaluate_training_set: bool = schema_utils.Boolean(
default=False,
description=(
"Whether to evaluate on the entire training set during evaluation. By default, training metrics will be "
"computed at the end of each training step, and accumulated up to the evaluation phase. In practice, "
"computing training set metrics during training is up to 30% faster than running a separate evaluation "
"pass over the training set, but results in more noisy training metrics, particularly during the earlier "
"epochs. It's recommended to only set this to True if you need very exact training set metrics, and are "
"willing to pay a significant performance penalty for them."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["evaluate_training_set"],
)
validation_field: str = schema_utils.String(
default=None,
allow_none=True,
description="The field for which the `validation_metric` is used for validation-related mechanics like early "
"stopping, parameter change plateaus, as well as what hyperparameter optimization uses to determine the best "
"trial. If unset (default), the first output feature is used. If explicitly specified, neither "
"`validation_field` nor `validation_metric` are overwritten.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["validation_field"],
)
validation_metric: str = schema_utils.String(
default=None,
allow_none=True,
description=(
"Metric from `validation_field` that is used. If validation_field is not explicitly specified, this is "
"overwritten to be the first output feature type's `default_validation_metric`, consistent with "
"validation_field. If the validation_metric is specified, then we will use the first output feature that "
"produces this metric as the `validation_field`."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["validation_metric"],
)
optimizer: BaseOptimizerConfig = OptimizerDataclassField(
default="adam",
description=(
"Optimizer type and its parameters. The optimizer is responsible for applying the gradients computed "
"from the loss during backpropagation as updates to the model weights."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["optimizer"],
)
regularization_type: str | None = schema_utils.RegularizerOptions(
default="l2",
allow_none=True,
description="Type of regularization.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["regularization_type"],
)
regularization_lambda: float = schema_utils.FloatRange(
default=0.0,
min=0,
max=1,
description="Strength of the regularization.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["regularization_lambda"],
)
should_shuffle: bool = schema_utils.Boolean(
default=True,
description="Whether to shuffle batches during training.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["should_shuffle"],
)
increase_batch_size_on_plateau: int = schema_utils.NonNegativeInteger(
default=0,
description="The number of times to increase the batch size on a plateau.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["increase_batch_size_on_plateau"],
)
increase_batch_size_on_plateau_patience: int = schema_utils.NonNegativeInteger(
default=5,
description="How many epochs to wait before increasing the batch size.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["increase_batch_size_on_plateau_patience"],
)
increase_batch_size_on_plateau_rate: float = schema_utils.NonNegativeFloat(
default=2.0,
description="Rate at which the batch size increases.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["increase_batch_size_on_plateau_rate"],
)
increase_batch_size_eval_metric: str = schema_utils.String(
default=LOSS,
description="Which metric to listen on for increasing the batch size.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["increase_batch_size_eval_metric"],
)
increase_batch_size_eval_split: str = schema_utils.String(
default=TRAINING,
description="Which dataset split to listen on for increasing the batch size.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["increase_batch_size_eval_split"],
)
gradient_clipping: GradientClippingConfig | None = GradientClippingDataclassField(
description="Parameter values for gradient clipping.",
default={},
)
learning_rate_scaling: str = schema_utils.StringOptions(
["constant", "sqrt", "linear"],
default="linear",
description="Scale by which to increase the learning rate as the number of distributed workers increases. "
"Traditionally the learning rate is scaled linearly with the number of workers to reflect the "
"proportion by which the effective batch size is increased. For very large batch sizes, a softer "
"square-root scale can sometimes lead to better model performance. If the learning rate is hand-tuned "
"for a given number of workers, setting this value to constant can be used to disable scale-up.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate_scaling"],
)
bucketing_field: str = schema_utils.String(
default=None,
allow_none=True,
description="Feature to use for bucketing datapoints.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["bucketing_field"],
)
use_mixed_precision: bool = schema_utils.Boolean(
default=False,
description="Enable automatic mixed-precision (AMP) during training.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["use_mixed_precision"],
)
compile: bool = schema_utils.Boolean(
default=False,
description="Whether to compile the model before training.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["compile"],
)
enable_gradient_checkpointing: bool = schema_utils.Boolean(
default=False,
description="Whether to enable gradient checkpointing, which trades compute for memory. "
"This is useful for training very deep models with limited memory.",
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["enable_gradient_checkpointing"],
)
layers_to_freeze_regex: str = schema_utils.String(
default=None,
allow_none=True,
description=(
"Freeze specific layers based on the provided regex. Freezing specific layers can improve a "
"pretrained model's performance in a number of ways. At a basic level, freezing early layers can "
"prevent overfitting by retaining more general features (beneficial for small datasets). It can also "
"reduce computational resource use and lower overall training time due to fewer gradient calculations."
),
)
def update_batch_size_grad_accum(self, num_workers: int):
from ludwig.utils.trainer_utils import get_rendered_batch_size_grad_accum
self.batch_size, self.gradient_accumulation_steps = get_rendered_batch_size_grad_accum(self, num_workers)
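# Illustrative sketch (not part of Ludwig): the `effective_batch_size` description above gives the relation
# `effective_batch_size = batch_size * gradient_accumulation_steps * num_workers`. The helpers below are
# hypothetical and only work through that arithmetic; they are not how `get_rendered_batch_size_grad_accum`
# is implemented. Rejecting non-divisible inputs is an assumption made for this example.

```python
def solve_effective_batch_size(batch_size: int, gradient_accumulation_steps: int, num_workers: int) -> int:
    """Hypothetical helper: apply the formula from the field description directly."""
    return batch_size * gradient_accumulation_steps * num_workers


def solve_gradient_accumulation_steps(effective_batch_size: int, batch_size: int, num_workers: int) -> int:
    """Hypothetical helper: solve the same formula for gradient_accumulation_steps."""
    samples_per_step = batch_size * num_workers
    # Assumption for this sketch: refuse inputs that do not divide evenly.
    if effective_batch_size % samples_per_step != 0:
        raise ValueError(
            f"effective_batch_size={effective_batch_size} is not a multiple of "
            f"batch_size * num_workers = {samples_per_step}"
        )
    return effective_batch_size // samples_per_step


# With 2 workers and a per-worker batch size of 16, each step sees 32 samples,
# so reaching an effective batch size of 128 takes 4 accumulation steps.
steps = solve_gradient_accumulation_steps(effective_batch_size=128, batch_size=16, num_workers=2)
total = solve_effective_batch_size(batch_size=16, gradient_accumulation_steps=steps, num_workers=2)
```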
@DeveloperAPI
@ludwig_dataclass
class LLMTrainerConfig(BaseTrainerConfig):
"""Base class for all LLM trainer configs."""
learning_rate: float | str = schema_utils.OneOfOptionsField(
default=0.0002,
allow_none=False,
description=(
"Controls how much to change the model in response to the estimated error each time the model weights are "
"updated. If 'auto', the optimal learning rate is estimated by choosing the learning rate that produces "
"the smallest non-diverging gradient update."
),
parameter_metadata=TRAINER_METADATA[MODEL_ECD]["learning_rate"],
field_options=[
schema_utils.FloatRange(default=0.001, allow_none=False, min=0, max=1),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
batch_size: int = schema_utils.PositiveInteger(
default=1,
description="Batch size used for training in the LLM trainer.",
)
base_learning_rate: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Base learning rate used for training in the LLM trainer.",
)
should_shuffle: bool = schema_utils.Boolean(
default=True,
description="Whether to shuffle the training data in the LLM trainer.",
)
epochs: int = schema_utils.PositiveInteger(
default=3,
description="Number of epochs to train in the LLM trainer.",
)
train_steps: int = schema_utils.PositiveInteger(
default=None,
allow_none=True,
description="Number of training steps to train in the LLM trainer.",
)
eval_steps: int = schema_utils.NonNegativeInteger(
default=None,
allow_none=True,
description="The number of steps to evaluate in the LLM trainer.",
)
steps_per_checkpoint: int = schema_utils.NonNegativeInteger(
default=0,
description="Number of steps per checkpoint in the LLM trainer.",
)
checkpoints_per_epoch: int = schema_utils.NonNegativeInteger(
default=0,
description="Number of checkpoints per epoch in the LLM trainer.",
)
early_stop: int = schema_utils.IntegerRange(
default=-1,
min=-1,
description=(
"Number of consecutive rounds of evaluation without any improvement on the `validation_metric` that "
"triggers training to stop. Can be set to -1, which disables early stopping entirely."
),
)
eval_batch_size: int = schema_utils.PositiveInteger(
default=2,
description="Batch size used for evaluation in the LLM trainer.",
)
evaluate_training_set: bool = schema_utils.Boolean(
default=False,
description="Whether to evaluate the training set in the LLM trainer. Note: this operation may be slow.",
)
@DeveloperAPI
@register_llm_trainer_schema("none")
@ludwig_dataclass
class NoneTrainerConfig(LLMTrainerConfig):
"""Dataclass that configures most of the hyperparameters used for zero-shot / few-shot LLM model training."""
# Required for lookup during trainer initialization
type: str = schema_utils.ProtectedString(
"none",
description="The type of trainer used to train the model. ",
parameter_metadata=TRAINER_METADATA[MODEL_LLM]["type"],
)
def can_tune_batch_size(self) -> bool:
return False
@DeveloperAPI
@register_llm_trainer_schema("finetune")
@ludwig_dataclass
class FineTuneTrainerConfig(ECDTrainerConfig):
"""Dataclass that configures most of the hyperparameters used for fine-tuning LLM model training."""
# Required for lookup during trainer initialization
type: str = schema_utils.ProtectedString("finetune")
base_learning_rate: float = schema_utils.NonNegativeFloat(
default=0.0,
description="Base learning rate used for training in the LLM trainer.",
)
batch_size: int | str | None = schema_utils.OneOfOptionsField(
default=1,
allow_none=False,
description=(
"The number of training examples utilized in one training step of the model. If `auto`, the "
"batch size that maximizes training throughput (samples / sec) will be used."
),
field_options=[
schema_utils.PositiveInteger(default=1, description="", allow_none=False),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
eval_batch_size: int | str | None = schema_utils.OneOfOptionsField(
default=2,
allow_none=True,
description=(
"Size of batch to pass to the model for evaluation. If it is `0` or `None`, the same value of `batch_size` "
"is used. This is useful to speed up evaluation with a much bigger batch size than training, if enough "
"memory is available. If `auto`, the biggest batch size (power of 2) that can fit in memory will be used."
),
field_options=[
schema_utils.PositiveInteger(default=2, description="", allow_none=False),
schema_utils.StringOptions(options=["auto"], default="auto", allow_none=False),
],
)
@DeveloperAPI
def get_model_type_jsonschema(model_type: str = MODEL_ECD):
if model_type == MODEL_LLM:
enum = [MODEL_LLM]
else:
enum = [MODEL_ECD]
return {
"type": "string",
"enum": enum,
"default": MODEL_ECD,
"title": "model_type",
"description": "Select the model type.",
}
@DeveloperAPI
def get_trainer_jsonschema(model_type: str):
trainer_cls = trainer_schema_registry[model_type]
props = schema_utils.unload_jsonschema_from_marshmallow_class(trainer_cls)["properties"]
return {
"type": "object",
"properties": props,
"title": "trainer_options",
"additionalProperties": False,
"description": "Schema for trainer determined by Model Type",
}
@DeveloperAPI
class ECDTrainerField(schema_utils.DictMarshmallowField):
def __init__(self):
super().__init__(ECDTrainerConfig)
def _jsonschema_type_mapping(self):
return get_trainer_jsonschema(MODEL_ECD)
@DeveloperAPI
def get_llm_trainer_conds():
"""Returns a list of JSON schema conditionals to validate against LLM trainer types."""
conds = []
for trainer in _llm_trainer_schema_registry:
trainer_cls = _llm_trainer_schema_registry[trainer]
other_props = schema_utils.unload_jsonschema_from_marshmallow_class(trainer_cls)["properties"]
schema_utils.remove_duplicate_fields(other_props)
preproc_cond = schema_utils.create_cond(
{"type": trainer},
other_props,
)
conds.append(preproc_cond)
return conds
@DeveloperAPI
def LLMTrainerDataclassField(default="none", description=""):
class LLMTrainerSelection(schema_utils.TypeSelection):
def __init__(self):
super().__init__(
registry=_llm_trainer_schema_registry,
default_value=default,
description=description,
)
def get_schema_from_registry(self, key: str) -> type[schema_utils.BaseMarshmallowConfig]:
return get_llm_trainer_cls(key)
def _jsonschema_type_mapping(self):
return {
"type": "object",
"properties": {
"type": {
"type": "string",
"enum": list(_llm_trainer_schema_registry.keys()),
"default": default,
"description": "The type of LLM trainer to use",
},
},
"title": "llm_trainer_options",
"allOf": get_llm_trainer_conds(),
"required": ["type"],
"description": description,
}
return LLMTrainerSelection().get_default_field()
================================================
FILE: ludwig/schema/utils.py
================================================
"""Ludwig schema utilities - pydantic 2 based.
This module provides the foundation for Ludwig's declarative config system.
All config classes inherit from BaseMarshmallowConfig (a pydantic BaseModel)
and use field factory functions (String, Integer, Float, etc.) that return
pydantic Field() objects.
"""
import copy
import logging
import os
import warnings
from abc import ABC, abstractmethod
from functools import lru_cache
from typing import Any
import yaml
from pydantic import BaseModel, ConfigDict, Field, model_validator
from pydantic import ValidationError as PydanticValidationError
from pydantic.fields import FieldInfo
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ACTIVE, COLUMN, LUDWIG_SCHEMA_VALIDATION_POLICY, NAME, PROC_COLUMN, TYPE
from ludwig.error import ConfigValidationError
from ludwig.modules.reduction_modules import reduce_mode_registry
from ludwig.schema.metadata import COMMON_METADATA
from ludwig.schema.metadata.parameter_metadata import convert_metadata_to_json, ParameterMetadata
from ludwig.utils.misc_utils import scrub_creds
from ludwig.utils.registry import Registry
from ludwig.utils.torch_utils import activations, initializer_registry
# ============================================================================
# LudwigSchemaField - base class replacing marshmallow fields.Field
# ============================================================================
class LudwigSchemaField:
"""Plain Python base class for Ludwig schema fields.
Replaces marshmallow fields.Field as the base for TypeSelection, DictMarshmallowField (NestedConfigField), and all
custom field classes. The contract (get_default_field, _jsonschema_type_mapping, _deserialize) stays identical.
"""
def __init__(self, **kwargs):
# Store all keyword arguments as attributes for backward compat
for k, v in kwargs.items():
setattr(self, k, v)
def get_default_field(self) -> FieldInfo:
"""Create a pydantic FieldInfo for this field.
Override in subclasses.
"""
return Field(default=None)
def _jsonschema_type_mapping(self):
"""Return a JSON schema dict for this field.
Override in subclasses.
"""
return None
def _deserialize(self, value, attr, data, **kwargs):
"""Deserialize a raw value.
Override in subclasses.
"""
return value
logger = logging.getLogger(__name__)
RECURSION_STOP_ENUM = {"weights_initializer", "bias_initializer", "norm_params"}
def ludwig_dataclass(cls):
"""No-op decorator.
Config classes now inherit directly from BaseMarshmallowConfig (pydantic BaseModel).
"""
return cls
# TODO: Change to RAISE and update descriptions once we want to enforce strict schemas.
LUDWIG_SCHEMA_VALIDATION_POLICY_VAR = os.environ.get(LUDWIG_SCHEMA_VALIDATION_POLICY, "exclude").lower()
class _SchemaAdapter:
"""Adapts pydantic model to marshmallow-like Schema interface for backward compatibility.
This allows existing code that calls cls.Schema().load(data), cls.Schema().dump(data), and cls.Schema().fields to
continue working.
"""
def __init__(self, cls):
self._cls = cls
def __call__(self):
"""Allow Schema()() pattern (double-call)."""
return self
def load(self, data):
"""Validate and create a config instance from a dict."""
return self._cls.model_validate(data)
def dump(self, data):
"""Serialize a config instance or dict to a plain dict."""
if isinstance(data, BaseMarshmallowConfig):
return data.to_dict()
if isinstance(data, dict):
try:
instance = self._cls.model_validate(data)
return instance.to_dict()
except Exception:
return data
return data
@property
def fields(self):
"""Return field info dict (pydantic model_fields)."""
return self._cls.model_fields
# Sentinel for TypeSelection and DictMarshmallowField metadata markers
class _TypeSelectionMarker:
"""Marker stored in Field.metadata to indicate this field uses TypeSelection dispatch."""
def __init__(self, type_selection):
self.type_selection = type_selection
class _NestedConfigMarker:
"""Marker stored in Field.metadata to indicate this field uses DictMarshmallowField dispatch."""
def __init__(self, cls, allow_none=True):
self.cls = cls
self.allow_none = allow_none
ConfigT = Any # TypeVar("ConfigT", bound="BaseMarshmallowConfig")
def _convert_dataclass_field_to_pydantic(dc_field) -> FieldInfo:
"""Convert a dataclasses.Field to a pydantic FieldInfo.
This is the bridge that allows old marshmallow-style field definitions
(using dataclasses.field(metadata={"marshmallow_field": ...})) to work
with pydantic BaseModel classes during the migration period.
"""
import dataclasses as _dc
metadata_list = []
marshmallow_field = None
# Extract marshmallow_field from metadata
if dc_field.metadata:
marshmallow_field = dc_field.metadata.get("marshmallow_field")
if marshmallow_field is not None:
# Store as a marker so model_validator can use it for dispatch
if isinstance(marshmallow_field, TypeSelection):
metadata_list.append(_TypeSelectionMarker(marshmallow_field))
elif isinstance(marshmallow_field, DictMarshmallowField):
# Check if the subclass overrides _jsonschema_type_mapping
has_custom_schema = (
type(marshmallow_field)._jsonschema_type_mapping
is not DictMarshmallowField._jsonschema_type_mapping
)
if has_custom_schema:
# Store as MarshmallowFieldMarker to preserve custom JSON schema generation
metadata_list.append(_MarshmallowFieldMarker(marshmallow_field))
else:
metadata_list.append(_NestedConfigMarker(marshmallow_field.cls, marshmallow_field.allow_none))
else:
# Generic marshmallow field - store for reference
metadata_list.append(_MarshmallowFieldMarker(marshmallow_field))
# Extract default and create FieldInfo.
# Note: pydantic 2's Field() does not accept a `metadata` kwarg — set it on the FieldInfo after creation.
if dc_field.default is not _dc.MISSING:
fi = Field(default=dc_field.default)
elif dc_field.default_factory is not _dc.MISSING:
fi = Field(default_factory=dc_field.default_factory)
else:
# No default - this is a required field
fi = Field()
if metadata_list:
fi.metadata = metadata_list
return fi
class _MarshmallowFieldMarker:
"""Stores a marshmallow field for backward compat during migration."""
def __init__(self, marshmallow_field):
self.marshmallow_field = marshmallow_field
class _LudwigModelMeta(type(BaseModel)):
"""Metaclass that bridges marshmallow-dataclass patterns to pydantic 2.
Handles two key behaviors:
1. Converts dataclasses.Field objects to pydantic FieldInfo in __new__
2. Allows class-level access to field defaults via __getattr__
"""
def __new__(mcs, name, bases, namespace, **kwargs):
import dataclasses as _dc
annotations = namespace.get("__annotations__", {})
# Detect @property definitions and prevent pydantic from treating them as field defaults.
# Properties that don't shadow inherited fields work fine as-is because pydantic
# only processes annotated attributes. Properties that DO shadow inherited fields
# should be converted to fields with constant defaults instead (done at the schema
# class level, not here).
_saved_properties: dict[str, property] = {}
for attr_name, value in list(namespace.items()):
if isinstance(value, property) and attr_name in annotations:
# A property in this class's own annotations would confuse pydantic.
# Remove it from annotations (it won't become a field).
_saved_properties[attr_name] = value
del namespace[attr_name]
annotations.pop(attr_name, None)
# Convert dataclass field() objects and marshmallow field descriptors to pydantic Field()
for attr_name in list(annotations.keys()):
if attr_name in namespace:
value = namespace[attr_name]
if isinstance(value, _dc.Field):
namespace[attr_name] = _convert_dataclass_field_to_pydantic(value)
elif isinstance(value, LudwigSchemaField) and hasattr(value, "get_default_field"):
# TypeSelection and DictMarshmallowField instances need conversion
namespace[attr_name] = value.get_default_field()
# Auto-widen annotations to bridge marshmallow→pydantic gap.
# In marshmallow, annotations were decorative. In pydantic, they're enforced.
import types
import typing
for attr_name, ann in list(annotations.items()):
# Skip ClassVar annotations
origin = getattr(ann, "__origin__", None)
if origin is typing.ClassVar:
continue
if attr_name not in namespace:
continue
value = namespace[attr_name]
# For fields with markers (TypeSelection/DictMarshmallowField/MarshmallowField),
# set annotation to Any since the actual validation happens in the marker
if isinstance(value, FieldInfo):
jse = getattr(value, "json_schema_extra", None)
has_marker = False
if isinstance(jse, dict) and "metadata" in jse:
has_marker = any(
isinstance(m, (_TypeSelectionMarker, _NestedConfigMarker, _MarshmallowFieldMarker))
for m in jse["metadata"]
)
for meta in getattr(value, "metadata", None) or []:
if isinstance(meta, (_TypeSelectionMarker, _NestedConfigMarker, _MarshmallowFieldMarker)):
has_marker = True
break
if has_marker:
annotations[attr_name] = Any
continue
# Widen to include None if default is None or enum contains None
from pydantic_core import PydanticUndefined
should_widen = value.default is None and value.default is not PydanticUndefined
if not should_widen:
# Also widen if the enum (from allow_none=True in StringOptions etc.) contains None
jse_enum = (jse or {}).get("enum") if isinstance(jse, dict) else None
if isinstance(jse_enum, list) and None in jse_enum:
should_widen = True
if not should_widen:
# Also widen if allow_none=True was explicitly set in the field factory
if isinstance(jse, dict) and jse.get("allow_none"):
should_widen = True
if should_widen:
is_union = origin in (types.UnionType,)
try:
is_union = is_union or origin is typing.Union
except (AttributeError, TypeError):
pass
has_none = False
if is_union:
has_none = type(None) in getattr(ann, "__args__", ())
if not has_none:
try:
annotations[attr_name] = ann | None
except TypeError:
pass
elif value is None:
# Plain None default
try:
annotations[attr_name] = ann | None
except TypeError:
pass
namespace["__annotations__"] = annotations
import warnings
with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="Field name .* shadows an attribute in parent")
cls = super().__new__(mcs, name, bases, namespace, **kwargs)
# Restore @property descriptors that we removed from namespace.
if _saved_properties:
for pname, prop in _saved_properties.items():
setattr(cls, pname, prop)
return cls
def __getattr__(cls, name: str) -> Any:
"""Allow accessing field defaults as class attributes (e.g., cls.type)."""
for klass in cls.__mro__:
pf = vars(klass).get("__pydantic_fields__")
if pf is not None and isinstance(pf, dict) and name in pf:
field_info = pf[name]
from pydantic_core import PydanticUndefined
if field_info.default is not PydanticUndefined:
return field_info.default
break
raise AttributeError(name)
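# Illustrative sketch (not part of Ludwig): the `__getattr__` above lets callers read a field default straight
# off the class (e.g. `cls.type`). The stdlib-only example below shows the same metaclass technique without
# pydantic. `DefaultsMeta`, `_field_defaults`, and `_MISSING` are hypothetical names standing in for
# `_LudwigModelMeta`, `__pydantic_fields__`, and `PydanticUndefined`.

```python
_MISSING = object()  # stand-in for "this field has no default"


class DefaultsMeta(type):
    """Hypothetical metaclass exposing per-field defaults as class attributes."""

    def __getattr__(cls, name):
        # Python calls this only after normal class attribute lookup has failed.
        for klass in cls.__mro__:
            fields = vars(klass).get("_field_defaults")
            if fields is not None and name in fields:
                default = fields[name]
                if default is not _MISSING:
                    return default
                break  # field found but it has no default: stop searching
        raise AttributeError(name)


class BaseConfig(metaclass=DefaultsMeta):
    _field_defaults = {"type": "base", "required_field": _MISSING}


class ChildConfig(BaseConfig):
    _field_defaults = {"type": "child"}


child_type = ChildConfig.type  # resolved from ChildConfig's own defaults
base_type = BaseConfig.type  # resolved from BaseConfig's defaults
has_required = hasattr(ChildConfig, "required_field")  # no default, so lookup fails
```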
@DeveloperAPI
class BaseMarshmallowConfig(BaseModel, metaclass=_LudwigModelMeta):
"""Base pydantic model for all Ludwig config classes.
Maintains backward-compatible API (from_dict, to_dict, Schema, etc.) while using pydantic 2 internally for
validation and serialization.
"""
model_config = ConfigDict(
extra="ignore" if LUDWIG_SCHEMA_VALIDATION_POLICY_VAR == "exclude" else "forbid",
arbitrary_types_allowed=True,
validate_default=False,
revalidate_instances="never",
populate_by_name=True,
strict=False,
)
@model_validator(mode="before")
@classmethod
def _pre_validate(cls, data: Any) -> Any:
"""Pre-validation: log deprecation warnings, resolve TypeSelection/nested fields."""
if not isinstance(data, dict):
return data
# Log deprecation warnings for unknown fields
valid_fields = set(cls.model_fields.keys())
for key in list(data.keys()):
if key not in valid_fields and key != "type":
warnings.warn(
f'"{key}" is not a valid parameter for the "{cls.__name__}" schema, will be flagged '
"as an error in a future version",
DeprecationWarning,
)
# Resolve TypeSelection, DictMarshmallowField, and legacy marshmallow fields
for fname, finfo in cls.model_fields.items():
if fname not in data:
continue
value = data[fname]
# Get markers from both metadata and json_schema_extra
markers = list(finfo.metadata or [])
jse = finfo.json_schema_extra
if isinstance(jse, dict) and "metadata" in jse:
markers.extend(jse["metadata"])
for meta in markers:
if isinstance(meta, _TypeSelectionMarker):
data[fname] = meta.type_selection.resolve(value)
break
elif isinstance(meta, _NestedConfigMarker):
if isinstance(value, BaseMarshmallowConfig):
break # Already a config instance, skip re-validation
if isinstance(value, dict):
try:
data[fname] = meta.cls.model_validate(value)
except Exception as e:
raise ConfigValidationError(
f"Invalid params: {value}, see `{meta.cls}` definition. Error: {e}"
) from e
break
elif isinstance(meta, _MarshmallowFieldMarker):
# Legacy marshmallow field - use its _deserialize for validation
# Skip if value is already a config instance (avoid double-validation)
if isinstance(value, BaseMarshmallowConfig):
break
mfield = meta.marshmallow_field
if hasattr(mfield, "_deserialize") and value is not None:
try:
data[fname] = mfield._deserialize(value, fname, data)
except Exception as e:
# Re-raise ConfigValidationError (raised by __post_init__ or
# _deserialize) rather than swallowing it
if isinstance(e, ConfigValidationError):
raise
pass # Let pydantic handle other validation errors
break
return data
@model_validator(mode="after")
def _validate_field_constraints(self):
"""Post-validation: enforce enum constraints stored in json_schema_extra."""
for fname, finfo in type(self).model_fields.items():
value = getattr(self, fname, None)
extra = finfo.json_schema_extra
if not isinstance(extra, dict):
continue
# Validate enum constraints (from StringOptions, IntegerOptions)
if "enum" in extra and value is not None:
allowed = extra["enum"]
if value not in allowed:
raise ValueError(f"Field '{fname}': value {value!r} not in allowed options {allowed}")
# Validate float tuple range constraints
if "_float_tuple_range" in extra and value is not None:
spec = extra["_float_tuple_range"]
if not isinstance(value, (tuple, list)) or len(value) != spec["n"]:
raise ValueError(f"Field '{fname}': expected {spec['n']}-tuple, got {value!r}")
for v in value:
if spec.get("min") is not None and v < spec["min"]:
raise ValueError(f"Field '{fname}': value {v} below minimum {spec['min']}")
if spec.get("max") is not None and v > spec["max"]:
raise ValueError(f"Field '{fname}': value {v} above maximum {spec['max']}")
# Validate embed field (int or str from options)
if "_embed_options" in extra and value is not None:
embed_options = extra["_embed_options"]
if isinstance(value, str) and value not in embed_options:
raise ValueError(f"Field '{fname}': string value {value!r} not in {embed_options}")
if not isinstance(value, (str, int)):
raise ValueError(f"Field '{fname}': expected str, int, or None, got {type(value).__name__}")
# Validate initializer_or_dict field
if "_initializer_options" in extra and value is not None:
init_options = extra["_initializer_options"]
if isinstance(value, str) and value not in init_options:
raise ValueError(f"Field '{fname}': initializer {value!r} not in {init_options}")
if isinstance(value, dict):
if "type" not in value:
raise ValueError(f"Field '{fname}': dict must contain 'type' key")
if value["type"] not in init_options:
raise ValueError(f"Field '{fname}': initializer type {value['type']!r} not in {init_options}")
if not isinstance(value, (str, dict)):
raise ValueError(f"Field '{fname}': expected str or dict, got {type(value).__name__}")
return self
def __setattr__(self, name: str, value: Any) -> None:
"""Allow setting arbitrary attributes on config instances.
Ludwig code dynamically sets attributes like saved_weights_in_checkpoint, proc_column, etc. on config objects.
Pydantic 2 normally rejects setting attributes not defined as fields, so we override to allow it.
"""
try:
super().__setattr__(name, value)
except ValueError:
# Attribute not in model fields - allow it anyway (dataclass behavior)
object.__setattr__(self, name, value)
def model_post_init(self, __context: Any) -> None:
"""Bridge: call __post_init__ if defined by subclass (dataclass convention)."""
super().model_post_init(__context)
# Check if THIS class (or a parent) defines __post_init__
post_init = getattr(type(self), "__post_init__", None)
if post_init is not None:
post_init(self)
def to_dict(self) -> dict[str, Any]:
"""Get a dictionary representation of this config.
Recursively converts nested config objects and scrubs credentials.
"""
return scrub_creds(convert_submodules(vars(self)))
@classmethod
def from_dict(cls, d: dict[str, Any]) -> "BaseMarshmallowConfig":
"""Create a config instance from a dictionary."""
return cls.model_validate(d)
@classmethod
@lru_cache(maxsize=None)
def get_valid_field_names(cls) -> set[str]:
"""Return the set of valid field names for this config class."""
return set(cls.model_fields.keys())
@classmethod
@lru_cache(maxsize=None)
def get_class_schema(cls):
"""Return a schema adapter for backward compatibility.
Returns an object exposing .load(), .dump(), and a .fields property.
"""
return _SchemaAdapter(cls)
@classmethod
def Schema(cls):
"""Backward compatibility: return a schema adapter with .load(), .dump(), .fields."""
return _SchemaAdapter(cls)
def __repr__(self):
return yaml.dump(self.to_dict(), sort_keys=False)
@DeveloperAPI
def get_marshmallow_field_class_name(field_info):
"""Returns a human-readable string of the field class name.
For backward compat, checks both pydantic metadata and marshmallow_field.
"""
# Check for marshmallow_field in metadata (legacy)
if hasattr(field_info, "metadata"):
for meta in field_info.metadata or []:
if hasattr(meta, "__class__"):
return meta.__class__.__name__
# For pydantic FieldInfo, return the annotation name
if hasattr(field_info, "annotation"):
return str(field_info.annotation)
return "Unknown"
@DeveloperAPI
def load_config(cls: type["BaseMarshmallowConfig"], **kwargs) -> "BaseMarshmallowConfig":
"""Takes a config class and instantiates it with the given keyword args as parameters."""
assert_is_a_marshmallow_class(cls)
return cls.model_validate(kwargs)
@DeveloperAPI
def load_trainer_with_kwargs(model_type: str, kwargs: dict) -> tuple["BaseMarshmallowConfig", dict[str, Any]]:
"""Special case of `load_config_with_kwargs` for the trainer schemas."""
from ludwig.constants import MODEL_LLM
from ludwig.schema.trainer import ECDTrainerConfig, LLMTrainerConfig
if model_type == MODEL_LLM:
trainer_schema = LLMTrainerConfig
else:
trainer_schema = ECDTrainerConfig
return load_config_with_kwargs(trainer_schema, kwargs)
@DeveloperAPI
def load_config_with_kwargs(
cls: type["BaseMarshmallowConfig"], kwargs_overrides
) -> tuple["BaseMarshmallowConfig", dict[str, Any]]:
"""Instantiates a config class filtering kwargs to only valid fields.
Returns a tuple of (config, remaining_kwargs).
"""
assert_is_a_marshmallow_class(cls)
fields = cls.model_fields.keys()
return load_config(cls, **{k: v for k, v in kwargs_overrides.items() if k in fields}), {
k: v for k, v in kwargs_overrides.items() if k not in fields
}
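# Illustration only: a minimal, stdlib-only sketch of the "split kwargs into valid fields and
# leftovers" pattern used by load_config_with_kwargs above. `TrainerSketch` and `split_kwargs`
# are hypothetical names, and a dataclass stands in for the pydantic config class, so this is
# not Ludwig's actual implementation.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class TrainerSketch:
    # Hypothetical stand-in for a config class; not a Ludwig class.
    epochs: int = 10
    learning_rate: float = 0.001


def split_kwargs(cls, overrides: dict[str, Any]) -> tuple[Any, dict[str, Any]]:
    """Build `cls` from the keys it declares and return the unused keys separately."""
    valid = {f.name for f in fields(cls)}
    known = {k: v for k, v in overrides.items() if k in valid}
    rest = {k: v for k, v in overrides.items() if k not in valid}
    return cls(**known), rest


# "gpus" is not a field of TrainerSketch, so it ends up in the leftovers.
config, remaining = split_kwargs(TrainerSketch, {"epochs": 3, "gpus": [0]})
```

# Returning the leftovers instead of raising lets a caller route unknown keys elsewhere.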
@DeveloperAPI
def convert_submodules(config_dict: dict) -> dict[str, Any]:
    """Helper for converting submodules to dictionaries during config serialization."""
    output_dict = copy.deepcopy(config_dict)
    for k, v in output_dict.items():
        if isinstance(v, dict):
            # convert_submodules returns a converted copy and does not mutate `v`,
            # so the result has to be stored back.
            output_dict[k] = convert_submodules(v)
        elif isinstance(v, BaseMarshmallowConfig):
            # to_dict() already converts nested submodules recursively.
            output_dict[k] = v.to_dict()
        elif isinstance(v, list):
            output_dict[k] = [x.to_dict() if isinstance(x, BaseMarshmallowConfig) else x for x in v]
        elif isinstance(v, ListSerializable):
            output_dict[k] = v.to_list()
    return output_dict
@DeveloperAPI
def create_cond(if_pred: dict, then_pred: dict):
"""Returns a JSONSchema conditional for the given if-then predicates."""
return {
"if": {"properties": {k: {"const": v} for k, v in if_pred.items()}},
"then": {"properties": then_pred},
}
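# Illustration only: what the if/then fragment produced by create_cond looks like. The function
# body is copied under the hypothetical name `create_cond_sketch` so the snippet runs on its own;
# the "adam" type and the "lr" property are made-up inputs, not taken from a Ludwig schema.

```python
import json


def create_cond_sketch(if_pred: dict, then_pred: dict) -> dict:
    """Same shape as create_cond: match on constants, then constrain properties."""
    return {
        "if": {"properties": {k: {"const": v} for k, v in if_pred.items()}},
        "then": {"properties": then_pred},
    }


# When "type" equals "adam", the "lr" property must be a number.
cond = create_cond_sketch({"type": "adam"}, {"lr": {"type": "number"}})
print(json.dumps(cond, indent=2))
```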
@DeveloperAPI
def remove_duplicate_fields(properties: dict, fields: list[str] | None = None) -> None:
"""Util function for removing duplicated schema elements."""
duplicate_fields = [NAME, TYPE, COLUMN, PROC_COLUMN, ACTIVE] if fields is None else fields
for key in duplicate_fields:
if key in properties:
del properties[key]
@DeveloperAPI
class ListSerializable(ABC):
@abstractmethod
def to_list(self) -> list:
pass
@DeveloperAPI
def assert_is_a_marshmallow_class(cls):
"""Assert that cls is a Ludwig config class (pydantic BaseModel)."""
assert issubclass(
cls, BaseMarshmallowConfig
), f"Expected a Ludwig config class (BaseMarshmallowConfig subclass), but `{cls}` is not."
def _default_matches_json_type(default_val, type_str) -> bool:
"""Check if a default value is consistent with a JSON schema type string.
Returns True if the default value matches the type string, False otherwise. This avoids emitting
'type': 'integer' when the default is 7.5 (a float), a common pattern in the marshmallow era, when type
enforcement was looser.
"""
if isinstance(type_str, list):
# Union type like ["integer", "null"]
return any(_default_matches_json_type(default_val, t) for t in type_str)
_CHECKS = {
"string": lambda v: isinstance(v, str),
"integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
"number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
"boolean": lambda v: isinstance(v, bool),
"object": lambda v: isinstance(v, dict),
"array": lambda v: isinstance(v, (list, tuple)),
"null": lambda v: v is None,
}
check = _CHECKS.get(type_str)
if check is None:
return True # Unknown type, don't block
return check(default_val)
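# Illustration only: why the "integer" and "number" checks above exclude bool explicitly.
# In Python, bool is a subclass of int, so a plain isinstance(value, int) would accept True.
# `matches_integer` and `matches_number` are hypothetical helper names for this sketch.

```python
def matches_integer(value) -> bool:
    """JSON schema 'integer': an int that is not a bool."""
    return isinstance(value, int) and not isinstance(value, bool)


def matches_number(value) -> bool:
    """JSON schema 'number': an int or float that is not a bool."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


results = {
    "isinstance(True, int)": isinstance(True, int),
    "matches_integer(True)": matches_integer(True),
    "matches_integer(7)": matches_integer(7),
    "matches_integer(7.5)": matches_integer(7.5),
    "matches_number(7.5)": matches_number(7.5),
}
for label, outcome in results.items():
    print(f"{label}: {outcome}")
```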
def _field_info_to_jsonschema(fname: str, finfo: FieldInfo, annotation: type | None = None) -> dict:
"""Convert a pydantic FieldInfo to a JSON schema fragment.
Checks metadata markers for TypeSelection/DictMarshmallowField/legacy marshmallow fields, and falls back to
type-based mapping for plain fields.
"""
# Check for markers in both metadata and json_schema_extra
markers = list(finfo.metadata or [])
jse = finfo.json_schema_extra
if isinstance(jse, dict) and "metadata" in jse:
markers.extend(jse["metadata"])
for meta in markers:
if isinstance(meta, _TypeSelectionMarker):
ts = meta.type_selection
custom = ts._jsonschema_type_mapping()
if custom is not None:
return custom
return {"type": "object"}
if isinstance(meta, _NestedConfigMarker):
return unload_jsonschema_from_marshmallow_class(meta.cls)
if isinstance(meta, _MarshmallowFieldMarker):
mf = meta.marshmallow_field
if hasattr(mf, "_jsonschema_type_mapping"):
custom = mf._jsonschema_type_mapping()
if custom is not None:
return custom
# Handle FeatureList-style fields with inner and length constraints
if hasattr(mf, "inner") and mf.inner is not None:
inner_schema = {}
if hasattr(mf.inner, "_jsonschema_type_mapping"):
inner_schema = mf.inner._jsonschema_type_mapping() or {}
result = {"type": "array", "items": inner_schema}
if hasattr(mf, "min_length") and mf.min_length is not None:
result["minItems"] = mf.min_length
if hasattr(mf, "max_length") and mf.max_length is not None:
result["maxItems"] = mf.max_length
return result
return {"type": "object"}
# Handle InitializerOrDict fields
from pydantic_core import PydanticUndefined
extra = finfo.json_schema_extra
if isinstance(extra, dict) and "_initializer_options" in extra:
init_options = extra["_initializer_options"]
return {
"oneOf": [
{"type": "string", "enum": init_options},
{
"type": "object",
"properties": {"type": {"type": "string", "enum": init_options}},
"required": ["type"],
"additionalProperties": True,
},
{"type": "null"},
],
"default": finfo.default if finfo.default is not PydanticUndefined else "xavier_uniform",
"description": finfo.description or "",
}
# Build schema from field info
schema: dict[str, Any] = {}
# Description
desc = finfo.description or ""
if desc:
schema["description"] = desc
# Default value (PydanticUndefined is already imported above, before the InitializerOrDict check)
if finfo.default is not PydanticUndefined:
if not callable(finfo.default) and not isinstance(finfo.default, property):
schema["default"] = finfo.default
# Enum constraint from json_schema_extra
extra = finfo.json_schema_extra
if isinstance(extra, dict):
if "enum" in extra:
schema["enum"] = extra["enum"]
if "parameter_metadata" in extra:
schema["parameter_metadata"] = copy.deepcopy(extra["parameter_metadata"])
# Always include parameter_metadata (default if not explicitly provided)
if "parameter_metadata" not in schema:
schema["parameter_metadata"] = convert_metadata_to_json(None)
# Map type annotation to JSON schema type
# Only emit type if annotation and default are consistent (avoid mismatches
# like annotation=int but default=7.5 which was common in marshmallow era)
if annotation is not None:
type_str = _annotation_to_json_type(annotation)
if type_str:
# If the enum contains None, the JSON schema type must include "null"
enum_vals = schema.get("enum")
if enum_vals is not None and None in enum_vals:
if isinstance(type_str, list):
if "null" not in type_str:
type_str = type_str + ["null"]
elif type_str != "null":
type_str = [type_str, "null"]
# Emit the type only when the default is consistent with it; a mismatch
# would cause JSON schema validation failures on the default itself.
default_val = finfo.default if finfo.default is not PydanticUndefined else None
if default_val is None or _default_matches_json_type(default_val, type_str):
schema["type"] = type_str
# Range constraints and pattern from pydantic Field metadata
from annotated_types import Ge, Gt, Le, Lt
for meta in finfo.metadata or []:
if isinstance(meta, Ge):
schema["minimum"] = meta.ge
elif isinstance(meta, Gt):
schema["exclusiveMinimum"] = meta.gt
elif isinstance(meta, Le):
schema["maximum"] = meta.le
elif isinstance(meta, Lt):
schema["exclusiveMaximum"] = meta.lt
elif hasattr(meta, "pattern") and getattr(meta, "pattern", None) is not None:
schema["pattern"] = meta.pattern
return schema
def _annotation_to_json_type(annotation) -> str | list | None:
    """Map a Python type annotation to a JSON schema type string.

    Returns a list such as ["number", "null"] for optional types, and None when no mapping applies.
    """
    import types
    import typing

    origin = getattr(annotation, "__origin__", None)
    # Python 3.10+ unions (e.g. float | None) are instances of types.UnionType and may lack
    # __origin__; typing.Union / typing.Optional annotations are detected through __origin__.
    if isinstance(annotation, types.UnionType) or origin is typing.Union:
        args = annotation.__args__
        has_none = type(None) in args
        non_none = [a for a in args if a is not type(None)]
        if len(non_none) != 1:
            # A union of several concrete types has no single JSON schema type here.
            return None
        base = _annotation_to_json_type(non_none[0])
        if has_none and base:
            return [base, "null"]
        return base
    _TYPE_MAP = {
        str: "string",
        int: "integer",
        float: "number",
        bool: "boolean",
        dict: "object",
        list: "array",
        tuple: "array",
    }
    if annotation in _TYPE_MAP:
        return _TYPE_MAP[annotation]
    return None
@DeveloperAPI
def unload_jsonschema_from_marshmallow_class(mclass, additional_properties: bool = True, title: str = None) -> dict:
"""Get a JSON schema dict for a Ludwig config class.
Iterates over pydantic model_fields and checks metadata markers for TypeSelection, DictMarshmallowField, and legacy
marshmallow fields.
"""
assert_is_a_marshmallow_class(mclass)
properties = {}
annotations = {}
# Gather annotations from the class and its MRO
for klass in reversed(mclass.__mro__):
annotations.update(getattr(klass, "__annotations__", {}))
for fname, finfo in mclass.model_fields.items():
ann = annotations.get(fname)
properties[fname] = _field_info_to_jsonschema(fname, finfo, ann)
schema = {
"type": "object",
"properties": properties,
"additionalProperties": additional_properties,
}
if title is not None:
schema["title"] = title
return schema
# ============================================================================
# Field Factory Functions
# ============================================================================
# All return pydantic Field() objects (FieldInfo) that can be used as class
# variable defaults in BaseMarshmallowConfig subclasses.
# ============================================================================
def _make_json_schema_extra(
description: str = "",
parameter_metadata: ParameterMetadata = None,
**extra,
) -> dict | None:
"""Build json_schema_extra dict for Field(), returning None if empty."""
result = {}
if parameter_metadata:
result["parameter_metadata"] = convert_metadata_to_json(parameter_metadata)
result.update(extra)
return result or None
@DeveloperAPI
def InitializerOptions(default: str = "xavier_uniform", description="", parameter_metadata: ParameterMetadata = None):
"""Utility wrapper that returns a `StringOptions` field with keys from `initializer_registry`."""
return StringOptions(
list(initializer_registry.keys()),
default=default,
allow_none=False,
description=description,
parameter_metadata=parameter_metadata,
)
@DeveloperAPI
def ActivationOptions(default: str | None = "relu", description=None, parameter_metadata: ParameterMetadata = None):
"""Utility wrapper that returns a `StringOptions` field with keys from `activations` registry."""
description = description or "Default activation function applied to the output of the fully connected layers."
parameter_metadata = parameter_metadata or COMMON_METADATA["activation"]
return StringOptions(
list(activations.keys()),
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
@DeveloperAPI
def ReductionOptions(default: None | str = None, description="", parameter_metadata: ParameterMetadata = None):
"""Utility wrapper that returns a `StringOptions` field with keys from `reduce_mode_registry`."""
return StringOptions(
list(reduce_mode_registry.keys()),
default=default,
allow_none=True,
description=description,
parameter_metadata=parameter_metadata,
)
@DeveloperAPI
def RegularizerOptions(
default: None | str,
allow_none: bool = False,
description="",
parameter_metadata: ParameterMetadata = None,
):
"""Utility wrapper that returns a `StringOptions` field with prefilled regularizer options."""
return StringOptions(
["l1", "l2", "l1_l2"],
default=default,
allow_none=allow_none,
description=description,
parameter_metadata=parameter_metadata,
)
@DeveloperAPI
def String(
description: str,
default: None | str,
allow_none: bool = False,
pattern: str = None,
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field for string values."""
if default is not None and not isinstance(default, str):
raise ValueError(f"Provided default `{default}` should be a string!")
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
kwargs = {}
if pattern is not None:
kwargs["pattern"] = pattern
return Field(
default=default,
description=description,
json_schema_extra=json_extra,
**kwargs,
)
@DeveloperAPI
def StringOptions(
options: list[str],
default: None | str,
allow_none: bool = False,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field that enforces string inputs must be one of `options`."""
options = list(options) # ensure list, not dict_keys or other iterable
assert len(options) > 0, "Must provide non-empty list of options!"
if default is not None:
assert isinstance(default, str), f"Provided default `{default}` should be a string!"
if allow_none and None not in options:
options = options + [None]
if not allow_none and None in options:
options = [o for o in options if o is not None]
assert len(options) == len(set(options)), f"Provided options must be unique! See: {options}"
assert default in options, f"Provided default `{default}` is not one of allowed options: {options}"
json_extra = _make_json_schema_extra(
description=description,
parameter_metadata=parameter_metadata,
enum=options,
)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def ProtectedString(
pstring: str,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Alias for a `StringOptions` field with only one option."""
return StringOptions(
options=[pstring],
default=pstring,
allow_none=False,
description=description,
parameter_metadata=parameter_metadata,
)
@DeveloperAPI
def IntegerOptions(
options: list[int],
default: None | int,
allow_none: bool = False,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field that enforces integer inputs must be one of `options`."""
if len(options) <= 0:
raise ValueError("Must provide non-empty list of options!")
if default is not None and not isinstance(default, int):
raise ValueError(f"Provided default `{default}` should be an int!")
if allow_none and None not in options:
options = list(options) + [None]
if not allow_none and None in options:
options = [o for o in options if o is not None]
if default not in options:
raise ValueError(f"Provided default `{default}` is not one of allowed options: {options}")
json_extra = _make_json_schema_extra(
description=description,
parameter_metadata=parameter_metadata,
enum=options,
)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def Boolean(default: bool, description: str = "", parameter_metadata: ParameterMetadata = None):
"""Returns a pydantic Field for boolean values."""
if default is not None and not isinstance(default, bool):
raise ValueError(f"Invalid default: `{default}`")
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def Integer(
default: None | int,
allow_none=False,
description="",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field strictly enforcing integer inputs."""
if default is not None and not isinstance(default, int):
raise ValueError(f"Invalid default: `{default}`")
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def PositiveInteger(
description: str,
default: None | int,
allow_none: bool = False,
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field enforcing positive integer inputs (>= 1)."""
if default is not None:
if not isinstance(default, int) or default < 1:
raise ValueError(f"Invalid default: `{default}`")
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, ge=1, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def NonNegativeInteger(
description: str,
default: None | int,
allow_none: bool = False,
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field enforcing nonnegative integer inputs (>= 0)."""
if default is not None:
if not isinstance(default, int) or default < 0:
raise ValueError(f"Invalid default: `{default}`")
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, ge=0, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def IntegerRange(
description: str,
default: None | int,
allow_none: bool = False,
parameter_metadata: ParameterMetadata = None,
min: int = None,
max: int = None,
min_inclusive: bool = True,
max_inclusive: bool = True,
):
"""Returns a pydantic Field enforcing integer inputs within a range."""
if default is not None:
if not isinstance(default, int):
raise ValueError(f"Invalid default: `{default}`")
if min is not None and ((min_inclusive and default < min) or (not min_inclusive and default <= min)):
raise ValueError(f"Invalid default: `{default}` (below min {min})")
if max is not None and ((max_inclusive and default > max) or (not max_inclusive and default >= max)):
raise ValueError(f"Invalid default: `{default}` (above max {max})")
kwargs = {}
if min is not None:
kwargs["ge" if min_inclusive else "gt"] = min
if max is not None:
kwargs["le" if max_inclusive else "lt"] = max
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, description=description, json_schema_extra=json_extra, **kwargs)
@DeveloperAPI
def Float(
default: None | float | int,
allow_none=False,
description="",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field for float inputs."""
if default is not None and not isinstance(default, (float, int)):
raise ValueError(f"Invalid default: `{default}`")
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def NonNegativeFloat(
default: None | float,
allow_none: bool = False,
description: str = "",
max: float | None = None,
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field enforcing nonnegative float inputs."""
if default is not None:
if not isinstance(default, (float, int)) or default < 0:
raise ValueError(f"Invalid default: `{default}`")
if max is not None and default > max:
raise ValueError(f"Invalid default: `{default}` (above max {max})")
kwargs = {"ge": 0.0}
if max is not None:
kwargs["le"] = max
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, description=description, json_schema_extra=json_extra, **kwargs)
@DeveloperAPI
def FloatRange(
default: None | float,
allow_none: bool = False,
description: str = "",
parameter_metadata: ParameterMetadata = None,
min: float | None = None,
max: float | None = None,
min_inclusive: bool = True,
max_inclusive: bool = True,
):
"""Returns a pydantic Field enforcing float inputs within a range."""
if default is not None:
if not isinstance(default, (float, int)):
raise ValueError(f"Invalid default: `{default}`")
kwargs = {}
if min is not None:
kwargs["ge" if min_inclusive else "gt"] = min
if max is not None:
kwargs["le" if max_inclusive else "lt"] = max
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata, **extra_kwargs)
return Field(default=default, description=description, json_schema_extra=json_extra, **kwargs)
@DeveloperAPI
def Dict(
default: None | dict = None,
allow_none: bool = True,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field for dict values."""
allow_none = allow_none or default is None
if default is not None:
if not isinstance(default, dict):
raise ValueError(f"Invalid default: `{default}`")
if not all(isinstance(k, str) for k in default.keys()):
raise ValueError(f"Invalid default: `{default}` (non-string keys)")
elif not allow_none:
default = {}
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata)
if default is None:
return Field(default=None, description=description, json_schema_extra=json_extra)
return Field(default_factory=lambda: copy.deepcopy(default), description=description, json_schema_extra=json_extra)
@DeveloperAPI
def List(
list_type: type[str] | type[int] | type[float] | type[list] = str,
inner_type: type[str] | type[int] | type[float] | type[dict] = float,
default: None | list[Any] = None,
allow_none: bool = True,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field for list values."""
if default is not None:
if not isinstance(default, list):
raise ValueError(f"Invalid default: `{default}`")
elif not allow_none:
default = []
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata)
if default is None:
return Field(default=None, description=description, json_schema_extra=json_extra)
return Field(default_factory=lambda: copy.deepcopy(default), description=description, json_schema_extra=json_extra)
@DeveloperAPI
def DictList(
default: None | list[dict] = None,
allow_none: bool = True,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field for list-of-dicts values."""
if default is not None:
if not isinstance(default, list) or not all(isinstance(d, dict) for d in default):
raise ValueError(f"Invalid default: `{default}`")
elif not allow_none:
default = []
json_extra = _make_json_schema_extra(description=description, parameter_metadata=parameter_metadata)
if default is None:
return Field(default=None, description=description, json_schema_extra=json_extra)
return Field(default_factory=lambda: copy.deepcopy(default), description=description, json_schema_extra=json_extra)
@DeveloperAPI
def Embed(description: str = "", parameter_metadata: ParameterMetadata = None):
"""Returns a pydantic Field for embedding input feature names (int, str, or None)."""
_embed_options = ["add"]
json_extra = _make_json_schema_extra(
description=description,
parameter_metadata=parameter_metadata,
_embed_options=_embed_options,
)
return Field(default=None, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def InitializerOrDict(
default: str = "xavier_uniform", description: str = "", parameter_metadata: ParameterMetadata = None
):
"""Returns a pydantic Field allowing str or dict initializer values."""
initializers = list(initializer_registry.keys())
if not isinstance(default, str) or default not in initializers:
raise ValueError(f"Invalid default: `{default}`")
json_extra = _make_json_schema_extra(
description=description,
parameter_metadata=parameter_metadata,
_initializer_options=initializers,
)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def FloatRangeTupleDataclassField(
n: int = 2,
default: tuple | None = (0.9, 0.999),
allow_none: bool = False,
min: int | None = 0,
max: int | None = 1,
description: str = "",
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field for an N-dim tuple with values in a range."""
if default is not None:
if n != len(default):
raise ValueError(f"Dimension of tuple '{n}' must match dimension of default val. '{default}'")
for v in default:
if min is not None and v < min:
raise ValueError(f"Invalid default: value {v} below minimum {min}")
if max is not None and v > max:
raise ValueError(f"Invalid default: value {v} above maximum {max}")
if default is None and not allow_none:
raise ValueError("Default value must not be None if allow_none is False")
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(
description=description,
parameter_metadata=parameter_metadata,
_float_tuple_range={"n": n, "min": min, "max": max},
**extra_kwargs,
)
return Field(default=default, description=description, json_schema_extra=json_extra)
@DeveloperAPI
def OneOfOptionsField(
default: Any,
description: str,
field_options: list,
allow_none: bool = False,
parameter_metadata: ParameterMetadata = None,
):
"""Returns a pydantic Field that accepts values matching any of the field_options.
Pydantic union validation handles the multi-type dispatch. Only a `_oneof_options` flag is recorded in
json_schema_extra for JSON schema generation; `field_options` itself is not stored on the returned Field.
"""
extra_kwargs = {}
if allow_none:
extra_kwargs["allow_none"] = True
json_extra = _make_json_schema_extra(
description=description,
parameter_metadata=parameter_metadata,
_oneof_options=True,
**extra_kwargs,
)
if default is None or isinstance(default, (int, str, bool)):
return Field(default=default, description=description, json_schema_extra=json_extra)
return Field(default_factory=lambda: copy.deepcopy(default), description=description, json_schema_extra=json_extra)
# ============================================================================
# TypeSelection - Polymorphic config dispatch based on registry
# ============================================================================
class TypeSelection(LudwigSchemaField):
"""Resolves polymorphic config types from a registry based on a key field.
Used for fields like encoder, decoder, optimizer where the config class depends on a "type" key in the dict value.
"""
def __init__(
self,
registry: Registry,
default_value: str | None = None,
key: str = "type",
description: str = "",
parameter_metadata: ParameterMetadata = None,
allow_str_value: bool = False,
allow_none: bool = False,
**kwargs,
):
self.registry = registry
self.default_value = default_value
self.key = key
self.allow_str_value = allow_str_value
self.allow_none = allow_none
self.description = description
self.parameter_metadata = parameter_metadata
def _deserialize(self, value, attr, data, **kwargs):
"""Marshmallow deserialization - delegates to resolve()."""
return self.resolve(value)
def resolve(self, value):
"""Resolve a raw value (dict, str, None) to a config instance."""
if value is None:
# None is passed through whether or not allow_none is set
return None
# Already a config instance
if isinstance(value, BaseMarshmallowConfig):
return value
if self.allow_str_value and isinstance(value, str):
value = self.str_value_to_object(value)
if isinstance(value, dict):
cls_type = value.get(self.key)
cls_type = cls_type.lower() if cls_type else self.default_value
if cls_type and cls_type in self.registry:
cls = self.get_schema_from_registry(cls_type)
try:
return cls.model_validate(value)
except (TypeError, PydanticValidationError) as e:
raise ConfigValidationError(f"Invalid params: {value}, see `{cls}` definition") from e
raise ConfigValidationError(f"Invalid type: '{cls_type}', expected one of: {list(self.registry.keys())}")
maybe_str = ", `str`," if self.allow_str_value else ""
raise ConfigValidationError(f"Invalid param {value}, expected `None`{maybe_str} or `dict`")
def str_value_to_object(self, value: str) -> dict:
"""Convert a string shorthand to a dict with the type key."""
return {self.key: value}
def get_schema_from_registry(self, key: str) -> type[BaseMarshmallowConfig]:
"""Look up a config class from the registry."""
return self.registry[key]
def get_default_field(self) -> FieldInfo:
"""Create a pydantic Field wrapping this TypeSelection.
The TypeSelection instance is stored in Field.metadata so the base class's model_validator can use it for
dispatch.
"""
if self.default_value is not None:
cls = self.get_schema_from_registry(self.default_value.lower())
key = self.key
dv = self.default_value
def default_factory(cls=cls, key=key, dv=dv):
return cls.model_validate({key: dv})
else:
def default_factory():
return None
fi = Field(default_factory=default_factory)
fi.metadata = [_TypeSelectionMarker(self)]
return fi
def _jsonschema_type_mapping(self):
"""Override in subclass for custom JSON schema."""
return None
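# Illustration only: a stdlib-only sketch of the registry dispatch that TypeSelection.resolve
# performs. Dataclasses stand in for pydantic config classes, and every name here
# (`AdamSketch`, `SGDSketch`, `REGISTRY_SKETCH`, `resolve_sketch`) is hypothetical. Unlike
# resolve(), the sketch writes the lowercased type name back into the value before constructing.

```python
from dataclasses import dataclass


@dataclass
class AdamSketch:
    type: str = "adam"
    lr: float = 0.001


@dataclass
class SGDSketch:
    type: str = "sgd"
    lr: float = 0.01


REGISTRY_SKETCH = {"adam": AdamSketch, "sgd": SGDSketch}


def resolve_sketch(value, default_type="adam", key="type", allow_str_value=True):
    """Turn None, a string shorthand, or a dict into an instance from the registry."""
    if value is None:
        return None
    if allow_str_value and isinstance(value, str):
        value = {key: value}  # string shorthand: "sgd" -> {"type": "sgd"}
    if isinstance(value, dict):
        type_name = value.get(key)
        type_name = type_name.lower() if type_name else default_type
        if type_name in REGISTRY_SKETCH:
            return REGISTRY_SKETCH[type_name](**{**value, key: type_name})
        raise ValueError(f"Invalid type: {type_name!r}, expected one of: {list(REGISTRY_SKETCH)}")
    raise ValueError(f"Invalid param {value!r}, expected None, str or dict")


from_string = resolve_sketch("sgd")
from_dict = resolve_sketch({"type": "Adam", "lr": 0.1})
from_default = resolve_sketch({"lr": 0.5})
```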
@DeveloperAPI
class DictMarshmallowField(LudwigSchemaField):
"""Validates a dict as a specific config class (non-polymorphic).
Used for fields where a dict should be deserialized into a fixed config class.
"""
def __init__(
self,
cls: type[BaseMarshmallowConfig],
allow_none: bool = True,
default_missing: bool = False,
description: str = "",
**kwargs,
):
self.cls = cls
self.allow_none = allow_none
self.default_missing = default_missing
self.description = description
def _deserialize(self, value, attr, data, **kwargs):
"""Deserialize a dict to a config instance via pydantic model_validate."""
if value is None:
return value
if isinstance(value, dict):
try:
return self.cls.model_validate(value)
except (TypeError, PydanticValidationError) as e:
raise ConfigValidationError(f"Invalid params: {value}, see `{self.cls}` definition") from e
raise ConfigValidationError("Field should be None or dict")
def get_default_field(self) -> FieldInfo:
"""Create a pydantic Field wrapping this DictMarshmallowField."""
if not self.default_missing:
cls = self.cls
def default_factory(cls=cls):
return cls.model_validate({})
else:
def default_factory():
return None
# Check if subclass overrides _jsonschema_type_mapping - if so, use
# MarshmallowFieldMarker to preserve custom JSON schema generation
has_custom_schema = type(self)._jsonschema_type_mapping is not DictMarshmallowField._jsonschema_type_mapping
if has_custom_schema:
marker = _MarshmallowFieldMarker(self)
else:
marker = _NestedConfigMarker(self.cls, self.allow_none)
fi = Field(default_factory=default_factory)
fi.metadata = [marker]
return fi
def _jsonschema_type_mapping(self):
return unload_jsonschema_from_marshmallow_class(self.cls)
# Backward compatibility aliases
ValidationError = ConfigValidationError
NestedConfigField = DictMarshmallowField
LudwigConfig = BaseMarshmallowConfig
unload_jsonschema_from_config_class = unload_jsonschema_from_marshmallow_class
================================================
FILE: ludwig/serve.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import io
import json
import logging
import os
import sys
import tempfile
import pandas as pd
import torch
from torchvision.io import decode_image
from ludwig.api import LudwigModel
from ludwig.constants import AUDIO, COLUMN
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig
from ludwig.utils.server_utils import NumpyJSONResponse
logger = logging.getLogger(__name__)
try:
import uvicorn
from fastapi import FastAPI
from starlette.datastructures import UploadFile
from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware
from starlette.requests import Request
except ImportError as e:
logger.error(e)
logger.error(
"fastapi and other serving dependencies cannot be loaded "
"and may not have been installed. "
"In order to install all serving dependencies run "
"pip install ludwig[serve]"
)
sys.exit(-1)
ALL_FEATURES_PRESENT_ERROR = {"error": "entry must contain all input features"}
COULD_NOT_RUN_INFERENCE_ERROR = {"error": "Unexpected Error: could not run inference on model"}
def server(model, allowed_origins=None):
middleware = [Middleware(CORSMiddleware, allow_origins=allowed_origins)] if allowed_origins else None
app = FastAPI(middleware=middleware)
config = model.config
input_features = {f[COLUMN] for f in config["input_features"]}
@app.get("/")
def check_health():
return NumpyJSONResponse({"message": "Ludwig server is up"})
@app.post("/predict")
async def predict(request: Request):
try:
form = await request.form()
entry, files = convert_input(form, model.model.input_features)
except Exception:
logger.exception("Failed to parse predict form")
return NumpyJSONResponse(COULD_NOT_RUN_INFERENCE_ERROR, status_code=500)
try:
if (entry.keys() & input_features) != input_features:
missing_features = set(input_features) - set(entry.keys())
return NumpyJSONResponse(
{
"error": "Data received does not contain all input features. "
f"Missing features: {missing_features}."
},
status_code=400,
)
try:
resp, _ = model.predict(dataset=[entry], data_format=dict)
resp = resp.to_dict("records")[0]
return NumpyJSONResponse(resp)
except Exception as exc:
logger.exception(f"Failed to run predict: {exc}")
return NumpyJSONResponse(COULD_NOT_RUN_INFERENCE_ERROR, status_code=500)
finally:
for f in files:
os.remove(f.name)
@app.post("/batch_predict")
async def batch_predict(request: Request):
try:
form = await request.form()
data, files = convert_batch_input(form, model.model.input_features)
data_df = pd.DataFrame.from_records(data["data"], index=data.get("index"), columns=data["columns"])
except Exception:
logger.exception("Failed to parse batch_predict form")
return NumpyJSONResponse(COULD_NOT_RUN_INFERENCE_ERROR, status_code=500)
if (set(data_df.columns) & input_features) != input_features:
missing_features = set(input_features) - set(data_df.columns)
return NumpyJSONResponse(
{
"error": "Data received does not contain all input features. "
f"Missing features: {missing_features}."
},
status_code=400,
)
try:
resp, _ = model.predict(dataset=data_df)
resp = resp.to_dict("split")
return NumpyJSONResponse(resp)
except Exception:
logger.exception("Failed to run batch_predict")
return NumpyJSONResponse(COULD_NOT_RUN_INFERENCE_ERROR, status_code=500)
return app
def _write_file(v, files):
# Convert UploadFile to a NamedTemporaryFile to ensure it's on the disk
suffix = os.path.splitext(v.filename)[1]
named_file = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
files.append(named_file)
named_file.write(v.file.read())
named_file.close()
return named_file.name
def _read_image_buffer(v):
# read bytes sent via REST API and convert to image tensor
# in [channels, height, width] format
byte_string = io.BytesIO(v.file.read()).read()
image = decode_image(torch.frombuffer(byte_string, dtype=torch.uint8))
return image # channels, height, width
def convert_input(form, input_features):
"""Returns a new input and a list of files to be cleaned up."""
new_input = {}
files = []
for k, v in form.multi_items():
if isinstance(v, UploadFile):
# check if audio or image file
if input_features.get(k).type() == AUDIO:
new_input[k] = _write_file(v, files)
else:
new_input[k] = _read_image_buffer(v)
else:
new_input[k] = v
return new_input, files
def convert_batch_input(form, input_features):
"""Returns a new input and a list of files to be cleaned up."""
file_index = {}
files = []
for k, v in form.multi_items():
if isinstance(v, UploadFile):
file_index[v.filename] = v
data = json.loads(form["dataset"])
for row in data["data"]:
for i, value in enumerate(row):
if value in file_index:
feature_name = data["columns"][i]
if input_features.get(feature_name).type() == AUDIO:
row[i] = _write_file(file_index[value], files)
else:
row[i] = _read_image_buffer(file_index[value])
return data, files
def run_server(
model_path: str,
host: str,
port: int,
allowed_origins: list,
) -> None:
"""Loads a pre-trained model and serves it on an HTTP server.
# Inputs
:param model_path: (str) filepath to pre-trained model.
:param host: (str, default: `0.0.0.0`) host IP address for the server to use.
:param port: (int, default: `8000`) port number for the server to use.
:param allowed_origins: (list) list of origins allowed to make cross-origin requests.
# Return
:return: (`None`)
"""
# Use local backend for serving to use pandas DataFrames.
model = LudwigModel.load(model_path, backend="local")
app = server(model, allowed_origins)
uvicorn.run(app, host=host, port=port)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script serves a pretrained model", prog="ludwig serve", usage="%(prog)s [options]"
)
# ----------------
# Model parameters
# ----------------
parser.add_argument("-m", "--model_path", help="model to load", required=True)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
# ----------------
# Server parameters
# ----------------
parser.add_argument(
"-p",
"--port",
help="port for server (default: 8000)",
default=8000,
type=int,
)
parser.add_argument("-H", "--host", help="host for server (default: 0.0.0.0)", default="0.0.0.0")
parser.add_argument(
"-ao",
"--allowed_origins",
nargs="*",
help="A list of origins that should be permitted to make cross-origin requests. "
'Use "*" to allow any origin. See https://www.starlette.io/middleware/#corsmiddleware.',
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("serve", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.serve")
print_ludwig("Serve", LUDWIG_VERSION)
run_server(args.model_path, args.host, args.port, args.allowed_origins)
if __name__ == "__main__":
cli(sys.argv[1:])
================================================
FILE: ludwig/train.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import sys
import pandas as pd
from ludwig.api import LudwigModel
from ludwig.backend import ALL_BACKENDS, Backend, initialize_backend
from ludwig.callbacks import Callback
from ludwig.constants import CONTINUE_PROMPT, HYPEROPT, HYPEROPT_WARNING
from ludwig.contrib import add_contrib_callback_args
from ludwig.globals import LUDWIG_VERSION
from ludwig.utils.data_utils import load_config_from_str, load_yaml
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.print_utils import get_logging_level_registry, print_ludwig, query_yes_no
logger = logging.getLogger(__name__)
def train_cli(
config: str | dict = None,
dataset: str | dict | pd.DataFrame = None,
training_set: str | dict | pd.DataFrame = None,
validation_set: str | dict | pd.DataFrame = None,
test_set: str | dict | pd.DataFrame = None,
training_set_metadata: str | dict = None,
data_format: str = None,
experiment_name: str = "api_experiment",
model_name: str = "run",
model_load_path: str = None,
model_resume_path: str = None,
skip_save_training_description: bool = False,
skip_save_training_statistics: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
skip_save_processed_input: bool = False,
output_directory: str = "results",
gpus: str | int | list[int] = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
callbacks: list[Callback] = None,
backend: Backend | str = None,
random_seed: int = default_random_seed,
logging_level: int = logging.INFO,
**kwargs
) -> None:
"""*train* defines the entire training procedure used by Ludwig's internals. Requires most of the parameters
that are taken into the model. Builds a full ludwig model and performs the training.
:param config: (Union[str, dict]) in-memory representation of
config or string path to a YAML config file.
:param dataset: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing the entire dataset to be used for training.
If it has a split column, it will be used for splitting (0 for train,
1 for validation, 2 for test), otherwise the dataset will be
randomly split.
:param training_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing training data.
:param validation_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing validation data.
:param test_set: (Union[str, dict, pandas.DataFrame], default: `None`)
source containing test data.
:param training_set_metadata: (Union[str, dict], default: `None`)
metadata JSON file or loaded metadata. Intermediate preprocessed
structure containing the mappings of the input
dataset created the first time an input file is used in the same
directory with the same name and a '.meta.json' extension.
:param data_format: (str, default: `None`) format to interpret data
sources. Will be inferred automatically if not specified. Valid
formats are `'auto'`, `'csv'`, `'excel'`, `'feather'`,
`'fwf'`, `'hdf5'` (cache file produced during previous training),
`'html'` (file containing a single HTML `<table>`), `'json'`, `'jsonl'`,
`'parquet'`, `'pickle'` (pickled Pandas DataFrame), `'sas'`, `'spss'`,
`'stata'`, `'tsv'`.
:param experiment_name: (str, default: `'api_experiment'`) name for
the experiment.
:param model_name: (str, default: `'run'`) name of the model that is
being used.
:param model_load_path: (str, default: `None`) if this is specified the
loaded model will be used as initialization
(useful for transfer learning).
:param model_resume_path: (str, default: `None`) resumes training of
the model from the path specified. The config is restored.
In addition to config, training statistics, loss for each
epoch and the state of the optimizer are restored such that
training can be effectively continued from a previously interrupted
training process.
:param skip_save_training_description: (bool, default: `False`) disables
saving the description JSON file.
:param skip_save_training_statistics: (bool, default: `False`) disables
saving training statistics JSON file.
:param skip_save_model: (bool, default: `False`) disables
saving model weights and hyperparameters each time the model
improves. By default Ludwig saves model weights after each epoch
the validation metric improves, but if the model is really big
that can be time consuming. If you do not want to keep
the weights and just find out what performance a model can get
with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on and the returned model
will have the weights obtained at the end of training, instead of
the weights of the epoch with the best validation performance.
:param skip_save_progress: (bool, default: `False`) disables saving
progress each epoch. By default Ludwig saves weights and stats
after each epoch to enable resuming of training, but if
the model is really big this can be time consuming and uses
twice as much space. Use this parameter to skip it; training
cannot then be resumed later on.
:param skip_save_log: (bool, default: `False`) disables saving
TensorBoard logs. By default Ludwig saves logs for the TensorBoard,
but if it is not needed turning it off can slightly increase the
overall speed.
:param skip_save_processed_input: (bool, default: `False`) if an input
dataset is provided, it is preprocessed and cached by saving HDF5
and JSON files to avoid running the preprocessing again. If this
parameter is `True`, the HDF5 and JSON files are not saved.
:param output_directory: (str, default: `'results'`) the directory that
will contain the training statistics, TensorBoard logs, the saved
model and the training progress files.
:param gpus: (list, default: `None`) list of GPUs that are available
for training.
:param gpu_memory_limit: (float, default: `None`) maximum memory fraction
[0, 1] allowed to allocate per GPU device.
:param allow_parallel_threads: (bool, default: `True`) allow PyTorch
to use multithreading parallelism to improve performance at
the cost of determinism.
:param callbacks: (list, default: `None`) a list of
`ludwig.callbacks.Callback` objects that provide hooks into the
Ludwig pipeline.
:param backend: (Union[Backend, str]) `Backend` or string name
of backend to use to execute preprocessing / training steps.
:param random_seed: (int, default: `42`) random seed used for weights
initialization, splits and any other random function.
:param logging_level: (int) Log level that will be sent to stderr.
# Return
:return: (`None`)
"""
if HYPEROPT in config:
if not query_yes_no(HYPEROPT_WARNING + CONTINUE_PROMPT):
sys.exit(1)
# Stop gap: remove hyperopt from the config to prevent interference with training step sizes
# TODO: https://github.com/ludwig-ai/ludwig/issues/2633
# Need to investigate why the presence of hyperopt in the config interferes with training step sizes
config.pop(HYPEROPT)
if model_load_path:
model = LudwigModel.load(
model_load_path,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
else:
model = LudwigModel(
config=config,
logging_level=logging_level,
backend=backend,
gpus=gpus,
gpu_memory_limit=gpu_memory_limit,
allow_parallel_threads=allow_parallel_threads,
callbacks=callbacks,
)
model.train(
dataset=dataset,
training_set=training_set,
validation_set=validation_set,
test_set=test_set,
training_set_metadata=training_set_metadata,
data_format=data_format,
experiment_name=experiment_name,
model_name=model_name,
model_resume_path=model_resume_path,
skip_save_training_description=skip_save_training_description,
skip_save_training_statistics=skip_save_training_statistics,
skip_save_model=skip_save_model,
skip_save_progress=skip_save_progress,
skip_save_log=skip_save_log,
skip_save_processed_input=skip_save_processed_input,
output_directory=output_directory,
random_seed=random_seed,
)
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script trains a model", prog="ludwig train", usage="%(prog)s [options]"
)
# ----------------------------
# Experiment naming parameters
# ----------------------------
parser.add_argument("--output_directory", type=str, default="results", help="directory that contains the results")
parser.add_argument("--experiment_name", type=str, default="experiment", help="experiment name")
parser.add_argument("--model_name", type=str, default="run", help="name for the model")
# ---------------
# Data parameters
# ---------------
parser.add_argument(
"--dataset",
help="input data file path. "
"If it has a split column, it will be used for splitting "
"(0: train, 1: validation, 2: test), "
"otherwise the dataset will be randomly split",
)
parser.add_argument("--training_set", help="input train data file path")
parser.add_argument("--validation_set", help="input validation data file path")
parser.add_argument("--test_set", help="input test data file path")
parser.add_argument(
"--training_set_metadata",
help="input metadata JSON file path. An intermediate preprocessed file "
"containing the mappings of the input file created "
"the first time a file is used, in the same directory "
"with the same name and a .meta.json extension",
)
parser.add_argument(
"--data_format",
help="format of the input data",
default="auto",
choices=[
"auto",
"csv",
"excel",
"feather",
"fwf",
"hdf5",
"html",
"json",
"jsonl",
"parquet",
"pickle",
"sas",
"spss",
"stata",
"tsv",
],
)
parser.add_argument(
"-sspi",
"--skip_save_processed_input",
help="skips saving intermediate HDF5 and JSON files",
action="store_true",
default=False,
)
# ----------------
# Model parameters
# ----------------
config = parser.add_mutually_exclusive_group(required=True)
config.add_argument(
"-c",
"--config",
type=load_yaml,
help="Path to the YAML file containing the model configuration",
)
config.add_argument(
"-cs",
"--config_str",
dest="config",
type=load_config_from_str,
help="JSON or YAML serialized string of the model configuration",
)
parser.add_argument("-mlp", "--model_load_path", help="path of a pretrained model to load as initialization")
parser.add_argument("-mrp", "--model_resume_path", help="path of the model directory to resume training of")
parser.add_argument(
"-sstd",
"--skip_save_training_description",
action="store_true",
default=False,
help="disables saving the description JSON file",
)
parser.add_argument(
"-ssts",
"--skip_save_training_statistics",
action="store_true",
default=False,
help="disables saving training statistics JSON file",
)
parser.add_argument(
"-ssm",
"--skip_save_model",
action="store_true",
default=False,
help="disables saving weights each time the model improves. "
"By default Ludwig saves weights after each epoch "
"the validation metric improves, but if the model is really big "
"that can be time consuming. If you do not want to keep "
"the weights and just find out what performance a model can get "
"with a set of hyperparameters, use this parameter to skip it",
)
parser.add_argument(
"-ssp",
"--skip_save_progress",
action="store_true",
default=False,
help="disables saving weights after each epoch. By default Ludwig saves "
"weights after each epoch for enabling resuming of training, but "
"if the model is really big that can be time consuming and will "
"use twice as much space. Use this parameter to skip it",
)
parser.add_argument(
"-ssl",
"--skip_save_log",
action="store_true",
default=False,
help="disables saving TensorBoard logs. By default Ludwig saves "
"logs for the TensorBoard, but if it is not needed turning it off "
"can slightly increase the overall speed",
)
# ------------------
# Runtime parameters
# ------------------
parser.add_argument(
"-rs",
"--random_seed",
type=int,
default=42,
help="a random seed that is going to be used anywhere there is a call "
"to a random number generator: data splitting, parameter "
"initialization and training set shuffling",
)
parser.add_argument("-g", "--gpus", nargs="+", type=int, default=None, help="list of gpus to use")
parser.add_argument(
"-gml",
"--gpu_memory_limit",
type=float,
default=None,
help="maximum memory fraction [0, 1] allowed to allocate per GPU device",
)
parser.add_argument(
"-dpt",
"--disable_parallel_threads",
action="store_false",
dest="allow_parallel_threads",
help="disable PyTorch from using multithreading for reproducibility",
)
parser.add_argument(
"-b",
"--backend",
help="specifies backend to use for parallel / distributed execution, defaults to local execution",
choices=ALL_BACKENDS,
)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="the level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("train", *sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.train")
args.backend = initialize_backend(args.backend or args.config.get("backend"))
if args.backend.is_coordinator():
print_ludwig("Train", LUDWIG_VERSION)
train_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
================================================
FILE: ludwig/trainers/__init__.py
================================================
# register trainers
import ludwig.trainers.trainer # noqa: F401
try:
import ludwig.trainers.trainer_llm # noqa: F401
except ImportError:
pass
================================================
FILE: ludwig/trainers/base.py
================================================
from abc import ABC, abstractmethod
from ludwig.data.dataset.base import Dataset
from ludwig.globals import MODEL_FILE_NAME
from ludwig.schema.trainer import BaseTrainerConfig
from ludwig.types import ModelConfigDict
from ludwig.utils.defaults import default_random_seed
class BaseTrainer(ABC):
@abstractmethod
def train(self, training_set, validation_set=None, test_set=None, save_path=MODEL_FILE_NAME, **kwargs):
raise NotImplementedError()
@abstractmethod
def train_online(
self,
dataset,
):
raise NotImplementedError()
@abstractmethod
def tune_batch_size(
self,
config: ModelConfigDict,
training_set: Dataset,
random_seed: int = default_random_seed,
max_trials: int = 10,
halving_limit: int = 3,
tune_for_training: bool = True,
) -> int:
raise NotImplementedError()
@property
@abstractmethod
def validation_field(self):
raise NotImplementedError()
@property
@abstractmethod
def validation_metric(self):
raise NotImplementedError()
# Remote implementations may override this
def shutdown(self):
pass
@property
def local_rank(self) -> int:
return 0
def barrier(self):
pass
# Functions needed to treat Trainer as a context manager
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.shutdown()
@staticmethod
@abstractmethod
def get_schema_cls() -> type[BaseTrainerConfig]:
raise NotImplementedError()
================================================
FILE: ludwig/trainers/registry.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.registry import DEFAULT_KEYS, Registry
_trainers_registry = Registry()
_ray_trainers_registry = Registry()
_llm_trainers_registry = Registry()
_llm_ray_trainers_registry = Registry()
@DeveloperAPI
def get_trainers_registry() -> Registry:
return _trainers_registry
@DeveloperAPI
def get_ray_trainers_registry() -> Registry:
return _ray_trainers_registry
@DeveloperAPI
def get_llm_trainers_registry() -> Registry:
return _llm_trainers_registry
@DeveloperAPI
def get_llm_ray_trainers_registry() -> Registry:
return _llm_ray_trainers_registry
@DeveloperAPI
def register_trainer(model_type: str, default=False):
"""Register a trainer class that supports training the given model types.
Using default=True will make the trainer the default trainer for the model type.
Args:
model_type: The model_type which dictates the trainer type to use.
default: Whether the trainer should be the default trainer for the model type.
"""
def wrap(cls):
_trainers_registry[model_type] = cls
if default:
if DEFAULT_KEYS[0] in _trainers_registry:
raise ValueError(f"Default trainer already registered for model type {model_type}")
for key in DEFAULT_KEYS:
_trainers_registry[key] = cls
return cls
return wrap
@DeveloperAPI
def register_ray_trainer(model_type: str, default=False):
"""Register a trainer class that supports training the given model types with Ray backend.
Using default=True will make the trainer the default trainer for the model type.
Args:
model_type: The model_type which dictates the trainer type to use.
default: Whether the trainer should be the default trainer for the model type.
"""
def wrap(cls):
_ray_trainers_registry[model_type] = cls
if default:
if DEFAULT_KEYS[0] in _ray_trainers_registry:
raise ValueError(f"Default trainer already registered for model type {model_type}")
for key in DEFAULT_KEYS:
_ray_trainers_registry[key] = cls
return cls
return wrap
@DeveloperAPI
def register_llm_trainer(trainer_type: str, default=False):
"""Register a trainer class that supports training the specific type of training strategy for LLM Models.
Using default=True will make the trainer the default trainer for the LLM model type.
Args:
trainer_type: The trainer_type which dictates what training strategy to use.
default: Whether the trainer should be the default trainer for LLMs.
"""
def wrap(cls):
_llm_trainers_registry[trainer_type] = cls
if default:
if DEFAULT_KEYS[0] in _llm_trainers_registry:
raise ValueError(f"Default trainer {trainer_type} already registered for LLM")
for key in DEFAULT_KEYS:
_llm_trainers_registry[key] = cls
return cls
return wrap
@DeveloperAPI
def register_llm_ray_trainer(trainer_type: str, default=False):
"""Register a trainer class that supports training the specific type of training strategy for LLM Models with
Ray backend.
Using default=True will make the trainer the default trainer for the LLM model type.
Args:
trainer_type: The trainer_type which dictates what training strategy to use.
default: Whether the trainer should be the default trainer for LLMs.
"""
def wrap(cls):
_llm_ray_trainers_registry[trainer_type] = cls
if default:
if DEFAULT_KEYS[0] in _llm_ray_trainers_registry:
raise ValueError(f"Default ray trainer {trainer_type} already registered for LLM")
for key in DEFAULT_KEYS:
_llm_ray_trainers_registry[key] = cls
return cls
return wrap
================================================
FILE: ludwig/trainers/trainer.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""This module contains the Trainer class and auxiliary methods for training a model."""
import contextlib
import csv
import logging
import math
import os
import os.path
import signal
import sys
import tempfile
import threading
import time
from collections.abc import Callable
import numpy as np
import packaging.version
import pandas as pd
import psutil
import torch
from torch.utils.tensorboard import SummaryWriter
from ludwig.constants import (
AUTO,
LOSS,
MAX_CPU_BATCH_SIZE,
MINIMIZE,
MODEL_ECD,
MODEL_LLM,
TEST,
TRAINING,
USED_TOKENS,
VALIDATION,
)
from ludwig.data.dataset.base import Dataset
from ludwig.distributed.base import DistributedStrategy, LocalStrategy
from ludwig.globals import (
is_progressbar_disabled,
MODEL_FILE_NAME,
MODEL_HYPERPARAMETERS_FILE_NAME,
TRAINING_CHECKPOINTS_DIR_PATH,
TRAINING_PROGRESS_TRACKER_FILE_NAME,
)
from ludwig.models.ecd import ECD
from ludwig.models.llm import LLM
from ludwig.models.predictor import Predictor
from ludwig.modules.lr_scheduler import LRScheduler
from ludwig.modules.metric_modules import get_improved_fn, get_initial_validation_value
from ludwig.modules.metric_registry import get_metric_objective
from ludwig.modules.optimization_modules import create_clipper
from ludwig.progress_bar import LudwigProgressBar
from ludwig.schema.trainer import ECDTrainerConfig
from ludwig.trainers.base import BaseTrainer
from ludwig.trainers.registry import register_trainer
from ludwig.types import ModelConfigDict
from ludwig.utils import time_utils
from ludwig.utils.batch_size_tuner import BatchSizeEvaluator
from ludwig.utils.checkpoint_utils import Checkpoint, CheckpointManager
from ludwig.utils.config_utils import get_quantization
from ludwig.utils.data_utils import load_json
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.fs_utils import path_exists
from ludwig.utils.llm_utils import update_embedding_layer
from ludwig.utils.metric_utils import get_metric_names, TrainerMetric
from ludwig.utils.metrics_printed_table import print_metrics_table
from ludwig.utils.misc_utils import set_random_seed
from ludwig.utils.model_utils import contains_nan_or_inf_tensors
from ludwig.utils.torch_utils import get_torch_device
from ludwig.utils.trainer_utils import (
append_metrics,
freeze_layers_regex,
get_final_steps_per_checkpoint,
get_latest_metrics_dict,
get_new_progress_tracker,
get_total_expected_checkpoints,
get_total_steps,
ProgressTracker,
)
logger = logging.getLogger(__name__)
_TORCH210 = packaging.version.parse(torch.__version__) >= packaging.version.parse("2.1.0")
@register_trainer(MODEL_ECD, default=True)
class Trainer(BaseTrainer):
"""Trainer is a class that trains a model."""
@staticmethod
def get_schema_cls():
return ECDTrainerConfig
def __init__(
self,
config: ECDTrainerConfig,
model: ECD,
resume: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
callbacks: list = None,
report_tqdm_to_ray=False,
random_seed: int = default_random_seed,
distributed: DistributedStrategy | None = None,
device: str | None = None,
**kwargs,
):
"""Trains a model with a set of options and hyperparameters listed below. Customizable.
:param model: Underlying Ludwig model
:type model: `ludwig.models.ecd.ECD`
:param resume: Resume training a model that was being trained. (default: False).
:type resume: Boolean
:param skip_save_model: Disables saving model weights and hyperparameters each time the model improves. By
default Ludwig saves model weights after each round of evaluation in which the validation metric improves,
but if the model is really big that can be time consuming. If you do not want to keep the weights and just
find out what performance a model can get with a set of hyperparameters, use this parameter to skip it,
but the model will not be loadable later on. (default: False).
:type skip_save_model: Boolean
:param skip_save_progress: Disables saving progress each round of evaluation. By default Ludwig saves weights
and stats after each round of evaluation to enable resuming of training, but if the model is really
big that can be time consuming and uses twice as much space. Use this parameter to skip it; training
cannot then be resumed later on. (default: False).
:type skip_save_progress: Boolean
:param skip_save_log: Disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if
it is not needed turning it off can slightly increase the overall speed. (default: False).
:type skip_save_log: Boolean
:param callbacks: List of `ludwig.callbacks.Callback` objects that provide hooks into the Ludwig pipeline.
(default: None).
:type callbacks: list
:param report_tqdm_to_ray: Enables using the Ray-based tqdm callback for progress bar reporting
(default: False).
:type report_tqdm_to_ray: Boolean
:param random_seed: Default initialization for the random seeds (default: 42).
:type random_seed: Integer
:param distributed: Distributed strategy (default: None).
:type distributed: `DistributedStrategy`
:param device: Device to load the model on from a saved checkpoint. If None, the device is chosen by
`get_torch_device()` (default: None).
:type device: str
:param config: `ludwig.schema.trainer.ECDTrainerConfig` instance that specifies training hyperparameters
(required).
"""
self.distributed = distributed if distributed is not None else LocalStrategy()
self.epochs = config.epochs
self.train_steps = config.train_steps
self.enable_profiling = config.enable_profiling
self.steps_per_epoch = 0 # Computed during training, after batcher has been initialized.
self.total_steps = 0 # Computed during training, after batcher has been initialized.
self.total_expected_checkpoints = 0 # Computed during training, after batcher has been initialized.
self.regularization_lambda = config.regularization_lambda
self.regularization_type = config.regularization_type
self.batch_size = config.batch_size
self.effective_batch_size = config.effective_batch_size
self.max_batch_size = config.max_batch_size
self.eval_batch_size = config.batch_size if config.eval_batch_size is None else config.eval_batch_size
self.should_shuffle = config.should_shuffle
self._validation_field = config.validation_field
self._validation_metric = config.validation_metric
self.early_stop = config.early_stop
self.layers_to_freeze_regex = config.layers_to_freeze_regex
self.steps_per_checkpoint = config.steps_per_checkpoint
self.checkpoints_per_epoch = config.checkpoints_per_epoch
self.evaluate_training_set = config.evaluate_training_set
self.skip_all_evaluation = config.skip_all_evaluation
self.increase_batch_size_on_plateau = config.increase_batch_size_on_plateau
self.increase_batch_size_on_plateau_patience = config.increase_batch_size_on_plateau_patience
self.increase_batch_size_on_plateau_rate = config.increase_batch_size_on_plateau_rate
self.increase_batch_size_eval_metric = config.increase_batch_size_eval_metric
self.increase_batch_size_eval_split = config.increase_batch_size_eval_split
self.gradient_accumulation_steps = (
config.gradient_accumulation_steps
if self.distributed.allow_gradient_accumulation() and config.gradient_accumulation_steps != AUTO
else 1
)
self.resume = resume
self.skip_save_model = skip_save_model
self.skip_save_progress = skip_save_progress
self.skip_save_log = skip_save_log
self.random_seed = random_seed
self.received_sigint = False
self.report_tqdm_to_ray = report_tqdm_to_ray
self.callbacks = callbacks or []
self.device = device
if self.device is None:
self.device = get_torch_device()
self.model = model
self.model.prepare_for_training()
self.model = self.distributed.to_device(self.model)
self.model.metrics_to_device(self.device)
self.compiled_model = self.model
if config.compile:
self.compiled_model = torch.compile(self.model)
logger.info("Training with torchdynamo compiled model")
# ================ Optimizer tuning ================
self.gradient_clipping_config = create_clipper(config.gradient_clipping)
self.config = config
self.base_learning_rate = None
self.dist_model = None
self.optimizer = None
self.scheduler = None
self.prepare()
# Setup for automatic mixed precision (AMP)
self.use_amp = config.use_mixed_precision and self.distributed.allow_mixed_precision()
if self.use_amp:
if torch.cuda.is_available():
logger.info("Enabling automatic mixed precision (AMP)")
else:
logger.info("`trainer.use_mixed_precision=True`, but no GPU device found. Setting to `False`")
self.use_amp = False
self.scaler = torch.amp.GradScaler("cuda") if self.use_amp else None
# when training starts the sigint handler will be replaced with
# set_steps_to_1_or_quit so this is needed to remember
# the original sigint to restore at the end of training
# and before set_steps_to_1_or_quit returns
self.original_sigint_handler = None
def prepare(self):
base_learning_rate = self.config.learning_rate
if self.distributed:
lr_scale_fn = learning_rate_scale_fns[self.config.learning_rate_scaling]
base_learning_rate *= lr_scale_fn(self.distributed.size())
self.base_learning_rate = base_learning_rate
# If a regex is supplied, freeze the layers it matches
if self.config.layers_to_freeze_regex:
freeze_layers_regex(self.config, self.model)
# We may need to replace the embedding layer when using 8-bit optimizers from bitsandbytes.
update_embedding_layer(self.compiled_model, self.config)
# Register any post forward hooks for the model
self.compiled_model._activate_forward_hooks()
# Enable gradient checkpointing if configured
if self.config.enable_gradient_checkpointing:
# TODO(Arnav): Add support for gradient checkpointing in the compiled model
# when the model is an ECD model using torch.utils.checkpoint (torch.utils.checkpoint.sequential())
if not isinstance(self.compiled_model, LLM):
logger.warning("Gradient checkpointing is currently only supported for model_type: llm. Skipping...")
elif not hasattr(self.compiled_model, "model") or not hasattr(
self.compiled_model.model, "gradient_checkpointing_enable"
):
logger.warning("Gradient checkpointing is not supported by this model. Skipping...")
elif hasattr(self.compiled_model.model, "gradient_checkpointing_enable"):
if _TORCH210:
# https://pytorch.org/docs/stable/checkpoint.html
# https://github.com/huggingface/transformers/blob/02f8738ef8c674300c314d004ba436cb5aaca165/src/transformers/modeling_utils.py#L2094 # noqa: E501
self.compiled_model.model.gradient_checkpointing_enable(
gradient_checkpointing_kwargs={"use_reentrant": True}
)
else:
self.compiled_model.model.gradient_checkpointing_enable()
# `use_cache=True` is incompatible with gradient checkpointing.
self.compiled_model.model.config.use_cache = False
self.compiled_model.model.enable_input_require_grads()
logger.info("Gradient checkpointing enabled for training.")
else:
raise RuntimeError("Error when trying to enable gradient checkpointing.")
self.dist_model, self.optimizer = self.distributed.prepare(
self.compiled_model,
self.config,
self.base_learning_rate,
)
# NOTE: This is a partially configured LRScheduler. It will be updated in the first call to train_step.
self.scheduler = LRScheduler(self.config.learning_rate_scheduler, self.optimizer, 0, 0)
def train_step(
self,
inputs: dict[str, torch.Tensor],
targets: dict[str, torch.Tensor],
should_step: bool = True,
profiler: torch.profiler.profile | None = None,
) -> tuple[torch.Tensor, dict[str, torch.Tensor], torch.Tensor]:
"""Performs a single training step.
Params:
inputs: A dictionary of input data, from feature name to tensor.
targets: A dictionary of target data, from feature name to tensor.
should_step: Whether to perform a step of the optimizer after computing gradients.
profiler: Optional torch profiler, stepped once after each completed optimizer update.
Returns:
A tuple of:
1. loss tensor
2. dictionary of loss for every output feature.
3. used tokens tensor
"""
if isinstance(self.optimizer, torch.optim.LBFGS):
# NOTE: AMP is not supported for L-BFGS yet.
# NOTE: gradient accumulation is not supported for L-BFGS yet.
def closure():
# Allows L-BFGS to reevaluate the loss function
self.distributed.zero_grad(self.optimizer)
model_outputs = self.dist_model((inputs, targets))
loss, _ = self.model.train_loss(
targets, model_outputs, self.regularization_type, self.regularization_lambda
)
loss.backward()
return loss
self.distributed.step(self.optimizer, closure)
# Obtain model predictions and loss
model_outputs = self.dist_model((inputs, targets))
loss, all_losses = self.model.train_loss(
targets, model_outputs, self.regularization_type, self.regularization_lambda
)
if not self.evaluate_training_set:
# Update evaluation metrics with current model params:
# noisy but fast way to get metrics on the training set
predictions = self.model.outputs_to_predictions(model_outputs)
self.model.update_metrics(targets, predictions)
return loss, all_losses, model_outputs[USED_TOKENS]
with torch.amp.autocast("cuda") if self.use_amp else contextlib.nullcontext():
with self.distributed.prepare_model_update(self.dist_model, should_step=should_step):
# Obtain model predictions and loss
model_outputs = self.dist_model((inputs, targets))
loss, all_losses = self.model.train_loss(
targets, model_outputs, self.regularization_type, self.regularization_lambda
)
loss = loss / self.gradient_accumulation_steps
used_tokens = model_outputs[USED_TOKENS]
# Begin the backward pass
variables = self.dist_model.parameters()
if self.use_amp:
self.scaler.scale(loss).backward()
else:
self.distributed.backward(loss, self.dist_model)
if not should_step:
# Short-circuit the parameter updates if we are still accumulating gradients
return loss, all_losses, used_tokens
# Wait for gradient aggregation to complete before clipping the gradients.
# When using AMP, we need to do this before unscaling.
self.distributed.wait_optimizer_synced(self.optimizer)
if self.use_amp:
# In-place unscaling of all gradients before weights update
# Do this before gradient clipping per docs:
# https://pytorch.org/docs/master/notes/amp_examples.html#gradient-clipping
self.scaler.unscale_(self.optimizer)
if self.distributed.allow_clip_gradients():
# Clip gradients
self.clip_grads(variables)
# Apply gradient updates
with self.distributed.prepare_optimizer_update(self.optimizer):
# Because we already synchronized above, we skip doing so here
if self.use_amp:
self.scaler.step(self.optimizer)
else:
self.distributed.step(self.optimizer)
if self.use_amp:
# Update scaler in case of overflow/underflow
self.scaler.update()
if not self.evaluate_training_set:
# Update evaluation metrics with current model params:
# noisy but fast way to get metrics on the training set
predictions = self.model.outputs_to_predictions(model_outputs)
self.model.update_metrics(targets, predictions)
self.distributed.zero_grad(self.optimizer)
if profiler:
profiler.step()
return loss, all_losses, used_tokens
def clip_grads(self, variables):
"""Applies gradient clipping."""
if self.gradient_clipping_config.clipglobalnorm:
torch.nn.utils.clip_grad_norm_(variables, self.gradient_clipping_config.clipglobalnorm)
if self.gradient_clipping_config.clipnorm:
torch.nn.utils.clip_grad_norm_(variables, self.gradient_clipping_config.clipnorm)
if self.gradient_clipping_config.clipvalue:
torch.nn.utils.clip_grad_value_(variables, self.gradient_clipping_config.clipvalue)
@classmethod
def write_eval_summary(
cls,
summary_writer,
metrics,
step,
):
if not summary_writer:
return
for feature_name, output_feature in metrics.items():
for metric_name, metric_values in output_feature.items():
if metric_values:
metric_tag = f"{feature_name}/epoch_{metric_name}"
metric_val = metric_values[-1][-1]
summary_writer.add_scalar(metric_tag, metric_val, global_step=step)
summary_writer.flush()
@classmethod
def write_step_summary(
cls, train_summary_writer, combined_loss, all_losses, step, used_tokens, total_tokens_used, learning_rate=None
):
if not train_summary_writer:
return
# token information.
train_summary_writer.add_scalar("tokens/tokens", used_tokens, global_step=step)
train_summary_writer.add_scalar("tokens/total_tokens_used", total_tokens_used, global_step=step)
# combined loss
train_summary_writer.add_scalar("combined/step_training_loss", combined_loss, global_step=step)
# all other losses
for feature_name, loss in all_losses.items():
loss_tag = f"{feature_name}/step_training_loss"
train_summary_writer.add_scalar(loss_tag, loss.detach().float(), global_step=step)
if learning_rate:
train_summary_writer.add_scalar("combined/step_learning_rate", learning_rate, global_step=step)
# Log CUDA memory stats.
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
device = torch.device(f"cuda:{i}")
memory_stats = torch.cuda.memory_stats(device=device)
gb_memory_stats = {k: v / (1000**3) for k, v in memory_stats.items()}
# Allocated bytes.
train_summary_writer.add_scalar(
f"cuda/device{i}/allocated_gb.all.current",
gb_memory_stats["allocated_bytes.all.current"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/allocated_gb.all.peak",
gb_memory_stats["allocated_bytes.all.peak"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/allocated_gb.all.allocated",
gb_memory_stats["allocated_bytes.all.allocated"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/allocated_gb.all.freed",
gb_memory_stats["allocated_bytes.all.freed"],
global_step=step,
)
# Reserved bytes.
train_summary_writer.add_scalar(
f"cuda/device{i}/reserved_gb.all.current",
gb_memory_stats["reserved_bytes.all.current"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/reserved_gb.all.peak", gb_memory_stats["reserved_bytes.all.peak"], global_step=step
)
train_summary_writer.add_scalar(
f"cuda/device{i}/reserved_gb.all.allocated",
gb_memory_stats["reserved_bytes.all.allocated"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/reserved_gb.all.freed",
gb_memory_stats["reserved_bytes.all.freed"],
global_step=step,
)
# Active bytes.
train_summary_writer.add_scalar(
f"cuda/device{i}/active_gb.all.current",
gb_memory_stats["active_bytes.all.current"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/active_gb.all.peak", gb_memory_stats["active_bytes.all.peak"], global_step=step
)
train_summary_writer.add_scalar(
f"cuda/device{i}/active_gb.all.allocated",
gb_memory_stats["active_bytes.all.allocated"],
global_step=step,
)
train_summary_writer.add_scalar(
f"cuda/device{i}/active_gb.all.freed", gb_memory_stats["active_bytes.all.freed"], global_step=step
)
# Global free memory.
train_summary_writer.add_scalar(
f"cuda/device{i}/global_free_memory_gb",
torch.cuda.mem_get_info(device=device)[0] / (1000**3),
global_step=step,
)
# Total device memory (the second value returned by `torch.cuda.mem_get_info`).
train_summary_writer.add_scalar(
f"cuda/device{i}/total_memory_occupied_gb",
torch.cuda.mem_get_info(device=device)[1] / (1000**3),
global_step=step,
)
# Total memory used.
train_summary_writer.add_scalar(
f"cuda/device{i}/total_memory_used_gb",
(torch.cuda.mem_get_info(device=device)[1] - torch.cuda.mem_get_info(device=device)[0]) / (1000**3),
global_step=step,
)
# Utilization.
# https://pytorch.org/docs/stable/generated/torch.cuda.utilization.html#torch.cuda.utilization
train_summary_writer.add_scalar(
f"cuda/device{i}/utilization",
torch.cuda.utilization(device=device),
global_step=step,
)
train_summary_writer.flush()
def is_cpu_training(self):
return torch.device(self.device) == torch.device("cpu")
def tune_batch_size(
self,
config: ModelConfigDict,
training_set: Dataset,
random_seed: int = default_random_seed,
max_trials: int = 20,
halving_limit: int = 3,
snapshot_weights: bool = True,
on_best_batch_size_updated: Callable[[int, float, int], None] | None = None,
tune_for_training: bool = True,
global_max_sequence_length: int | None = None,
) -> int:
logger.info("Tuning batch size...")
skip_save_model = self.skip_save_model
skip_save_progress = self.skip_save_progress
skip_save_log = self.skip_save_log
# Set temporary values
self.skip_save_model = True
self.skip_save_progress = True
self.skip_save_log = True
# When training on CPU, larger batch sizes offer limited benefits due to lack of effective
# parallelization within a batch. As such, to increase chances of stable training, we cap the maximum
# batch size at MAX_CPU_BATCH_SIZE
max_batch_size = (
self.max_batch_size if torch.cuda.is_available() else min(self.max_batch_size, MAX_CPU_BATCH_SIZE)
)
if self.effective_batch_size != AUTO:
# If an effective batch size is set, we must ensure that batch size tuning doesn't exceed it
max_batch_size = min(self.effective_batch_size, max_batch_size)
if not tune_for_training:
# No need to save and restore model and optimizer states, as they aren't modified during predict
snapshot_weights = False
self.dist_model.train() # Sets model training mode.
evaluator = (
self._create_batch_size_evaluator() if tune_for_training else self._create_predict_batch_size_evaluator()
)
with tempfile.TemporaryDirectory() as tmpdir:
if snapshot_weights:
# Save a snapshot of the model and optimizer state to restore later, as they will be modified
# when we call the train step as part of the auto-tuning. This is undesirable, particularly for
# pretrained models.
checkpoint = self.distributed.create_checkpoint_handle(
dist_model=self.dist_model, model=self.model, optimizer=self.optimizer, scheduler=self.scheduler
)
checkpoint.save(os.path.join(tmpdir, "latest.ckpt"), global_step=0)
try:
best_batch_size = evaluator.select_best_batch_size(
len(training_set), max_batch_size, max_trials, self.is_coordinator(), global_max_sequence_length
)
best_batch_size = self.distributed.broadcast_object(best_batch_size)
if tune_for_training:
# Update batch size / gradient accumulation before preparing the trainer. This is needed primarily
# for DeepSpeed, which needs to know the batch size and gradient accumulation steps before init
self.config.batch_size = best_batch_size
self.config.update_batch_size_grad_accum(self.distributed.size())
self.batch_size = self.config.batch_size
self.gradient_accumulation_steps = self.config.gradient_accumulation_steps
return best_batch_size
finally:
# Restore the original save flags that were overridden for tuning
self.skip_save_model = skip_save_model
self.skip_save_progress = skip_save_progress
self.skip_save_log = skip_save_log
if snapshot_weights:
# Restore the model weights prior to batch size tuning to undo any updates made to the weights
if self.distributed.prepare_before_load():
# Some distributed strategies, like DeepSpeed, need to re-init before loading the model
self.prepare()
self.resume_weights_and_optimizer(str(tmpdir), checkpoint)
def _create_batch_size_evaluator(self) -> BatchSizeEvaluator:
trainer = self
class _TrainerBatchSizeEvaluator(BatchSizeEvaluator):
def reset(self):
trainer.model.reset_metrics()
trainer.optimizer.zero_grad()
def step(self, batch_size: int, global_max_sequence_length: int | None = None):
trainer.distributed.set_batch_size(trainer.dist_model, batch_size)
inputs = {
input_feature_name: input_feature.create_sample_input(batch_size=batch_size).to(trainer.device)
for input_feature_name, input_feature in trainer.model.input_features.items()
}
targets = {
output_feature_name: output_feature.create_sample_output(batch_size=batch_size).to(trainer.device)
for output_feature_name, output_feature in trainer.model.output_features.items()
}
trainer.train_step(inputs, targets)
return _TrainerBatchSizeEvaluator()
def _create_predict_batch_size_evaluator(self) -> BatchSizeEvaluator:
trainer = self
class _PredictBatchSizeEvaluator(BatchSizeEvaluator):
def reset(self):
trainer.model.reset_metrics()
trainer.optimizer.zero_grad()
def step(self, batch_size: int, global_max_sequence_length: int | None = None):
trainer.distributed.set_batch_size(trainer.dist_model, batch_size)
inputs = {
input_feature_name: input_feature.create_sample_input(batch_size=batch_size).to(trainer.device)
for input_feature_name, input_feature in trainer.model.input_features.items()
}
targets = {
output_feature_name: output_feature.create_sample_output(batch_size=batch_size).to(trainer.device)
for output_feature_name, output_feature in trainer.model.output_features.items()
}
with torch.no_grad():
trainer.dist_model((inputs, targets))
return _PredictBatchSizeEvaluator()
def run_evaluation(
self,
training_set,
validation_set,
test_set,
progress_tracker: ProgressTracker,
train_summary_writer,
validation_summary_writer,
test_summary_writer,
model_hyperparameters_path,
output_features,
metrics_names,
save_path,
loss: torch.Tensor,
all_losses: dict[str, torch.Tensor],
early_stopping_steps: int,
checkpoint_manager: CheckpointManager,
) -> bool:
"""Runs evaluation over training, validation, and test sets.
Also:
- Prints results, saves results to the progress tracker.
- Saves the model if the validation score is the best so far
- If there is no validation set, the model is always saved.
Returns whether the trainer should early stop, based on validation metrics history.
"""
start_time = time.time()
self.callback(lambda c: c.on_eval_start(self, progress_tracker, save_path))
if self.is_coordinator():
logger.info(f"\nRunning evaluation for step: {progress_tracker.steps}, epoch: {progress_tracker.epoch}")
# ================ Eval ================
# eval metrics on train
self.eval_batch_size = max(self.eval_batch_size, progress_tracker.batch_size)
if self.evaluate_training_set:
# Run a separate pass over the training data to compute metrics
self.evaluation(
training_set, "train", progress_tracker.train_metrics, self.eval_batch_size, progress_tracker
)
else:
# Use metrics accumulated during training
metrics = self.model.get_metrics()
append_metrics(self.model, "train", metrics, progress_tracker.train_metrics, progress_tracker)
self.model.reset_metrics()
self.write_eval_summary(
summary_writer=train_summary_writer,
metrics=progress_tracker.train_metrics,
step=progress_tracker.steps,
)
if validation_set is not None:
self.callback(lambda c: c.on_validation_start(self, progress_tracker, save_path))
# eval metrics on validation set
self.evaluation(
validation_set,
VALIDATION,
progress_tracker.validation_metrics,
self.eval_batch_size,
progress_tracker,
)
llm_eval_examples = progress_tracker.llm_eval_examples
dict_save_dir = os.path.join(os.path.dirname(checkpoint_manager.directory), "llm_eval_examples")
os.makedirs(dict_save_dir, exist_ok=True)
dict_save_path = os.path.join(dict_save_dir, f"{progress_tracker.checkpoint_number}.csv")
llm_eval_examples = pd.DataFrame(llm_eval_examples).to_dict(orient="records")
with open(dict_save_path, "w", encoding="utf-8") as outfile:
writer = csv.DictWriter(outfile, fieldnames=["inputs", "targets", "outputs"])
writer.writeheader()
writer.writerows(llm_eval_examples)
self.write_eval_summary(
summary_writer=validation_summary_writer,
metrics=progress_tracker.validation_metrics,
step=progress_tracker.steps,
)
self.callback(lambda c: c.on_validation_end(self, progress_tracker, save_path))
if test_set is not None:
self.callback(lambda c: c.on_test_start(self, progress_tracker, save_path))
# eval metrics on test set
self.evaluation(test_set, TEST, progress_tracker.test_metrics, self.eval_batch_size, progress_tracker)
self.write_eval_summary(
summary_writer=test_summary_writer,
metrics=progress_tracker.test_metrics,
step=progress_tracker.steps,
)
self.callback(lambda c: c.on_test_end(self, progress_tracker, save_path))
elapsed_time = (time.time() - start_time) * 1000.0
if self.is_coordinator():
logger.info(f"Evaluation took {time_utils.strdelta(elapsed_time)}\n")
print_metrics_table(
output_features,
progress_tracker.train_metrics,
progress_tracker.validation_metrics,
progress_tracker.test_metrics,
)
# ================ Validation Logic ================
should_break = False
if validation_set is not None and validation_set.size > 0:
should_break = self.check_progress_on_validation(
progress_tracker,
self.validation_field,
self.validation_metric,
save_path,
model_hyperparameters_path,
self.increase_batch_size_on_plateau,
self.increase_batch_size_on_plateau_patience,
self.increase_batch_size_on_plateau_rate,
self.max_batch_size,
self.increase_batch_size_eval_metric,
self.increase_batch_size_eval_split,
early_stopping_steps,
self.skip_save_model,
checkpoint_manager,
)
else:
# There's no validation, so we save the model.
if not self.skip_save_model:
if self.is_coordinator():
logger.info("Saving model.\n")
checkpoint_manager.save_best(progress_tracker.steps)
self.callback(lambda c: c.on_save_best_checkpoint(self, progress_tracker, save_path))
# Trigger eval end callback after any model weights save for complete checkpoint
self.callback(lambda c: c.on_eval_end(self, progress_tracker, save_path))
# Clear the CUDA cache to free up memory
torch.cuda.empty_cache()
return should_break
def save_checkpoint(self, progress_tracker: ProgressTracker, save_path: str, checkpoint_manager: CheckpointManager):
"""Checkpoints the model, progress tracker, and invokes the checkpoint callback."""
progress_tracker.increment_checkpoint()
checkpoint_manager.save(progress_tracker.steps)
if self.is_coordinator():
progress_tracker.save(os.path.join(save_path, TRAINING_PROGRESS_TRACKER_FILE_NAME))
# Callback that the checkpoint was reached, regardless of whether the model was evaluated.
self.callback(lambda c: c.on_checkpoint(self, progress_tracker))
def create_checkpoint_handle(self):
return self.distributed.create_checkpoint_handle(
dist_model=self.dist_model, model=self.model, optimizer=self.optimizer, scheduler=self.scheduler
)
def train(
self,
training_set,
validation_set=None,
test_set=None,
save_path=MODEL_FILE_NAME,
return_state_dict: bool = False,
**kwargs,
):
"""Trains a model with a set of hyperparameters listed below. Customizable.
:param training_set: The training set
:param validation_set: The validation dataset
:param test_set: The test dataset
:param save_path: The directory that will contain the saved model
:param return_state_dict: Whether to return the state dict of the model instead of the model itself
"""
# ====== General setup =======
output_features = self.model.output_features
# Only use signals when on the main thread to avoid issues with CherryPy
# https://github.com/ludwig-ai/ludwig/issues/286
if threading.current_thread() == threading.main_thread():
# Save the original SIGINT handler
# so that it can be restored at the end of training
self.original_sigint_handler = signal.getsignal(signal.SIGINT)
signal.signal(signal.SIGINT, self.set_steps_to_1_or_quit)
metrics_names = get_metric_names(output_features)
# ====== Setup file names =======
model_hyperparameters_path = None
tensorboard_log_dir = None
if self.is_coordinator():
os.makedirs(save_path, exist_ok=True)
model_hyperparameters_path = os.path.join(save_path, MODEL_HYPERPARAMETERS_FILE_NAME)
tensorboard_log_dir = os.path.join(save_path, "logs")
# Sync save_path across the workers
save_path = self.distributed.broadcast_object(save_path or "")
training_progress_tracker_path = None
training_checkpoints_path = None
if save_path:
training_progress_tracker_path = os.path.join(save_path, TRAINING_PROGRESS_TRACKER_FILE_NAME)
training_checkpoints_path = os.path.join(save_path, TRAINING_CHECKPOINTS_DIR_PATH)
self.callback(
lambda c: c.on_trainer_train_setup(self, save_path, self.is_coordinator()), coordinator_only=False
)
# ====== Setup session =======
checkpoint = self.create_checkpoint_handle()
checkpoint_manager = CheckpointManager(checkpoint, training_checkpoints_path, device=self.device)
# ====== Setup Tensorboard writers =======
train_summary_writer = None
validation_summary_writer = None
test_summary_writer = None
if self.is_coordinator() and not self.skip_save_log and tensorboard_log_dir:
train_summary_writer = SummaryWriter(os.path.join(tensorboard_log_dir, TRAINING))
if validation_set is not None and validation_set.size > 0:
validation_summary_writer = SummaryWriter(os.path.join(tensorboard_log_dir, VALIDATION))
if test_set is not None and test_set.size > 0:
test_summary_writer = SummaryWriter(os.path.join(tensorboard_log_dir, TEST))
# ================ Resume logic ================
self.callback(lambda c: c.on_resume_training(self.is_coordinator()))
should_resume = self.resume and self.resume_files_exist(
training_progress_tracker_path, training_checkpoints_path
)
# make sure all workers are on the same page about resuming.
should_resume = self.distributed.broadcast_object(should_resume, name="should_resume")
if should_resume:
try:
progress_tracker = self.resume_training_progress_tracker(training_progress_tracker_path)
self.resume_weights_and_optimizer(training_checkpoints_path, checkpoint)
if self.is_coordinator():
logger.info("Resuming training from previous run.")
except Exception:
# This may happen if model training is interrupted after the progress tracker is initialized
# but before any real training progress is made.
progress_tracker = get_new_progress_tracker(
batch_size=self.batch_size,
learning_rate=self.base_learning_rate,
best_eval_metric_value=get_initial_validation_value(self.validation_metric),
best_increase_batch_size_eval_metric=get_initial_validation_value(
self.increase_batch_size_eval_metric
),
output_features=output_features,
)
if self.is_coordinator():
logger.info("Failed to resume training from previous run. Creating fresh model training run.")
else:
progress_tracker = get_new_progress_tracker(
batch_size=self.batch_size,
learning_rate=self.base_learning_rate,
best_eval_metric_value=get_initial_validation_value(self.validation_metric),
best_increase_batch_size_eval_metric=get_initial_validation_value(self.increase_batch_size_eval_metric),
output_features=output_features,
)
if self.is_coordinator():
logger.info("Creating fresh model training run.")
# Distributed: broadcast initial variable states from rank 0 to all other processes.
# This is necessary to ensure consistent initialization of all workers when
# training is started with random weights or restored from a checkpoint.
self.distributed.sync_model(self.dist_model)
self.distributed.sync_optimizer(self.optimizer)
self.scheduler.load_state_dict(self.distributed.broadcast_object(self.scheduler.state_dict()))
# For DeepSpeed, we need to set the batch size here in case it was modified during auto-tuning
self.distributed.set_batch_size(self.dist_model, self.batch_size)
set_random_seed(self.random_seed)
if self.enable_profiling:
logger.warning("Full torch profiler is enabled. Training may be significantly slower.")
profiler = torch.profiler.profile(
schedule=torch.profiler.schedule(
wait=self.config.profiler.wait,
warmup=self.config.profiler.warmup,
active=self.config.profiler.active,
repeat=self.config.profiler.repeat,
),
on_trace_ready=torch.profiler.tensorboard_trace_handler(os.path.join(tensorboard_log_dir, "profiling")),
record_shapes=True,
with_stack=True,
profile_memory=True,
)
else:
profiler = None
try:
with training_set.initialize_batcher(
batch_size=self.batch_size,
should_shuffle=self.should_shuffle,
random_seed=self.random_seed,
distributed=self.distributed,
ignore_last=True,
augmentation_pipeline=self.model.get_augmentation_pipelines(),
) as batcher:
# ================ Training Loop ================
self.steps_per_epoch = batcher.steps_per_epoch
self.total_steps = get_total_steps(self.epochs, batcher.steps_per_epoch, self.train_steps)
# NOTE(geoffrey): this ensures that the total number of epochs coincides with the number of
# times `batcher.set_epoch` is called.
old_epochs = self.epochs
self.epochs = math.ceil(self.total_steps / self.steps_per_epoch)
if old_epochs != self.epochs:
logger.warning(
f"The number of epochs has been adjusted from config-specified {old_epochs} "
f"to {self.epochs} to match the total number of steps."
)
# Get the terminal steps per checkpoint.
final_steps_per_checkpoint = get_final_steps_per_checkpoint(
batcher.steps_per_epoch,
self.steps_per_checkpoint,
self.checkpoints_per_epoch,
self.is_coordinator(),
)
final_steps_per_checkpoint = min(final_steps_per_checkpoint, self.total_steps)
early_stopping_steps = final_steps_per_checkpoint * self.early_stop
if not self.skip_save_progress:
self.total_expected_checkpoints = get_total_expected_checkpoints(
self.total_steps, final_steps_per_checkpoint, self.epochs
)
# Initialize the learning rate scheduler.
self.scheduler = LRScheduler(
self.config.learning_rate_scheduler,
self.optimizer,
steps_per_checkpoint=final_steps_per_checkpoint,
total_steps=self.total_steps,
)
if self.is_coordinator():
logger.info(
f"Training for {self.total_steps} step(s), approximately "
f"{int(self.total_steps / batcher.steps_per_epoch)} epoch(s)."
)
if self.early_stop < 0:
logger.info("Early stopping policy: None")
else:
logger.info(
f"Early stopping policy: {self.early_stop} round(s) of evaluation, or "
f"{early_stopping_steps} step(s), approximately "
f"{int(early_stopping_steps / batcher.steps_per_epoch)} epoch(s).\n"
)
logger.info(f"Starting with step {progress_tracker.steps}, epoch: {progress_tracker.epoch}")
progress_bar_config = {
"desc": "Training",
"initial": progress_tracker.steps,
"total": self.total_steps,
"disable": is_progressbar_disabled(),
"file": sys.stdout,
}
progress_bar = LudwigProgressBar(self.report_tqdm_to_ray, progress_bar_config, self.is_coordinator())
if profiler:
profiler.start()
while progress_tracker.steps < self.total_steps:
# note that batch size may change over epochs
batcher.set_epoch(progress_tracker.epoch, progress_tracker.batch_size)
# epoch init
start_time = time.time()
# Reset the metrics at the start of the next epoch
self.dist_model.train() # Sets model to training mode.
self.model.reset_metrics()
self.callback(lambda c: c.on_epoch_start(self, progress_tracker, save_path))
# Trains over a full epoch of data or up to the last training step, whichever is sooner.
should_break, has_nan_or_inf_tensors = self._train_loop(
batcher,
progress_tracker,
save_path,
train_summary_writer,
progress_bar,
training_set,
validation_set,
test_set,
start_time,
validation_summary_writer,
test_summary_writer,
model_hyperparameters_path,
output_features,
metrics_names,
checkpoint_manager,
final_steps_per_checkpoint,
early_stopping_steps,
profiler,
)
if self.is_coordinator():
# ========== Save training progress ==========
logger.debug(
f"Epoch {progress_tracker.epoch} took: "
f"{time_utils.strdelta((time.time() - start_time) * 1000.0)}."
)
# Skip saving progress if we're not saving the model. We should do this so as to not overwrite the
# best model checkpoint from the previous round of evaluation so that the previous best model
# weights can be used for inference instead of the current weights which are in a bad state.
if has_nan_or_inf_tensors:
break
if not self.skip_save_progress:
self.save_checkpoint(progress_tracker, save_path, checkpoint_manager)
if not self.skip_save_model and self.skip_all_evaluation:
# All evaluation was skipped, so save the current step as the best so far.
checkpoint_manager.save_best(progress_tracker.steps)
# Early stop if needed.
if should_break:
break
finally:
# ================ Finished Training ================
self.callback(
lambda c: c.on_trainer_train_teardown(self, progress_tracker, save_path, self.is_coordinator()),
coordinator_only=False,
)
# Deactivate any forward hooks for the model used at training time.
self.compiled_model._deactivate_forward_hooks()
# Stop the profiler.
if profiler:
profiler.stop()
# Close the summary writers.
if train_summary_writer is not None:
train_summary_writer.close()
if validation_summary_writer is not None:
validation_summary_writer.close()
if test_summary_writer is not None:
test_summary_writer.close()
if not self.skip_save_model and self.skip_all_evaluation and not has_nan_or_inf_tensors:
# All evaluation was skipped, so save the current step as the best so far.
checkpoint_manager.save_best(progress_tracker.steps)
if not self.skip_save_progress:
checkpoint_manager.close()
# Load the best weights from saved checkpoint
state_dict = None
if self.distributed.is_coordinator():
if not self.skip_save_model:
state_dict = checkpoint_manager.get_best_checkpoint_state_for_inference(self.return_device)
if not state_dict:
error_message = "Training ran into an error. No checkpoint was saved."
if has_nan_or_inf_tensors:
error_message += (
" This is because training was terminated early due to the presence of NaN or "
"Inf values in the model weights before a single valid checkpoint could be saved."
)
raise RuntimeError(error_message)
if not return_state_dict:
if self.distributed.is_model_parallel():
# Assume the full weights cannot fit in memory on GPU
self.model = self.model.cpu()
# For a full explanation of this 8-bit workaround, see https://github.com/ludwig-ai/ludwig/pull/3606
# TODO (jeffkinnison): Determine why `SCB` and `CB` are deleted from parameter state
quantization = get_quantization(self.model.config_obj)
uses_quantization = bool(quantization) if not isinstance(quantization, list) else any(quantization)
if uses_quantization and 8 in quantization:
# If the model was previously placed on GPU, 8-bit parameter state will be updated with several
# matrices containing quantization information. These matrices are recorded in the
# training checkpoint state dicts, but do not necessarily exist in the parameter object, leading
# to a RuntimeError in `load_state_dict`. Explicitly call `model.cuda()` to make sure the
# matrices are part of model state. This workaround is necessary because the matrices are
# deleted during the model's forward pass.
if self.model.config_obj.model_type == MODEL_LLM and self.model.model.device.type == "cuda":
self.model.model.cuda()
elif self.model.config_obj.model_type == MODEL_ECD and self.model.device.type == "cuda":
self.model.cuda()
_, unexpected_keys = self.model.load_state_dict(state_dict, strict=False)
only_weights_format_keys = all("weights_format" in k for k in unexpected_keys)
# bitsandbytes adds a number of `weights_format` metadata fields to the state dict in
# `Linear8bitLt._save_to_state_dict`. These contain information about how the 8-bit tensors
# are tiled, but the fields themselves never exist in the module and get returned as unexpected
# keys when loading the state dict. Unexpected keys are therefore tolerated only when every one
# of them is a `weights_format` field.
assert (
unexpected_keys == [] or only_weights_format_keys
), f"Unexpected keys found in state dict: {unexpected_keys}"
else:
_, unexpected_keys = self.model.load_state_dict(state_dict, strict=False)
assert unexpected_keys == [], f"Unexpected keys found in state dict: {unexpected_keys}"
elif return_state_dict:
state_dict = self.model.cpu().state_dict()
# When running with Ray, we only need to return the state dict, as it's faster and cheaper to send the
# state dict over the network than to load the model state here, serialize it back to a state dict, then
# load it back on the head node.
return_value = self.model if not return_state_dict else state_dict
# restore original sigint signal handler
if self.original_sigint_handler and threading.current_thread() == threading.main_thread():
signal.signal(signal.SIGINT, self.original_sigint_handler)
return (
return_value,
progress_tracker.train_metrics,
progress_tracker.validation_metrics,
progress_tracker.test_metrics,
)
def _train_loop(
self,
batcher,
progress_tracker: ProgressTracker,
save_path,
train_summary_writer,
progress_bar: LudwigProgressBar,
training_set,
validation_set,
test_set,
start_time,
validation_summary_writer,
test_summary_writer,
model_hyperparameters_path,
output_features,
metrics_names,
checkpoint_manager: CheckpointManager,
final_steps_per_checkpoint: int,
early_stopping_steps: int,
profiler: torch.profiler.profile | None,
) -> tuple[bool, bool]:
"""Completes up to one epoch through the data.
This function completes a single pass (epoch) through the training data and returns
two boolean values:
Returns:
should_break (bool):
Indicates whether the training loop should be terminated prematurely.
has_nan_or_inf_tensors (bool):
Indicates whether the model weights contain NaN or Inf values.
"""
self.distributed.zero_grad(self.optimizer)
batch_idx = 0
should_break = False
has_nan_or_inf_tensors = False
while not batcher.last_batch() and progress_tracker.steps < self.total_steps and not should_break:
progress_tracker.learning_rate = self.optimizer.param_groups[0]["lr"]
self.callback(lambda c: c.on_batch_start(self, progress_tracker, save_path))
# obtain batch
batch = batcher.next_batch()
# determine whether we need to accumulate gradients or trigger a full parameter update
should_sync_grads = (batch_idx + 1) % self.gradient_accumulation_steps == 0
is_checkpoint_step = (progress_tracker.steps + 1) % final_steps_per_checkpoint == 0
should_step = should_sync_grads or is_checkpoint_step
batch_idx += 1
# Move tensors to cuda here.
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(self.device)
for i_feat in self.model.input_features.values()
}
targets = {
o_feat.feature_name: torch.from_numpy(np.array(batch[o_feat.proc_column], copy=True)).to(self.device)
for o_feat in self.model.output_features.values()
}
loss, all_losses, used_tokens = self.train_step(inputs, targets, should_step=should_step, profiler=profiler)
# Update LR scheduler here instead of train loop to avoid updating during batch size tuning, etc.
self.scheduler.step()
# Update progress tracker with token information.
progress_tracker.set_token_usage_for_this_step(used_tokens)
if self.is_coordinator() and not self.skip_save_log:
self.write_step_summary(
train_summary_writer=train_summary_writer,
combined_loss=loss.detach().float(),
all_losses=all_losses,
step=progress_tracker.steps,
used_tokens=used_tokens,
total_tokens_used=progress_tracker.total_tokens_used,
learning_rate=progress_tracker.learning_rate,
)
progress_tracker.steps += 1
progress_bar.set_postfix({"loss": loss.detach().item()})
progress_bar.update(1)
if self.is_coordinator():
logger.debug(
"training: completed batch %s memory used: %.2fMB",
progress_bar.total_steps,
psutil.Process(os.getpid()).memory_info()[0] / 1e6,
)
# Executing `on_batch_end` calls before `run_evaluation` enables more accurate
# batch duration measurements when using timer callbacks.
self.callback(lambda c: c.on_batch_end(self, progress_tracker, save_path, sync_step=should_step))
# If this is the last batch in the epoch, increment before running evaluation so that metrics are reported
# with the correct epoch.
if batcher.last_batch():
progress_tracker.epoch += 1
if progress_tracker.steps % final_steps_per_checkpoint == 0:
# Before continuing to evaluation or skipping evaluation altogether, we should use this point to
# ensure that the model weights are not NaN or Inf.
has_nan_or_inf_tensors = self._has_nan_or_inf_weights(self.dist_model)
# If a NaN/Inf tensor is detected, we should break out of the training loop immediately so that the caller
# can raise an error. There is no point in running evaluation for this step as the model weights are
# already in a bad state. There is also no point in continuing to train the model since the loss will
# always be NaN or Inf from this point forward.
if has_nan_or_inf_tensors:
return True, has_nan_or_inf_tensors
if not self.skip_all_evaluation:
# Publishes metrics to MLflow if there are any MLflow callbacks.
should_break = self.run_evaluation(
training_set,
validation_set,
test_set,
progress_tracker,
train_summary_writer,
validation_summary_writer,
test_summary_writer,
model_hyperparameters_path,
output_features,
metrics_names,
save_path,
loss,
all_losses,
early_stopping_steps,
checkpoint_manager,
)
else:
should_break = False
# Checkpoint the model.
# NOTE: Ideally we would do this before evaluation, but for some reason DeepSpeed will complain
# about inflight params if we do that, which is why we checkpoint after eval instead. In practice,
# this should not make a difference, except in the unlikely event an error occurs during eval and we
# want to resume from the last checkpoint, in which case we will lose slightly more progress this way.
if not self.skip_save_progress:
self.save_checkpoint(progress_tracker, save_path, checkpoint_manager)
# If this was the last batch, invoke the `on_epoch_end` callback (the epoch counter was already incremented above).
if batcher.last_batch():
self.callback(lambda c: c.on_epoch_end(self, progress_tracker, save_path))
return should_break, has_nan_or_inf_tensors
def _has_nan_or_inf_weights(self, model: torch.nn.Module) -> bool:
"""Check for NaN or infinity (inf) values in the weights (parameters and buffers) of a PyTorch model in a
local or distributed training environment. It is called to ensure the model's numerical stability during
training. It works for both model parallel and data parallel training.
This function recursively inspects the model's parameters and buffers to identify NaN or inf values. It
communicates and aggregates the results across all distributed processes using the `all_reduce` operation. If
any process finds NaN or inf values, it is considered a critical error, and the main coordinator process will
return True to halt training in the main training loop.
Parameters:
model (torch.nn.Module): The PyTorch model to check for NaN or inf weights.
Returns:
bool: Returns True if any NaN or inf tensors are found in the model's weights. Otherwise, returns False.
"""
local_has_nan_or_inf = contains_nan_or_inf_tensors(model)
# Use all_reduce to sum local_has_nan_or_inf across all processes into global_has_nan_or_inf, which
# will be a tensor with a single element on all processes after the all_reduce operation.
global_has_nan_or_inf = torch.tensor(int(local_has_nan_or_inf), device=self.device)
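# Use the returned tensor, since not every backend is guaranteed to reduce in place.
global_has_nan_or_inf = self.distributed.allreduce(global_has_nan_or_inf)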
# Local rank 0 logs a warning and returns True if any of the processes found NaN or inf weights.
if self.distributed.local_rank() == 0:
if global_has_nan_or_inf.item() > 0:
logger.warning("NaN or inf tensors found in the model. Stopping training.")
return True
return False
def train_online(self, dataset):
self.dist_model.train() # Sets model training mode.
with dataset.initialize_batcher(
batch_size=self.batch_size,
should_shuffle=self.should_shuffle,
distributed=self.distributed,
ignore_last=True,
) as batcher:
# training step loop
progress_bar_config = {
"desc": "Training online",
"total": batcher.steps_per_epoch,
"file": sys.stdout,
"disable": is_progressbar_disabled(),
}
progress_bar = LudwigProgressBar(self.report_tqdm_to_ray, progress_bar_config, self.is_coordinator())
while not batcher.last_batch():
batch = batcher.next_batch()
inputs = {
i_feat.feature_name: torch.from_numpy(np.array(batch[i_feat.proc_column], copy=True)).to(
self.device
)
for i_feat in self.model.input_features.values()
}
targets = {
o_feat.feature_name: torch.from_numpy(np.array(batch[o_feat.proc_column], copy=True)).to(
self.device
)
for o_feat in self.model.output_features.values()
}
self.train_step(
inputs,
targets,
)
progress_bar.update(1)
progress_bar.close()
return self.model
@property
def validation_field(self):
return self._validation_field
@property
def validation_metric(self):
return self._validation_metric
def evaluation(self, dataset, dataset_name, metrics_log, batch_size, progress_tracker):
predictor = Predictor(
self.dist_model,
batch_size=batch_size,
distributed=self.distributed,
report_tqdm_to_ray=self.report_tqdm_to_ray,
model=self.model,
)
metrics, _ = predictor.batch_evaluation(dataset, collect_predictions=False, dataset_name=dataset_name)
return append_metrics(self.model, dataset_name, metrics, metrics_log, progress_tracker)
def check_progress_on_validation(
self,
progress_tracker,
validation_output_feature_name,
validation_metric: str,
save_path,
model_hyperparameters_path,
increase_batch_size_on_plateau,
increase_batch_size_on_plateau_patience,
increase_batch_size_on_plateau_rate,
increase_batch_size_on_plateau_max,
increase_batch_size_eval_metric,
increase_batch_size_eval_split,
early_stopping_steps: int,
skip_save_model,
checkpoint_manager: CheckpointManager,
) -> bool:
"""Checks the history of validation scores.
Uses history of validation scores to reduce learning rate, increase batch size, and decide whether training
should stop.
Saves the model if scores have improved.
Returns whether the model should stop training.
"""
should_break = False
improved_fn = get_improved_fn(validation_metric)
all_validation_metrics = progress_tracker.validation_metrics[validation_output_feature_name]
# The most recent validation_metric metric.
eval_metric: TrainerMetric = all_validation_metrics[validation_metric][-1]
eval_metric_value = eval_metric[-1]
if eval_metric_value != eval_metric_value:
# Fallback to 0 if the validation metric value is a NaN.
# This is potentially relevant for small datasets like those used in testing where if there's only a
# single output label, some metrics like ROC may turn out to be NaN.
# However, we want to guarantee that the model will be saved at least once over a full
# training-checkpoint-eval-loop.
eval_metric_value = 0
if improved_fn(eval_metric_value, progress_tracker.best_eval_metric_value):
previous_best_eval_metric_value = progress_tracker.best_eval_metric_value
# Save the value, steps, epoch, and checkpoint number.
progress_tracker.best_eval_metric_value = eval_metric_value
progress_tracker.best_eval_metric_steps = progress_tracker.steps
progress_tracker.best_eval_metric_epoch = progress_tracker.epoch
progress_tracker.best_eval_metric_checkpoint_number = progress_tracker.checkpoint_number
# Save best metrics for all data subsets.
progress_tracker.best_eval_train_metrics = get_latest_metrics_dict(progress_tracker.train_metrics)
progress_tracker.best_eval_validation_metrics = get_latest_metrics_dict(progress_tracker.validation_metrics)
progress_tracker.best_eval_test_metrics = get_latest_metrics_dict(progress_tracker.test_metrics)
if self.is_coordinator():
logger.info(
f"Evaluation validation metric: '{validation_output_feature_name}' '{validation_metric}' improved."
)
absolute_eval_metric_value_change = round(
abs(previous_best_eval_metric_value - progress_tracker.best_eval_metric_value), 3
)
if get_metric_objective(validation_metric) == MINIMIZE:
logger.info(
f"'{validation_output_feature_name}' '{validation_metric}' decreased by "
f"{absolute_eval_metric_value_change}."
)
else:
logger.info(
f"'{validation_output_feature_name}' '{validation_metric}' increased by "
f"{absolute_eval_metric_value_change}."
)
# Save the model.
if not skip_save_model:
logger.info("New best model saved.\n")
checkpoint_manager.save_best(progress_tracker.steps)
self.callback(lambda c: c.on_save_best_checkpoint(self, progress_tracker, save_path))
last_improvement_in_steps = progress_tracker.steps - progress_tracker.best_eval_metric_steps
progress_tracker.last_improvement_steps = last_improvement_in_steps
if last_improvement_in_steps != 0 and self.is_coordinator():
logger.info(
f"Last improvement of {validation_output_feature_name} validation {validation_metric} happened "
+ f"{last_improvement_in_steps} step(s) ago.\n"
)
# ========== Learning Rate Schedule evaluation updates ========
self.scheduler.eval_step(progress_tracker, validation_output_feature_name)
# ========== Increase Batch Size Plateau logic =========
if increase_batch_size_on_plateau > 0:
self.increase_batch_size(
progress_tracker,
validation_output_feature_name,
increase_batch_size_on_plateau,
increase_batch_size_on_plateau_patience,
increase_batch_size_on_plateau_rate,
increase_batch_size_on_plateau_max,
increase_batch_size_eval_metric,
increase_batch_size_eval_split,
)
progress_tracker.last_increase_batch_size = (
progress_tracker.steps - progress_tracker.last_increase_batch_size_steps
)
if (
progress_tracker.last_increase_batch_size > 0
and progress_tracker.last_increase_batch_size_eval_metric_improvement > 0
and not progress_tracker.num_increases_batch_size >= increase_batch_size_on_plateau
and not progress_tracker.batch_size >= increase_batch_size_on_plateau_max
):
logger.info(
"Last batch size increase "
f"happened {progress_tracker.last_increase_batch_size} step(s) ago, "
f"improvement of {validation_output_feature_name} {increase_batch_size_eval_split} "
f"{increase_batch_size_eval_metric} happened "
f"{progress_tracker.last_increase_batch_size_eval_metric_improvement} step(s) ago."
)
# ========== Early Stop logic ==========
# If any early stopping condition is satisfied, either lack of improvement for many steps, or via callbacks on
# any worker, then trigger early stopping.
early_stop_bool = 0 < early_stopping_steps <= last_improvement_in_steps
if not early_stop_bool:
for callback in self.callbacks:
if callback.should_early_stop(self, progress_tracker, self.is_coordinator()):
early_stop_bool = True
break
should_early_stop = torch.as_tensor([early_stop_bool], dtype=torch.int, device=self.device)
should_early_stop = self.distributed.allreduce(should_early_stop)
if should_early_stop.item():
if self.is_coordinator():
logger.info(
f"\nEARLY STOPPING due to lack of validation improvement. It has been {last_improvement_in_steps} "
"step(s) since last validation improvement."
)
should_break = True
return should_break
def set_steps_to_1_or_quit(self, signum, frame):
"""Custom SIGINT handler used to gracefully exit training.
A single SIGINT will stop training after the next training step. A second SIGINT will stop training immediately.
"""
if not self.received_sigint:
self.total_steps = 1
self.received_sigint = True
logger.critical("\nReceived SIGINT, will finish this training step and then conclude training.")
logger.critical("Send another SIGINT to immediately interrupt the process.")
else:
logger.critical("\nReceived a second SIGINT, will now quit")
if self.original_sigint_handler:
signal.signal(signal.SIGINT, self.original_sigint_handler)
sys.exit(1)
@staticmethod
def resume_files_exist(
training_progress_tracker_path: str,
training_checkpoint_path: str,
) -> bool:
missing_files = []
# training_progress.json
if not path_exists(training_progress_tracker_path):
missing_files.append(training_progress_tracker_path)
# latest.ckpt in training_checkpoints/
latest_ckpt = os.path.join(training_checkpoint_path, "latest.ckpt")
if not path_exists(latest_ckpt):
missing_files.append(latest_ckpt)
if missing_files:
logger.warning(f"Could not find {missing_files} while trying to resume model training.")
return False
return True
def resume_training_progress_tracker(self, training_progress_tracker_path):
progress_tracker_dict = None
if self.is_coordinator():
logger.info(f"Loading progress tracker for model: {training_progress_tracker_path}")
progress_tracker_dict = load_json(training_progress_tracker_path)
logger.debug("Broadcasting model progress tracker dict to all workers")
progress_tracker_dict = self.distributed.broadcast_object(
progress_tracker_dict, name="broadcast_progress_tracker"
)
progress_tracker = ProgressTracker.load(progress_tracker_dict)
return progress_tracker
def resume_weights_and_optimizer(
self,
model_weights_progress_path: str,
checkpoint: Checkpoint,
):
CheckpointManager.load_latest_checkpoint(checkpoint, model_weights_progress_path, self.device)
def increase_batch_size(
self,
progress_tracker: ProgressTracker,
validation_output_feature_name: str,
increase_batch_size_on_plateau: int,
increase_batch_size_on_plateau_patience: int,
increase_batch_size_on_plateau_rate: float,
increase_batch_size_on_plateau_max: int,
increase_batch_size_eval_metric: str = LOSS,
increase_batch_size_eval_split: str = TRAINING,
):
"""Uses the progress tracker to determine if the batch size should be increased."""
if (
not progress_tracker.num_increases_batch_size >= increase_batch_size_on_plateau
and not progress_tracker.batch_size == increase_batch_size_on_plateau_max
):
if increase_batch_size_eval_split == TRAINING:
split_metrics = progress_tracker.train_metrics
elif increase_batch_size_eval_split == VALIDATION:
split_metrics = progress_tracker.validation_metrics
else: # if increase_batch_size_eval_split == TEST:
split_metrics = progress_tracker.test_metrics
validation_metric = increase_batch_size_eval_metric
last_metric = split_metrics[validation_output_feature_name][validation_metric][-1]
last_metric_value = last_metric[-1]
improved_fn = get_improved_fn(validation_metric)
is_improved = improved_fn(last_metric_value, progress_tracker.best_increase_batch_size_eval_metric)
if is_improved:
# We update the best metric value and set it to the current one, and reset last
# improvement step count
progress_tracker.best_increase_batch_size_eval_metric = last_metric_value
progress_tracker.last_increase_batch_size_eval_metric_improvement = 0
else:
progress_tracker.last_increase_batch_size_eval_metric_improvement += 1
if not is_improved and (
# Batch size increase happened more than N steps ago
progress_tracker.last_increase_batch_size >= increase_batch_size_on_plateau_patience
and (
# No improvement of the evaluation metric since more than N steps ago
progress_tracker.last_increase_batch_size_eval_metric_improvement
>= increase_batch_size_on_plateau_patience
)
):
progress_tracker.batch_size = min(
int(increase_batch_size_on_plateau_rate * progress_tracker.batch_size),
increase_batch_size_on_plateau_max,
)
if self.is_coordinator():
logger.info(
f"PLATEAU REACHED, increasing batch size to {progress_tracker.batch_size} due to lack of "
f"improvement of {validation_output_feature_name} {increase_batch_size_eval_split} "
f"{validation_metric}."
)
progress_tracker.last_increase_batch_size_steps = progress_tracker.steps
progress_tracker.last_increase_batch_size = 0
progress_tracker.num_increases_batch_size += 1
if progress_tracker.num_increases_batch_size >= increase_batch_size_on_plateau:
if self.is_coordinator():
logger.info(
f"Batch size was already increased {progress_tracker.num_increases_batch_size} times, "
"not increasing it anymore."
)
elif progress_tracker.batch_size >= increase_batch_size_on_plateau_max:
if self.is_coordinator():
logger.info(
f"Batch size was already increased {progress_tracker.num_increases_batch_size} times, "
f"currently it is {progress_tracker.batch_size}, the maximum allowed."
)
def is_coordinator(self):
return self.distributed.rank() == 0
@property
def local_rank(self) -> int:
return self.distributed.local_rank()
def barrier(self):
self.distributed.barrier()
def callback(self, fn, coordinator_only=True):
if not coordinator_only or self.is_coordinator():
for callback in self.callbacks:
fn(callback)
@property
def return_device(self):
return self.device
class RemoteTrainer(Trainer):
def __init__(self, gpus=None, gpu_memory_limit=None, allow_parallel_threads=True, **kwargs):
super().__init__(**kwargs)
# Only return results from rank 0 to reduce network overhead
self.train = self.distributed.return_first(self.train)
self.train_online = self.distributed.return_first(self.train_online)
@property
def return_device(self):
# When returning the model weights from remote to driver, place them on CPU,
# as the driver likely doesn't have a GPU.
return "cpu"
learning_rate_scale_fns = {
"linear": lambda n: n,
"sqrt": lambda n: math.sqrt(n),
"constant": lambda n: 1,
}
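# Illustrative sketch, not part of Ludwig: one way a lookup table like `learning_rate_scale_fns` might be
# used to scale a base learning rate by the number of workers. The helper `scale_learning_rate` and its
# parameters are hypothetical names introduced only for this example.

```python
import math

# Same mapping as above, repeated so the sketch is self-contained.
learning_rate_scale_fns = {
    "linear": lambda n: n,
    "sqrt": lambda n: math.sqrt(n),
    "constant": lambda n: 1,
}


def scale_learning_rate(base_lr: float, num_workers: int, scaling: str = "linear") -> float:
    """Hypothetical helper: multiply the base learning rate by the factor chosen by `scaling`."""
    return base_lr * learning_rate_scale_fns[scaling](num_workers)
```

# With 4 workers, "linear" multiplies the base rate by 4, "sqrt" by 2, and "constant" leaves it unchanged.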
================================================
FILE: ludwig/trainers/trainer_llm.py
================================================
import logging
import os
import time
from collections.abc import Callable
from typing import Union
from torch.utils.tensorboard import SummaryWriter
from ludwig.constants import MINIMUM_BATCH_SIZE, TEST, TRAINING, VALIDATION
from ludwig.data.dataset.base import Dataset
from ludwig.distributed.base import DistributedStrategy, LocalStrategy
from ludwig.features.feature_utils import LudwigFeatureDict
from ludwig.globals import MODEL_FILE_NAME
from ludwig.models.llm import LLM
from ludwig.models.predictor import LlmFineTunePredictor, LlmPredictor
from ludwig.modules.metric_modules import get_initial_validation_value
from ludwig.schema.trainer import BaseTrainerConfig, FineTuneTrainerConfig, NoneTrainerConfig
from ludwig.trainers.base import BaseTrainer
from ludwig.trainers.registry import register_llm_ray_trainer, register_llm_trainer
from ludwig.trainers.trainer import Trainer
from ludwig.types import ModelConfigDict
from ludwig.utils import time_utils
from ludwig.utils.batch_size_tuner import (
BatchSizeEvaluator,
LLMFinetunePredictBatchSizeEvaluator,
LLMFinetuneTrainerBatchSizeEvaluator,
)
from ludwig.utils.defaults import default_random_seed
from ludwig.utils.metric_utils import TrainerMetric
from ludwig.utils.metrics_printed_table import print_metrics_table
from ludwig.utils.misc_utils import set_random_seed
from ludwig.utils.torch_utils import get_torch_device
from ludwig.utils.trainer_utils import append_metrics, get_new_progress_tracker, ProgressTracker
logger = logging.getLogger(__name__)
MAX_EVALUATION_EXAMPLES = 1000
MAX_EVALUATION_EXAMPLES_SHOWN = 5
@register_llm_trainer("none")
@register_llm_ray_trainer("none")
class NoneTrainer(BaseTrainer):
"""NoneTrainer is a trainer that does not train a model, only runs evaluation."""
def __init__(
self,
config: NoneTrainerConfig,
model: LLM,
resume: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
callbacks: list | None = None,
report_tqdm_to_ray=False,
random_seed: float = default_random_seed,
distributed: DistributedStrategy | None = None,
device: str | None = None,
**kwargs,
):
"""
:param config: `ludwig.schema.trainer.NoneTrainerConfig` instance that specifies training hyperparameters
(default: `ludwig.schema.trainer.NoneTrainerConfig()`).
:param model: Underlying Ludwig model
:type model: `ludwig.models.llm.LLM`
:param resume: Resume training a model that was being trained. (default: False).
:type resume: Boolean
:param skip_save_model: Disables saving model weights and hyperparameters each time the model improves. By
default Ludwig saves model weights after each round of evaluation in which the validation metric improves,
but if the model is really big that can be time consuming. If you do not want to keep the weights and just
want to find out what performance a model can get with a set of hyperparameters, use this parameter to
skip it, but the model will not be loadable later on. (default: False).
:type skip_save_model: Boolean
:param skip_save_progress: Disables saving progress each round of evaluation. By default Ludwig saves weights
and stats after each round of evaluation to enable resuming of training, but if the model is really
big that can be time consuming and will use twice as much space. Use this parameter to skip it, but
training cannot be resumed later on. (default: False).
:type skip_save_progress: Boolean
:param skip_save_log: Disables saving TensorBoard logs. By default Ludwig saves logs for TensorBoard, but if
they are not needed, turning this off can slightly increase the overall speed. (default: False).
:type skip_save_log: Boolean
:param callbacks: List of `ludwig.callbacks.Callback` objects that provide hooks into the Ludwig pipeline.
(default: None).
:type callbacks: list
:param report_tqdm_to_ray: Enables using the ray based tqdm Callback for progress bar reporting
:param random_seed: Default initialization for the random seeds (default: 42).
:type random_seed: Float
:param distributed: Distributed strategy (default: None).
:type distributed: `DistributedStrategy`
:param device: Device to load the model on from a saved checkpoint (default: None).
:type device: str
"""
super().__init__()
# Ensure distributed strategy is initialized for metric sync_context.
# NoneTrainer may run on the head node (not in a Ray Train worker),
# so init_dist_strategy may not have been called yet.
from ludwig.distributed import init_dist_strategy
init_dist_strategy("local")
self.config = config
self.distributed = distributed if distributed is not None else LocalStrategy()
self.skip_save_log = skip_save_log
self.resume = resume
self.skip_save_model = skip_save_model
self.skip_save_progress = skip_save_progress
self.random_seed = random_seed
self.callbacks = callbacks or []
self.report_tqdm_to_ray = report_tqdm_to_ray
self.device = device if device is not None else get_torch_device()
self.model = model.to_device(self.device)
self.model.metrics_to_device(self.device)
# Since we are only running evaluation without training, set the model to evaluation mode.
self.model.eval()
self.batch_size = self.config.batch_size
self.eval_batch_size = self.config.eval_batch_size
self.base_learning_rate = self.config.base_learning_rate
self.should_shuffle = self.config.should_shuffle
self.epochs = self.config.epochs
self.train_steps = self.config.train_steps
self.steps_per_checkpoint = self.config.steps_per_checkpoint
self.checkpoints_per_epoch = self.config.checkpoints_per_epoch
self.early_stop = self.config.early_stop
self.evaluate_training_set = self.config.evaluate_training_set
self.skip_all_evaluation = self.config.skip_all_evaluation
def close_writers(
self, progress_tracker, save_path, train_summary_writer, validation_summary_writer, test_summary_writer
):
# ================ Finished Training ================
self.callback(
lambda c: c.on_trainer_train_teardown(self, progress_tracker, save_path, self.is_coordinator()),
coordinator_only=False,
)
if train_summary_writer is not None:
train_summary_writer.close()
if validation_summary_writer is not None:
validation_summary_writer.close()
if test_summary_writer is not None:
test_summary_writer.close()
def train(
self,
training_set: Dataset,
validation_set: Dataset | None = None,
test_set: Dataset | None = None,
save_path: str = MODEL_FILE_NAME,
return_state_dict: bool = False,
**kwargs,
):
output_features = self.model.output_features
# ====== Setup file names =======
tensorboard_log_dir = None
if self.is_coordinator():
os.makedirs(save_path, exist_ok=True)
tensorboard_log_dir = os.path.join(save_path, "logs")
self.callback(
lambda c: c.on_trainer_train_setup(self, save_path, self.is_coordinator()), coordinator_only=False
)
train_summary_writer = None
validation_summary_writer = None
test_summary_writer = None
if self.is_coordinator() and not self.skip_save_log and tensorboard_log_dir:
train_summary_writer = SummaryWriter(os.path.join(tensorboard_log_dir, TRAINING))
if validation_set is not None and validation_set.size > 0:
validation_summary_writer = SummaryWriter(os.path.join(tensorboard_log_dir, VALIDATION))
if test_set is not None and test_set.size > 0:
test_summary_writer = SummaryWriter(os.path.join(tensorboard_log_dir, TEST))
set_random_seed(self.random_seed)
progress_tracker = get_new_progress_tracker(
batch_size=self.batch_size,
learning_rate=self.base_learning_rate,
best_eval_metric_value=get_initial_validation_value(self.validation_metric),
best_increase_batch_size_eval_metric=get_initial_validation_value(self.validation_metric),
output_features=output_features,
)
# When running with Ray, we only need to return the state dict, as it's faster and cheaper to send the
# state dict over the network than to load the model state here, serialize it back to a state dict, then
# load it back on the head node.
return_value = self.model if not return_state_dict else self.model.cpu().state_dict()
if self.skip_all_evaluation:
self.close_writers(
progress_tracker, save_path, train_summary_writer, validation_summary_writer, test_summary_writer
)
return (
return_value,
progress_tracker.train_metrics,
progress_tracker.validation_metrics,
progress_tracker.test_metrics,
)
try:
self.run_evaluation(
training_set,
validation_set,
test_set,
progress_tracker,
train_summary_writer,
validation_summary_writer,
test_summary_writer,
output_features,
save_path,
)
finally:
self.close_writers(
progress_tracker, save_path, train_summary_writer, validation_summary_writer, test_summary_writer
)
return (
return_value,
progress_tracker.train_metrics,
progress_tracker.validation_metrics,
progress_tracker.test_metrics,
)
def train_online(
self,
dataset,
):
pass
def tune_batch_size(
self,
config: ModelConfigDict,
training_set: Dataset,
random_seed: int = default_random_seed,
max_trials: int = 20,
halving_limit: int = 3,
snapshot_weights: bool = True,
on_best_batch_size_updated: Callable[[int, float, int], None] | None = None,
tune_for_training: bool = True,
) -> int:
# TODO: Implement batch size tuning for LLM, currently just returns the default batch size
# Compared to ECD, this just requires forward passes till we OOM.
# https://github.com/ludwig-ai/ludwig/issues/3525
return MINIMUM_BATCH_SIZE
@property
def validation_field(self):
return self.config.validation_field
@property
def validation_metric(self):
return self.config.validation_metric
# Remote implementations may override this
def shutdown(self):
pass
@property
def local_rank(self) -> int:
return 0
def barrier(self):
pass
# Functions needed to treat Trainer as a context manager
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.shutdown()
@staticmethod
def get_schema_cls() -> type[BaseTrainerConfig]:
return NoneTrainerConfig
def is_coordinator(self) -> bool:
return self.distributed.rank() == 0
def callback(self, fn, coordinator_only=True):
if not coordinator_only or self.is_coordinator():
for callback in self.callbacks:
fn(callback)
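# A minimal, self-contained sketch of the dispatch rule used by `callback` above:
# hooks run on every rank only when coordinator_only=False; otherwise they run on
# rank 0 alone. `RecordingCallback` and `TinyTrainer` are hypothetical stand-ins for
# illustration and are not part of Ludwig.

```python
class RecordingCallback:
    """Hypothetical callback that records which hooks were invoked."""

    def __init__(self):
        self.events = []

    def on_eval_start(self, trainer):
        self.events.append("eval_start")


class TinyTrainer:
    """Hypothetical trainer that only models rank and callback dispatch."""

    def __init__(self, rank, callbacks):
        self.rank = rank
        self.callbacks = callbacks

    def is_coordinator(self):
        return self.rank == 0

    def callback(self, fn, coordinator_only=True):
        # Same rule as above: skip non-coordinator ranks unless told otherwise.
        if not coordinator_only or self.is_coordinator():
            for callback in self.callbacks:
                fn(callback)


worker_callback = RecordingCallback()
worker = TinyTrainer(rank=1, callbacks=[worker_callback])

# Default dispatch is coordinator-only, so a rank-1 worker skips the hook.
worker.callback(lambda c: c.on_eval_start(worker))
events_after_default = list(worker_callback.events)

# With coordinator_only=False the hook runs on every rank.
worker.callback(lambda c: c.on_eval_start(worker), coordinator_only=False)
events_after_broadcast = list(worker_callback.events)
```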
def evaluation(
self,
dataset: "Dataset", # noqa: F821
dataset_name: str,
metrics_log: dict[str, dict[str, list[TrainerMetric]]],
batch_size: int,
progress_tracker: ProgressTracker,
):
predictor = LlmPredictor(
self.model, batch_size=batch_size, distributed=self.distributed, report_tqdm_to_ray=self.report_tqdm_to_ray
)
metrics, _ = predictor.batch_evaluation(dataset, collect_predictions=False, dataset_name=dataset_name)
return append_metrics(self.model, dataset_name, metrics, metrics_log, progress_tracker)
@classmethod
def write_eval_summary(
cls,
summary_writer,
metrics,
step,
):
if not summary_writer:
return
for feature_name, output_feature in metrics.items():
# Use a distinct name for the per-metric history so it does not shadow the `metrics` argument.
for metric_name, metric_history in output_feature.items():
if metric_history:
metric_tag = f"{feature_name}/epoch_{metric_name}"
metric_val = metric_history[-1][-1]
summary_writer.add_scalar(metric_tag, metric_val, global_step=step)
summary_writer.flush()
def run_evaluation(
self,
training_set: Union["Dataset", "RayDataset"], # noqa: F821
validation_set: Union["Dataset", "RayDataset"] | None, # noqa: F821
test_set: Union["Dataset", "RayDataset"] | None, # noqa: F821
progress_tracker: ProgressTracker,
train_summary_writer: SummaryWriter,
validation_summary_writer: SummaryWriter,
test_summary_writer: SummaryWriter,
output_features: LudwigFeatureDict,
save_path: str,
) -> bool:
"""Runs evaluation over training, validation, and test sets.
Also prints the results and saves them to the progress tracker.
This trainer performs no training, so no model checkpoint is saved here and early stopping is never triggered.
Always returns False, meaning the trainer should not early stop.
"""
start_time = time.time()
self.callback(lambda c: c.on_eval_start(self, progress_tracker, save_path))
progress_tracker.checkpoint_number += 1
if self.is_coordinator():
logger.info(f"\nRunning evaluation for step: {progress_tracker.steps}, epoch: {progress_tracker.epoch}")
# ================ Eval ================
# Run a separate pass over the training data to compute metrics
# Appends results to progress_tracker.train_metrics.
if self.evaluate_training_set:
self.evaluation(
training_set, "train", progress_tracker.train_metrics, self.eval_batch_size, progress_tracker
)
self.write_eval_summary(
summary_writer=train_summary_writer,
metrics=progress_tracker.train_metrics,
step=progress_tracker.steps,
)
if validation_set is not None:
self.callback(lambda c: c.on_validation_start(self, progress_tracker, save_path))
# eval metrics on validation set
self.evaluation(
validation_set,
VALIDATION,
progress_tracker.validation_metrics,
self.eval_batch_size,
progress_tracker,
)
self.write_eval_summary(
summary_writer=validation_summary_writer,
metrics=progress_tracker.validation_metrics,
step=progress_tracker.steps,
)
self.callback(lambda c: c.on_validation_end(self, progress_tracker, save_path))
if test_set is not None:
self.callback(lambda c: c.on_test_start(self, progress_tracker, save_path))
# eval metrics on test set
self.evaluation(test_set, TEST, progress_tracker.test_metrics, self.eval_batch_size, progress_tracker)
self.write_eval_summary(
summary_writer=test_summary_writer,
metrics=progress_tracker.test_metrics,
step=progress_tracker.steps,
)
self.callback(lambda c: c.on_test_end(self, progress_tracker, save_path))
elapsed_time = (time.time() - start_time) * 1000.0
if self.is_coordinator():
logger.info(f"Evaluation took {time_utils.strdelta(elapsed_time)}\n")
print_metrics_table(
output_features,
progress_tracker.train_metrics,
progress_tracker.validation_metrics,
progress_tracker.test_metrics,
)
# Trigger eval end callback after any model weights save for complete checkpoint
self.callback(lambda c: c.on_eval_end(self, progress_tracker, save_path))
return False
@register_llm_trainer("finetune")
class FineTuneTrainer(Trainer):
@staticmethod
def get_schema_cls():
return FineTuneTrainerConfig
def __init__(
self,
config: FineTuneTrainerConfig,
model: LLM,
resume: bool = False,
skip_save_model: bool = False,
skip_save_progress: bool = False,
skip_save_log: bool = False,
callbacks: list | None = None,
report_tqdm_to_ray=False,
random_seed: int = default_random_seed,
distributed: DistributedStrategy | None = None,
device: str | None = None,
**kwargs,
):
super().__init__(
config,
model,
resume,
skip_save_model,
skip_save_progress,
skip_save_log,
callbacks,
report_tqdm_to_ray,
random_seed,
distributed,
device,
**kwargs,
)
def evaluation(self, dataset, dataset_name, metrics_log, batch_size, progress_tracker):
predictor = LlmFineTunePredictor(
self.model, batch_size=batch_size, distributed=self.distributed, report_tqdm_to_ray=self.report_tqdm_to_ray
)
metrics, _, input_target_output_dict = predictor.batch_evaluation(
dataset, collect_predictions=False, dataset_name=dataset_name
)
# Setting collect_predictions=True currently causes an error when doing batch evaluation because the outputs
# can be of variable sizes but we try to concatenate them into a single tensor.
tokenizer = self.dist_model.tokenizer
# There should only be one key in the dict for LLMs
input_key = list(input_target_output_dict["inputs"].keys())[0]
num_examples = min(len(input_target_output_dict["inputs"][input_key]), MAX_EVALUATION_EXAMPLES)
llm_eval_examples = {"inputs": [], "targets": [], "outputs": []}
for key in input_target_output_dict["inputs"]:
for inp in input_target_output_dict["inputs"][key][:num_examples]:
llm_eval_examples["inputs"].append(tokenizer.decode(inp, skip_special_tokens=True))
for key in input_target_output_dict["targets"]:
for tar in input_target_output_dict["targets"][key][:num_examples]:
llm_eval_examples["targets"].append(tokenizer.decode(tar, skip_special_tokens=True))
for key in input_target_output_dict["outputs"]:
for out in input_target_output_dict["outputs"][key][:num_examples]:
llm_eval_examples["outputs"].append(tokenizer.decode(out, skip_special_tokens=True))
num_examples_shown = min(len(llm_eval_examples["inputs"]), MAX_EVALUATION_EXAMPLES_SHOWN)
for i in range(num_examples_shown):
logger.info(f"Input: {llm_eval_examples['inputs'][i].strip()}")
logger.info(f"Output: {llm_eval_examples['outputs'][i].strip()}")
logger.info("--------------------")
progress_tracker.llm_eval_examples = llm_eval_examples
return append_metrics(self.model, dataset_name, metrics, metrics_log, progress_tracker)
def tune_batch_size(
self,
config: ModelConfigDict,
training_set: Dataset,
random_seed: int = default_random_seed,
max_trials: int = 20,
halving_limit: int = 3,
snapshot_weights: bool = True,
on_best_batch_size_updated: Callable[[int, float, int], None] | None = None,
tune_for_training: bool = True,
global_max_sequence_length: int | None = None,
) -> int:
if global_max_sequence_length is None:
global_max_sequence_length = self.model.global_max_sequence_length
return super().tune_batch_size(
config,
training_set,
random_seed,
max_trials,
halving_limit,
snapshot_weights,
on_best_batch_size_updated,
tune_for_training,
global_max_sequence_length,
)
def _create_batch_size_evaluator(self) -> BatchSizeEvaluator:
return LLMFinetuneTrainerBatchSizeEvaluator(self)
def _create_predict_batch_size_evaluator(self) -> BatchSizeEvaluator:
return LLMFinetunePredictBatchSizeEvaluator(self)
class RemoteLLMTrainer(NoneTrainer):
def __init__(self, gpus=None, gpu_memory_limit=None, allow_parallel_threads=True, **kwargs):
super().__init__(**kwargs)
# Only return results from rank 0 to reduce network overhead
self.train = self.distributed.return_first(self.train)
self.train_online = self.distributed.return_first(self.train_online)
class RemoteLLMFineTuneTrainer(FineTuneTrainer):
def __init__(self, gpus=None, gpu_memory_limit=None, allow_parallel_threads=True, **kwargs):
super().__init__(**kwargs)
# Only return results from rank 0 to reduce network overhead
self.train = self.distributed.return_first(self.train)
self.train_online = self.distributed.return_first(self.train_online)
================================================
FILE: ludwig/types.py
================================================
"""Public API: Common typing for Ludwig dictionary parameters."""
from typing import Any
FeatureConfigDict = dict[str, Any]
"""Dictionary of parameters used to configure an input or output feature.
https://ludwig.ai/latest/configuration/features/supported_data_types/
"""
ModelConfigDict = dict[str, Any]
"""Dictionary representation of the ModelConfig object.
https://ludwig.ai/latest/configuration/
"""
TrainingSetMetadataDict = dict[str, Any]
"""Training set metadata, which consists of internal configuration parameters."""
PreprocessingConfigDict = dict[str, Any]
"""Dictionary of parameters used to configure preprocessing.
May be type-defaults global preprocessing or feature-specific preprocessing.
https://ludwig.ai/latest/configuration/preprocessing/
"""
HyperoptConfigDict = dict[str, Any]
"""Dictionary of parameters used to configure hyperopt.
https://ludwig.ai/latest/configuration/hyperparameter_optimization/
"""
TrainerConfigDict = dict[str, Any]
"""Dictionary of parameters used to configure training.
https://ludwig.ai/latest/configuration/trainer/
"""
FeatureTypeDefaultsDict = dict[str, FeatureConfigDict]
"""Dictionary of type to parameters that configure the defaults for that feature type.
https://ludwig.ai/latest/configuration/defaults/
"""
FeatureMetadataDict = dict[str, Any]
"""Metadata for a specific feature like idx2str."""
FeaturePostProcessingOutputDict = dict[str, Any]
"""Output from feature post-processing."""
================================================
FILE: ludwig/upload.py
================================================
import argparse
import logging
import os
import sys
from ludwig.globals import MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME, MODEL_WEIGHTS_FILE_NAME
from ludwig.utils.print_utils import get_logging_level_registry
from ludwig.utils.upload_utils import HuggingFaceHub, Predibase
logger = logging.getLogger(__name__)
def get_upload_registry():
return {
"hf_hub": HuggingFaceHub,
"predibase": Predibase,
}
def upload_cli(
service: str,
repo_id: str,
model_path: str,
repo_type: str = "model",
private: bool = False,
commit_message: str = "Upload trained [Ludwig](https://ludwig.ai/latest/) model weights",
commit_description: str | None = None,
dataset_file: str | None = None,
dataset_name: str | None = None,
**kwargs,
) -> None:
"""Create a repo on the selected model hosting service, if it does not exist yet, and upload trained model
artifacts to that repo.
Args:
service (`str`):
Name of the hosted model service to push the trained artifacts to.
Currently, this only supports `hf_hub` and `predibase`.
repo_id (`str`):
A namespace (user or an organization) and a repo name separated
by a `/`.
model_path (`str`):
The path of the saved model. This is either the experiment folder that
contains the 'model' folder, or the 'model' folder itself, where the
'model_weights' folder and the 'model_hyperparameters.json' file are stored.
private (`bool`, *optional*, defaults to `False`):
Whether the model repo should be private.
repo_type (`str`, *optional*, defaults to `"model"`):
Set to `"dataset"` or `"space"` if uploading to a dataset or
space, `None` or `"model"` if uploading to a model.
commit_message (`str`, *optional*):
The summary / title / first line of the generated commit. Defaults to:
`"Upload trained [Ludwig](https://ludwig.ai/latest/) model weights"`
commit_description (`str` *optional*):
The description of the generated commit
dataset_file (`str`, *optional*):
The path to the dataset file. Required if `service` is set to
`"predibase"` for new model repos.
dataset_name (`str`, *optional*):
The name of the dataset. Used by the `service`
`"predibase"`.
"""
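# Fall back to the HuggingFaceHub class itself for unknown services; a string default would not be callable.
model_service = get_upload_registry().get(service, HuggingFaceHub)
hub: HuggingFaceHub | Predibase = model_service()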
if os.path.exists(os.path.join(model_path, MODEL_FILE_NAME, MODEL_WEIGHTS_FILE_NAME)) and os.path.exists(
os.path.join(model_path, MODEL_FILE_NAME, MODEL_HYPERPARAMETERS_FILE_NAME)
):
experiment_path = model_path
elif os.path.exists(os.path.join(model_path, MODEL_WEIGHTS_FILE_NAME)) and os.path.exists(
os.path.join(model_path, MODEL_HYPERPARAMETERS_FILE_NAME)
):
experiment_path = os.path.normpath(os.path.join(model_path, ".."))
else:
raise ValueError(
f"Can't find 'model_weights' and '{MODEL_HYPERPARAMETERS_FILE_NAME}' either at "
f"'{model_path}' or at '{model_path}/model'"
)
hub.upload(
repo_id=repo_id,
model_path=experiment_path,
repo_type=repo_type,
private=private,
commit_message=commit_message,
commit_description=commit_description,
dataset_file=dataset_file,
dataset_name=dataset_name,
)
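# The path check in upload_cli accepts either the experiment folder or the model folder
# inside it. The sketch below mirrors that logic on a temporary directory. The three
# constant values are assumptions taken from the docstring and error message above, not
# verified against ludwig.globals, and `resolve_experiment_path` is a hypothetical helper.

```python
import os
import tempfile

# Assumed values of the ludwig.globals constants, for illustration only.
MODEL_DIR = "model"
WEIGHTS = "model_weights"
HYPERPARAMETERS = "model_hyperparameters.json"


def resolve_experiment_path(model_path):
    """Return the experiment folder for either accepted layout."""
    nested = os.path.join(model_path, MODEL_DIR)
    if os.path.exists(os.path.join(nested, WEIGHTS)) and os.path.exists(
        os.path.join(nested, HYPERPARAMETERS)
    ):
        # model_path already is the experiment folder.
        return model_path
    if os.path.exists(os.path.join(model_path, WEIGHTS)) and os.path.exists(
        os.path.join(model_path, HYPERPARAMETERS)
    ):
        # model_path is the model folder, so step up one level.
        return os.path.normpath(os.path.join(model_path, ".."))
    raise ValueError(f"No model artifacts found at '{model_path}' or '{nested}'")


with tempfile.TemporaryDirectory() as experiment_dir:
    model_dir = os.path.join(experiment_dir, MODEL_DIR)
    os.makedirs(os.path.join(model_dir, WEIGHTS))
    with open(os.path.join(model_dir, HYPERPARAMETERS), "w") as f:
        f.write("{}")

    from_experiment = resolve_experiment_path(experiment_dir)
    from_model = resolve_experiment_path(model_dir)
    both_agree = from_experiment == from_model == os.path.normpath(experiment_dir)
```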
def cli(sys_argv):
parser = argparse.ArgumentParser(
description="This script pushes a trained model to a hosted model repository service",
prog="ludwig upload",
usage="%(prog)s [options]",
)
# ---------------
# Required parameters
# ---------------
parser.add_argument(
"service",
help="Name of the model repository service.",
default="hf_hub",
choices=["hf_hub", "predibase"],
)
parser.add_argument(
"-r",
"--repo_id",
help="Name of the repo. This will be created if it doesn't exist. Format: username/repo_name",
required=True,
)
parser.add_argument("-m", "--model_path", help="Path of the trained model on disk", required=True)
# ---------------
# Optional parameters
# ---------------
# Use a flag: without a type, argparse compares the raw string against choices=[True, False] and rejects it.
parser.add_argument("-p", "--private", help="Make the repo private", action="store_true")
parser.add_argument(
"-t", "--repo_type", help="Type of repo", default="model", choices=["model", "space", "dataset"]
)
parser.add_argument(
"-c",
"--commit_message",
help="The summary / title / first line of the generated commit.",
default="Upload trained [Ludwig](https://ludwig.ai/latest/) model weights",
)
parser.add_argument("-d", "--commit_description", help="The description of the generated commit", default=None)
parser.add_argument(
"-l",
"--logging_level",
default="info",
help="The level of logging to use",
choices=["critical", "error", "warning", "info", "debug", "notset"],
)
parser.add_argument("-df", "--dataset_file", help="The location of the dataset file", default=None)
parser.add_argument(
"-dn", "--dataset_name", help="(Optional) The name of the dataset in the Provider", default=None
)
args = parser.parse_args(sys_argv)
args.logging_level = get_logging_level_registry()[args.logging_level]
logging.getLogger("ludwig").setLevel(args.logging_level)
global logger
logger = logging.getLogger("ludwig.upload")
upload_cli(**vars(args))
if __name__ == "__main__":
cli(sys.argv[1:])
================================================
FILE: ludwig/utils/__init__.py
================================================
================================================
FILE: ludwig/utils/algorithms_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from ludwig.constants import TIED
def topological_sort(graph_unsorted):
"""Repeatedly go through all of the nodes in the graph, moving each node that has all of its edges
resolved onto a sequence that forms our sorted graph.
A node has all of its edges resolved and can be moved once all the nodes its edges point to have been moved from
the unsorted graph onto the sorted one.
"""
# This is the list we'll return, that stores each node/edges pair
# in topological order.
graph_sorted = []
# Convert the unsorted graph into a hash table. This gives us
# constant-time lookup for checking if edges are unresolved, and
# for removing nodes from the unsorted graph.
graph_unsorted = dict(graph_unsorted)
# Run until the unsorted graph is empty.
while graph_unsorted:
# Go through each of the node/edges pairs in the unsorted
# graph. If a set of edges does not contain any nodes that
# haven't been resolved, that is, that are still in the
# unsorted graph, remove the pair from the unsorted graph,
# and append it to the sorted graph. Note here that by
# wrapping items() in list(), a snapshot of the
# unsorted graph is iterated, allowing us to modify the unsorted
# graph as we move through it. We also keep a flag for
# checking that the graph is acyclic, which is true if any
# nodes are resolved during each pass through the graph. If
# not, we need to bail out as the graph therefore can't be
# sorted.
acyclic = False
for node, edges in list(graph_unsorted.items()):
if edges is None:
edges = []
for edge in edges:
if edge in graph_unsorted:
break
else:
acyclic = True
del graph_unsorted[node]
graph_sorted.append((node, edges))
if not acyclic:
# Uh oh, we've passed through all the unsorted nodes and
# weren't able to resolve any of them, which means there
# are nodes with cyclic edges that will never be resolved,
# so we bail out with an error.
raise RuntimeError("A cyclic dependency occurred")
return graph_sorted
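# A compact, standalone restatement of the pass-based sort above with a small usage
# example. `topological_sort_sketch` and the `deps` graph are illustrative names, not
# part of Ludwig; the control flow is intended to match the function above.

```python
def topological_sort_sketch(graph):
    """Return (node, edges) pairs ordered so that dependencies come first."""
    remaining = dict(graph)
    ordered = []
    while remaining:
        progressed = False
        # Iterate over a snapshot so entries can be deleted from `remaining`.
        for node, edges in list(remaining.items()):
            edges = edges or []
            # A node is ready once none of its edges are still unsorted.
            if not any(edge in remaining for edge in edges):
                progressed = True
                del remaining[node]
                ordered.append((node, edges))
        if not progressed:
            # A full pass resolved nothing, so the rest must form a cycle.
            raise RuntimeError("A cyclic dependency occurred")
    return ordered


# "c" depends on "a" and "b"; "b" depends on "a"; "a" has no dependencies.
deps = {"c": ["a", "b"], "b": ["a"], "a": None}
order = [node for node, _ in topological_sort_sketch(deps)]
```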
def topological_sort_feature_dependencies(features):
# topological sorting of output features for resolving dependencies
dependencies_graph = {}
output_features_dict = {}
for feature in features:
dependencies = []
if "dependencies" in feature:
dependencies.extend(feature["dependencies"])
if TIED in feature:
dependencies.append(feature[TIED])
dependencies_graph[feature["name"]] = dependencies
output_features_dict[feature["name"]] = feature
return [output_features_dict[node[0]] for node in topological_sort(dependencies_graph)]
================================================
FILE: ludwig/utils/audio_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import functools
import logging
from io import BytesIO
from typing import Any
import torch
import torch.nn.functional as F
import torchaudio
from packaging import version
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DEFAULT_AUDIO_TENSOR_LENGTH
from ludwig.utils.types import TorchAudioTuple
logger = logging.getLogger(__name__)
# https://github.com/pytorch/audio/blob/main/torchaudio/csrc/sox/types.cpp
AUDIO_EXTENSIONS = (".wav", ".amb", ".mp3", ".ogg", ".vorbis", ".flac", ".opus", ".sphere")
_TORCH_AUDIO_210 = version.parse(torchaudio.__version__) >= version.parse("2.1.0")
_TORCH_AUDIO_201 = version.parse(torchaudio.__version__) >= version.parse("2.0.1")
@DeveloperAPI
def is_torch_audio_tuple(audio: Any) -> bool:
if isinstance(audio, tuple):
if len(audio) == 2 and isinstance(audio[0], torch.Tensor) and isinstance(audio[1], int):
return True
return False
@DeveloperAPI
def get_default_audio(audio_lst: list[TorchAudioTuple]) -> TorchAudioTuple:
if not audio_lst:
# Return a silent audio tensor as default when no valid audio is available
default_audio_tensor = torch.zeros(1, DEFAULT_AUDIO_TENSOR_LENGTH)
return default_audio_tensor, 16000
sampling_rates = [audio[1] for audio in audio_lst]
tensor_list = [audio[0] for audio in audio_lst]
for i, tensor in enumerate(tensor_list):
if tensor.shape[1] > DEFAULT_AUDIO_TENSOR_LENGTH:
tensor_list[i] = tensor[:, :DEFAULT_AUDIO_TENSOR_LENGTH]
else:
pad_size = DEFAULT_AUDIO_TENSOR_LENGTH - tensor.shape[1]
tensor_list[i] = F.pad(tensor, (0, pad_size))
default_audio_tensor = torch.mean(torch.stack(tensor_list), dim=0)
default_sampling_rate = calculate_mean(sum(sampling_rates), len(sampling_rates))
return default_audio_tensor, default_sampling_rate
@DeveloperAPI
def read_audio_from_path(path: str) -> TorchAudioTuple | None:
"""Reads audio from path.
Useful for reading from a small number of paths. For more intensive reads, use backend.read_binary_files instead.
"""
try:
if _TORCH_AUDIO_210:
return torchaudio.load(path, backend="sox")
elif _TORCH_AUDIO_201:
return torchaudio.backend.sox_io_backend.load(path)
else:
return torchaudio.backend.sox_backend.load(path)
except Exception as e:
logger.warning(e)
return None
@DeveloperAPI
@functools.lru_cache(maxsize=32)
def read_audio_from_bytes_obj(bytes_obj: bytes) -> TorchAudioTuple | None:
try:
f = BytesIO(bytes_obj)
return torchaudio.load(f)
except Exception as e:
logger.warning(e)
return None
def _pre_emphasize_data(data: torch.Tensor, emphasize_value: float = 0.97):
# Increase precision in order to achieve parity with scipy.signal.lfilter implementation
filter_window = torch.tensor([1.0, -emphasize_value], dtype=torch.float64, device=data.device)
a_coeffs = torch.tensor([1, 0], dtype=torch.float64, device=data.device)
pre_emphasized_data = torchaudio.functional.lfilter(
data.to(dtype=torch.float64),
a_coeffs,
filter_window,
clamp=False,
).to(torch.float32)
return pre_emphasized_data
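# With b = [1, -emphasize_value] and a = [1, 0], the lfilter call above reduces to the
# first-order difference y[n] = x[n] - emphasize_value * x[n-1]. The pure-Python sketch
# below shows that recurrence on a plain list; `pre_emphasize` is a hypothetical helper
# and assumes the sample before the start of the signal is 0.

```python
def pre_emphasize(samples, coefficient=0.97):
    """Apply y[n] = x[n] - coefficient * x[n-1], treating x[-1] as 0."""
    previous = 0.0
    emphasized = []
    for sample in samples:
        emphasized.append(sample - coefficient * previous)
        previous = sample
    return emphasized


# A constant signal keeps its first sample and is strongly attenuated afterwards,
# which is the high-pass behavior pre-emphasis is meant to provide.
constant_signal = [1.0, 1.0, 1.0, 1.0]
emphasized_signal = pre_emphasize(constant_signal)
```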
@DeveloperAPI
def get_length_in_samp(sampling_rate_in_hz: float | int, length_in_s: float | int) -> int:
return int(sampling_rate_in_hz * length_in_s)
@DeveloperAPI
def get_group_delay(
raw_data: torch.Tensor,
sampling_rate_in_hz: int,
window_length_in_s: float,
window_shift_in_s: float,
num_fft_points: int,
window_type: str,
):
X_stft_transform = _get_stft(
raw_data, sampling_rate_in_hz, window_length_in_s, window_shift_in_s, num_fft_points, window_type=window_type
)
Y_stft_transform = _get_stft(
raw_data,
sampling_rate_in_hz,
window_length_in_s,
window_shift_in_s,
num_fft_points,
window_type=window_type,
data_transformation="group_delay",
)
X_stft_transform_real = torch.real(X_stft_transform)
X_stft_transform_imag = torch.imag(X_stft_transform)
Y_stft_transform_real = torch.real(Y_stft_transform)
Y_stft_transform_imag = torch.imag(Y_stft_transform)
numerator = torch.multiply(X_stft_transform_real, Y_stft_transform_real) + torch.multiply(
X_stft_transform_imag, Y_stft_transform_imag
)
denominator = torch.square(torch.abs(X_stft_transform))
group_delay = torch.divide(numerator, denominator + 1e-10)
assert not torch.isnan(group_delay).any(), "There are NaN values in group delay"
return torch.transpose(group_delay, 0, 1)
@DeveloperAPI
def get_phase_stft_magnitude(
raw_data: torch.Tensor,
sampling_rate_in_hz: int,
window_length_in_s: float,
window_shift_in_s: float,
num_fft_points: int,
window_type: str,
) -> torch.Tensor:
stft = _get_stft(
raw_data, sampling_rate_in_hz, window_length_in_s, window_shift_in_s, num_fft_points, window_type=window_type
)
abs_stft = torch.abs(stft)
phase = torch.angle(stft)
stft_phase = torch.cat([phase, abs_stft], dim=1)
return torch.transpose(stft_phase, 0, 1)
@DeveloperAPI
def get_stft_magnitude(
raw_data: torch.Tensor,
sampling_rate_in_hz: int,
window_length_in_s: float,
window_shift_in_s: float,
num_fft_points: int,
window_type: str,
):
stft = _get_stft(
raw_data, sampling_rate_in_hz, window_length_in_s, window_shift_in_s, num_fft_points, window_type=window_type
)
stft_magnitude = torch.abs(stft)
return torch.transpose(stft_magnitude, 0, 1)
################################################################################
# The following code for FBank is adapted from jameslyons/python_speech_features
# MIT licensed implementation
# https://github.com/jameslyons/python_speech_features/blob/40c590269b57c64a8c1f1ddaaff2162008d1850c/python_speech_features/base.py#L84
################################################################################
@DeveloperAPI
def get_fbank(
raw_data: torch.Tensor,
sampling_rate_in_hz: int,
window_length_in_s: float,
window_shift_in_s: float,
num_fft_points: int,
window_type: str,
num_filter_bands: int,
) -> torch.Tensor:
stft = _get_stft(
raw_data,
sampling_rate_in_hz,
window_length_in_s,
window_shift_in_s,
num_fft_points,
window_type=window_type,
zero_mean_offset=True,
)
stft_power = torch.abs(stft) ** 2
upper_limit_freq = int(sampling_rate_in_hz / 2)
upper_limit_mel = _convert_hz_to_mel(upper_limit_freq)
lower_limit_mel = 0
list_mel_points = torch.linspace(lower_limit_mel, upper_limit_mel, num_filter_bands + 2, device=raw_data.device)
mel_fbank_matrix = _get_mel_fbank_matrix(list_mel_points, num_filter_bands, num_fft_points, sampling_rate_in_hz)
mel_fbank_feature = torch.matmul(stft_power, torch.transpose(mel_fbank_matrix, 0, 1))
log_mel_fbank_feature = torch.log(mel_fbank_feature + 1.0e-10)
return torch.transpose(log_mel_fbank_feature, 0, 1)
def _get_mel_fbank_matrix(
list_mel_points: torch.Tensor, num_filter_bands: int, num_fft_points: int, sampling_rate_in_hz: int
) -> torch.Tensor:
num_ess_fft_points = get_non_symmetric_length(num_fft_points)
freq_scale = (num_fft_points + 1) / sampling_rate_in_hz
freq_bins_on_mel_scale = torch.floor(freq_scale * _convert_mel_to_hz(list_mel_points))
mel_scaled_fbank = torch.zeros(
(num_filter_bands, num_ess_fft_points), dtype=torch.float32, device=list_mel_points.device
)
for filt_idx in range(num_filter_bands):
start_bin_freq = freq_bins_on_mel_scale[filt_idx]
middle_bin_freq = freq_bins_on_mel_scale[filt_idx + 1]
end_bin_freq = freq_bins_on_mel_scale[filt_idx + 2]
mel_scaled_fbank[filt_idx] = _create_triangular_filter(
start_bin_freq, middle_bin_freq, end_bin_freq, num_ess_fft_points
)
return mel_scaled_fbank
def _create_triangular_filter(
start_bin_freq: torch.Tensor, middle_bin_freq: torch.Tensor, end_bin_freq: torch.Tensor, num_ess_fft_points: int
):
filter_window = torch.zeros(num_ess_fft_points, dtype=torch.float32, device=start_bin_freq.device)
filt_support_begin = middle_bin_freq - start_bin_freq
filt_support_end = end_bin_freq - middle_bin_freq
for freq in range(int(start_bin_freq), int(middle_bin_freq)):
filter_window[freq] = (freq - start_bin_freq) / filt_support_begin
for freq in range(int(middle_bin_freq), int(end_bin_freq)):
filter_window[freq] = (end_bin_freq - freq) / filt_support_end
return filter_window
def _convert_hz_to_mel(hz: int) -> float:
return float(2595.0 * torch.log10(torch.tensor(1 + hz / 700.0)))
def _convert_mel_to_hz(mel):
return 700.0 * (10 ** (mel / 2595.0) - 1)
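# The two conversions above are inverses of each other: mel = 2595 * log10(1 + hz / 700).
# The sketch below restates them with the standard library instead of torch and checks
# the round trip at an assumed Nyquist frequency of 8000 Hz; `hz_to_mel` and `mel_to_hz`
# are illustrative names.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Convert a mel-scale value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


nyquist_hz = 8000.0
nyquist_mel = hz_to_mel(nyquist_hz)
round_trip_hz = mel_to_hz(nyquist_mel)
```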
def _get_stft(
raw_data: torch.Tensor,
sampling_rate_in_hz: int,
window_length_in_s: float,
window_shift_in_s: float,
num_fft_points: int,
window_type: str,
data_transformation: str | None = None,
zero_mean_offset: bool = False,
) -> torch.Tensor:
pre_emphasized_data = _pre_emphasize_data(raw_data)
stft = _short_time_fourier_transform(
pre_emphasized_data,
sampling_rate_in_hz,
window_length_in_s,
window_shift_in_s,
num_fft_points,
window_type,
data_transformation,
zero_mean_offset,
)
non_symmetric_stft = get_non_symmetric_data(stft)
return non_symmetric_stft
def _short_time_fourier_transform(
data: torch.Tensor,
sampling_rate_in_hz: int,
window_length_in_s: float,
window_shift_in_s: float,
num_fft_points: int,
window_type: str,
data_transformation: str | None = None,
zero_mean_offset: bool = False,
) -> torch.Tensor:
window_length_in_samp: int = get_length_in_samp(window_length_in_s, sampling_rate_in_hz)
window_shift_in_samp: int = get_length_in_samp(window_shift_in_s, sampling_rate_in_hz)
preprocessed_data_matrix = _preprocess_to_padded_matrix(
data[0], window_length_in_samp, window_shift_in_samp, zero_mean_offset=zero_mean_offset
)
weighted_data_matrix = _weight_data_matrix(
preprocessed_data_matrix, window_type, data_transformation=data_transformation
)
fft = torch.fft.fft(weighted_data_matrix, n=num_fft_points)
return fft
def _preprocess_to_padded_matrix(
data: torch.Tensor, window_length_in_samp: int, window_shift_in_samp: int, zero_mean_offset: bool = False
) -> torch.Tensor:
num_input = data.shape[0]
num_output = get_num_output_padded_to_fit_input(num_input, window_length_in_samp, window_shift_in_samp)
zero_padded_matrix = torch.zeros((num_output, window_length_in_samp), dtype=torch.float32, device=data.device)
for num_output_idx in range(num_output):
start_idx = window_shift_in_samp * num_output_idx
is_last_output = num_output_idx == num_output - 1
end_idx = start_idx + window_length_in_samp if not is_last_output else num_input
end_padded_idx = window_length_in_samp if not is_last_output else end_idx - start_idx
window_data = data[start_idx:end_idx]
if zero_mean_offset:
window_data = window_data - torch.mean(window_data)
zero_padded_matrix[num_output_idx, :end_padded_idx] = window_data
return zero_padded_matrix
@DeveloperAPI
def get_num_output_padded_to_fit_input(num_input: int, window_length_in_samp: int, window_shift_in_samp: int) -> int:
num_output_valid = torch.tensor((num_input - window_length_in_samp) / window_shift_in_samp + 1)
return int(torch.ceil(num_output_valid))
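# Worked example (not part of the Ludwig source): the frame count is
# ceil((num_input - window_length) / window_shift + 1), so a trailing partial window still gets its own
# zero-padded row. The stdlib-only stand-in below uses a hypothetical name (num_frames_padded).

```python
import math


def num_frames_padded(num_input: int, window_length: int, window_shift: int) -> int:
    # Rounding up means a final, partially filled window is counted.
    return math.ceil((num_input - window_length) / window_shift + 1)


# One second of 16 kHz audio with 25 ms windows (400 samples) and a 10 ms shift (160 samples).
frames_one_second = num_frames_padded(16000, 400, 160)
```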
@DeveloperAPI
def get_window(window_type: str, window_length_in_samp: int, device: torch.device | None = None) -> torch.Tensor:
# Increase precision in order to achieve parity with scipy.signal.windows.get_window implementation
if window_type == "bartlett":
return torch.bartlett_window(window_length_in_samp, periodic=False, dtype=torch.float64, device=device).to(
torch.float32
)
elif window_type == "blackman":
return torch.blackman_window(window_length_in_samp, periodic=False, dtype=torch.float64, device=device).to(
torch.float32
)
elif window_type == "hamming":
return torch.hamming_window(window_length_in_samp, periodic=False, dtype=torch.float64, device=device).to(
torch.float32
)
elif window_type == "hann":
return torch.hann_window(window_length_in_samp, periodic=False, dtype=torch.float64, device=device).to(
torch.float32
)
else:
raise ValueError(f"Unknown window type: {window_type}")
@DeveloperAPI
def is_audio_score(src_path):
# Used for AutoML
return int(isinstance(src_path, str) and src_path.lower().endswith(AUDIO_EXTENSIONS))
def _weight_data_matrix(
data_matrix: torch.Tensor, window_type: str, data_transformation: str | None = None
) -> torch.Tensor:
window_length_in_samp = data_matrix[0].shape[0]
window = get_window(window_type, window_length_in_samp, device=data_matrix.device)
if data_transformation is not None and data_transformation == "group_delay":
window *= torch.arange(window_length_in_samp, device=data_matrix.device).float()
return data_matrix * window
@DeveloperAPI
def get_non_symmetric_length(symmetric_length: int) -> int:
return int(symmetric_length / 2) + 1
@DeveloperAPI
def get_non_symmetric_data(data: torch.Tensor) -> torch.Tensor:
num_fft_points = data.shape[-1]
num_ess_fft_points = get_non_symmetric_length(num_fft_points)
return data[:, :num_ess_fft_points]
@DeveloperAPI
def get_max_length_stft_based(length_in_samp, window_length_in_s, window_shift_in_s, sampling_rate_in_hz):
window_length_in_samp = get_length_in_samp(window_length_in_s, sampling_rate_in_hz)
window_shift_in_samp = get_length_in_samp(window_shift_in_s, sampling_rate_in_hz)
return get_num_output_padded_to_fit_input(length_in_samp, window_length_in_samp, window_shift_in_samp)
@DeveloperAPI
def calculate_incr_var(var_prev, mean_prev, mean, length):
return var_prev + (length - mean_prev) * (length - mean)
@DeveloperAPI
def calculate_incr_mean(count, mean, length):
return mean + (length - mean) / float(count)
@DeveloperAPI
def calculate_var(sum1, sum2, count):
return (sum2 - ((sum1 * sum1) / float(count))) / float(count - 1) if count > 1 else 0.0
@DeveloperAPI
def calculate_mean(sum1, count):
return sum1 / float(count)
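# Illustrative sketch (not part of the Ludwig source): calculate_incr_mean and calculate_incr_var have the
# shape of Welford's online update, where the "var" accumulator is a running sum of squared deviations.
# How Ludwig's callers use them is not shown here, so the loop below is an assumption; the helper names
# (incr_mean, incr_m2) are hypothetical copies of the formulas above.

```python
import statistics


def incr_mean(count, mean, value):
    return mean + (value - mean) / float(count)


def incr_m2(m2_prev, mean_prev, mean, value):
    # Accumulates the sum of squared deviations, not the variance itself.
    return m2_prev + (value - mean_prev) * (value - mean)


lengths = [3.0, 5.0, 8.0, 13.0]
running_mean, running_m2 = 0.0, 0.0
for count, value in enumerate(lengths, start=1):
    mean_prev = running_mean
    running_mean = incr_mean(count, mean_prev, value)
    running_m2 = incr_m2(running_m2, mean_prev, running_mean, value)

# Dividing the accumulator by (n - 1) gives the sample variance.
sample_variance = running_m2 / (len(lengths) - 1)
```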
================================================
FILE: ludwig/utils/augmentation_utils.py
================================================
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.registry import Registry
###
# Registry for augmentation operations
# Each augmentation operation is registered with the feature type it is applicable to
# and the name of the operation.
###
_augmentation_op_registry = Registry()
@DeveloperAPI
def get_augmentation_op_registry() -> Registry:
return _augmentation_op_registry
@DeveloperAPI
def register_augmentation_op(name: str, features: str | list[str]):
if isinstance(features, str):
features = [features]
def wrap(cls):
for feature in features:
augmentation_op_registry = get_augmentation_op_registry().get(feature, {})
augmentation_op_registry[name] = cls
get_augmentation_op_registry()[feature] = augmentation_op_registry
return cls
return wrap
@DeveloperAPI
def get_augmentation_op(feature_type: str, op_name: str):
return get_augmentation_op_registry()[feature_type][op_name]
class AugmentationPipelines:
"""Container holding augmentation pipelines defined in the model."""
def __init__(self, augmentation_pipelines: dict):
self.augmentation_pipelines = augmentation_pipelines
def __getitem__(self, key):
return self.augmentation_pipelines[key]
def __contains__(self, key):
return key in self.augmentation_pipelines
def __len__(self):
return len(self.augmentation_pipelines)
def __iter__(self):
return self.augmentation_pipelines.__iter__()
def items(self):
return self.augmentation_pipelines.items()
================================================
FILE: ludwig/utils/automl/__init__.py
================================================
================================================
FILE: ludwig/utils/automl/data_source.py
================================================
from abc import ABC, abstractmethod
import dask.dataframe as dd
import pandas as pd
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.audio_utils import is_audio_score
from ludwig.utils.automl.utils import avg_num_tokens
from ludwig.utils.image_utils import is_image_score
from ludwig.utils.misc_utils import memoized_method
from ludwig.utils.types import DataFrame
@DeveloperAPI
class DataSource(ABC):
@property
@abstractmethod
def columns(self) -> list[str]:
raise NotImplementedError()
@abstractmethod
def get_dtype(self, column: str) -> str:
raise NotImplementedError()
@abstractmethod
def get_distinct_values(self, column: str, max_values_to_return: int) -> tuple[int, list[str], float]:
raise NotImplementedError()
@abstractmethod
def get_nonnull_values(self, column: str) -> int:
raise NotImplementedError()
@abstractmethod
def get_avg_num_tokens(self, column: str) -> int:
raise NotImplementedError()
@abstractmethod
def is_string_type(self, dtype: str) -> bool:
raise NotImplementedError()
@abstractmethod
def size_bytes(self) -> int:
raise NotImplementedError()
@abstractmethod
def __len__(self) -> int:
raise NotImplementedError()
@DeveloperAPI
class DataframeSourceMixin:
df: DataFrame
@property
def columns(self) -> list[str]:
return self.df.columns
def get_dtype(self, column: str) -> str:
return self.df[column].dtype.name
def get_distinct_values(self, column, max_values_to_return: int) -> tuple[int, list[str], float]:
unique_values = self.df[column].dropna().unique()
num_unique_values = len(unique_values)
unique_values_counts = self.df[column].value_counts()
if len(unique_values_counts) != 0:
unique_majority_values = unique_values_counts[unique_values_counts.idxmax()]
unique_minority_values = unique_values_counts[unique_values_counts.idxmin()]
unique_values_balance = unique_minority_values / unique_majority_values
else:
unique_values_balance = 1.0
return num_unique_values, unique_values[:max_values_to_return], unique_values_balance
def get_nonnull_values(self, column: str) -> int:
return int(self.df[column].notnull().sum())
def get_image_values(self, column: str, sample_size: int = 10) -> int:
return int(sum(is_image_score(x) for x in self.df[column].head(sample_size)))
def get_audio_values(self, column: str, sample_size: int = 10) -> int:
return int(sum(is_audio_score(x) for x in self.df[column].head(sample_size)))
def get_avg_num_tokens(self, column: str) -> int:
return avg_num_tokens(self.df[column])
def is_string_type(self, dtype: str) -> bool:
return dtype in ["str", "string", "object"]
def size_bytes(self) -> int:
return sum(self.df.memory_usage(deep=True))
def __len__(self) -> int:
return len(self.df)
@DeveloperAPI
class DataframeSource(DataframeSourceMixin, DataSource):
def __init__(self, df):
self.df = df
@DeveloperAPI
class DaskDataSource(DataframeSource):
@memoized_method(maxsize=1)
def get_sample(self) -> pd.DataFrame:
# TODO: uniform random sample
return self.df.head(10000)
@property
def sample(self) -> pd.DataFrame:
return self.get_sample()
def get_distinct_values(self, column, max_values_to_return) -> tuple[int, list[str], float]:
unique_values = self.df[column].drop_duplicates().dropna().persist()
num_unique_values = len(unique_values)
# TODO(travis): implement imbalance ratio
imbalance_ratio = 1.0
return num_unique_values, unique_values.head(max_values_to_return), imbalance_ratio
def get_nonnull_values(self, column) -> int:
return self.df[column].notnull().sum().compute()
def get_image_values(self, column: str, sample_size: int = 10) -> int:
return int(sum(is_image_score(x) for x in self.sample[column].head(sample_size)))
def get_audio_values(self, column: str, sample_size: int = 10) -> int:
return int(sum(is_audio_score(x) for x in self.sample[column].head(sample_size)))
def get_avg_num_tokens(self, column) -> int:
return avg_num_tokens(self.sample[column])
@DeveloperAPI
def wrap_data_source(df: DataFrame) -> DataSource:
if isinstance(df, dd.DataFrame):
return DaskDataSource(df)
return DataframeSource(df)
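# Illustrative sketch (not part of the Ludwig source): the balance returned by
# DataframeSourceMixin.get_distinct_values is the count of the rarest value divided by the count of the
# most common one. The stdlib-only stand-in below uses a hypothetical name (minority_majority_balance)
# and treats None as missing, which is an assumption standing in for pandas' NaN handling.

```python
from collections import Counter


def minority_majority_balance(values) -> float:
    counts = Counter(v for v in values if v is not None)
    if not counts:
        # Mirrors the fallback used above when there are no counted values.
        return 1.0
    return min(counts.values()) / max(counts.values())


balance = minority_majority_balance(["a", "a", "a", "b"])
```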
================================================
FILE: ludwig/utils/automl/field_info.py
================================================
from dataclasses import dataclass
from dataclasses_json import dataclass_json, LetterCase
from ludwig.api_annotations import DeveloperAPI
@DeveloperAPI
@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass
class FieldInfo:
name: str
dtype: str
key: str = None
distinct_values: list = None
distinct_values_balance: float = 1.0
num_distinct_values: int = 0
nonnull_values: int = 0
image_values: int = 0
audio_values: int = 0
avg_words: int = None
@DeveloperAPI
@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass
class FieldConfig:
name: str
column: str
type: str
@DeveloperAPI
@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass
class FieldMetadata:
name: str
config: FieldConfig
excluded: bool
mode: str
missing_values: float
imbalance_ratio: float
================================================
FILE: ludwig/utils/automl/ray_utils.py
================================================
import os
from ludwig.backend.ray import initialize_ray
try:
import ray
except ImportError:
raise ImportError("ray is not installed. In order to use auto_train please run pip install ludwig[ray]")
def _ray_init():
if ray.is_initialized():
return
# Forcibly terminate trial requested to stop after this amount of time passes
os.environ.setdefault("TUNE_FORCE_TRIAL_CLEANUP_S", "120")
initialize_ray()
================================================
FILE: ludwig/utils/automl/type_inference.py
================================================
import logging
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUDIO, BINARY, CATEGORY, DATE, IMAGE, NUMBER, TEXT
from ludwig.utils import strings_utils
from ludwig.utils.automl.field_info import FieldInfo
# For a given feature, the highest percentage of distinct values out of the total number of rows that we might still
# assign the CATEGORY type.
CATEGORY_TYPE_DISTINCT_VALUE_PERCENTAGE_CUTOFF = 0.5
# Consider the field a valid text field if it has at least 5 average words. Fewer than this and it may be a category
# or an ID field (like a name or place) of some kind.
TEXT_AVG_WORDS_CUTOFF = 5
@DeveloperAPI
def infer_type(field: FieldInfo, missing_value_percent: float, row_count: int) -> str:
"""Perform type inference on field.
# Inputs
:param field: (FieldInfo) object describing field
:param missing_value_percent: (float) percent of missing values in the column
:param row_count: (int) total number of entries in original dataset
# Return
:return: (str) feature type
"""
if field.dtype == DATE or field.dtype.startswith("datetime"):
return DATE
num_distinct_values = field.num_distinct_values
distinct_values = field.distinct_values
if num_distinct_values <= 1:
return CATEGORY
if num_distinct_values == 2 and missing_value_percent == 0:
# Check that all distinct values are conventional bools.
if strings_utils.are_conventional_bools(distinct_values):
return BINARY
if field.image_values >= 3:
return IMAGE
if field.audio_values >= 3:
return AUDIO
if strings_utils.are_all_datetimes(distinct_values):
return DATE
# Use CATEGORY if:
# - The number of distinct values is significantly less than the total number of examples.
# - The distinct values are not all numbers.
# - The distinct values are all numbers but consist of a perfectly sequential list of integers that suggests the
# values represent categories.
valid_row_count = row_count * (1.0 - missing_value_percent)
if num_distinct_values < valid_row_count * CATEGORY_TYPE_DISTINCT_VALUE_PERCENTAGE_CUTOFF and (
(not strings_utils.are_all_numbers(distinct_values)) or strings_utils.are_sequential_integers(distinct_values)
):
return CATEGORY
# Use NUMBER if all of the distinct values are numbers.
if strings_utils.are_all_numbers(distinct_values):
return NUMBER
# TODO (ASN): add other modalities (image, etc. )
# Fallback to TEXT.
return TEXT
@DeveloperAPI
def should_exclude(
idx: int, field: FieldInfo, dtype: str, column_count: int, row_count: int, targets: set[str]
) -> bool:
if field.key == "PRI":
logging.info(f"Exclude {field.name} ({dtype}): primary key")
return True
if field.name in targets:
return False
if field.num_distinct_values <= 1:
logging.info(f"Exclude {field.name} ({dtype}): less than 2 distinct values")
return True
distinct_value_percent = float(field.num_distinct_values) / row_count
if distinct_value_percent == 1.0:
upper_name = field.name.upper()
if (
(idx == 0 and "INDEX" in upper_name and dtype == NUMBER)
or upper_name.endswith("ID")
or upper_name.startswith("ID")
):
logging.info(f"Exclude {field.name} ({dtype}): unique ID column")
return True
# For TEXT fields, we only want to use them if they appear "interesting", otherwise we would rather exclude
# them and treat the problem as a tabular problem
if column_count > 3 and dtype == TEXT and (field.avg_words or 0) < TEXT_AVG_WORDS_CUTOFF:
logging.info(f"Exclude {field.name} ({dtype}): too few average words")
return True
return False
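# Illustrative sketch (not part of the Ludwig source): the CATEGORY rule in infer_type combines a
# distinct-value cutoff with a check on whether the values look numeric. The helpers below
# (_all_numbers, _sequential_integers, looks_categorical) are simplified, hypothetical stand-ins for the
# strings_utils functions, whose exact behavior is not shown here.

```python
CUTOFF = 0.5  # same value as CATEGORY_TYPE_DISTINCT_VALUE_PERCENTAGE_CUTOFF above


def _all_numbers(values) -> bool:
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return False
    return True


def _sequential_integers(values) -> bool:
    try:
        ints = sorted(int(value) for value in values)
    except (TypeError, ValueError):
        return False
    return len(ints) > 0 and ints == list(range(ints[0], ints[0] + len(ints)))


def looks_categorical(num_distinct, row_count, missing_fraction, distinct_values) -> bool:
    valid_rows = row_count * (1.0 - missing_fraction)
    few_distinct = num_distinct < valid_rows * CUTOFF
    return few_distinct and (not _all_numbers(distinct_values) or _sequential_integers(distinct_values))
```

# Example: three color names in 100 rows pass the rule; three arbitrary decimals do not.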
================================================
FILE: ludwig/utils/automl/utils.py
================================================
import bisect
import logging
from numpy import nan_to_num
from pandas import Series
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
BINARY,
CATEGORY,
COMBINER,
CONFIG,
HYPEROPT,
IMBALANCE_DETECTION_RATIO,
NAME,
NUMBER,
PARAMETERS,
SEARCH_ALG,
TRAINER,
TYPE,
)
from ludwig.features.feature_registries import get_output_type_registry
from ludwig.modules.metric_registry import get_metric_objective
from ludwig.schema.combiners.utils import get_combiner_jsonschema
logger = logging.getLogger(__name__)
@DeveloperAPI
def avg_num_tokens_decoder(x):
if x is None:
return None
if type(x) is bytes:
return x.decode("utf-8")
return str(x)
@DeveloperAPI
def avg_num_tokens(field: Series) -> int:
logger.info(f"Calculating average number of tokens for field {field.name} using sample of 100 rows.")
field_sample = field.head(100).apply(avg_num_tokens_decoder)
unique_entries = field_sample.unique()
avg_words = round(nan_to_num(Series(unique_entries).str.split().str.len().mean()))
return avg_words
@DeveloperAPI
def get_model_type(config: dict) -> str:
if (
"input_features" in config
and len(config["input_features"]) == 1
and "type" in config["input_features"][0]
and config["input_features"][0]["type"] == "text"
):
model_type = "text"
elif COMBINER in config and TYPE in config[COMBINER]:
model_type = config[COMBINER][TYPE]
else:
default_combiner_type = get_combiner_jsonschema()["properties"]["type"]["default"]
model_type = default_combiner_type
return model_type
# ref_configs comes from a file storing the config for a high-performing model per reference dataset.
# If the automl model type matches that of any reference models, set the initial point_to_evaluate
# in the automl hyperparameter search to the config of the reference model with the closest-matching
# input number columns ratio. This model config "transfer learning" can improve the automl search.
def _add_transfer_config(base_config: dict, ref_configs: dict) -> dict:
base_model_type = base_config[COMBINER][TYPE]
base_model_numeric_ratio = _get_ratio_numeric_input_features(base_config["input_features"])
min_numeric_ratio_distance = 1.0
min_dataset = None
for dataset in ref_configs["datasets"]:
dataset_config = dataset[CONFIG]
if base_model_type == dataset_config[COMBINER][TYPE]:
dataset_numeric_ratio = _get_ratio_numeric_input_features(dataset_config["input_features"])
ratio_distance = abs(base_model_numeric_ratio - dataset_numeric_ratio)
if ratio_distance <= min_numeric_ratio_distance:
min_numeric_ratio_distance = ratio_distance
min_dataset = dataset
if min_dataset is not None:
logger.info("Transfer config from dataset {}".format(min_dataset["name"]))
min_dataset_config = min_dataset[CONFIG]
hyperopt_params = base_config[HYPEROPT][PARAMETERS]
point_to_evaluate = {}
_add_option_to_evaluate(point_to_evaluate, min_dataset_config, hyperopt_params, COMBINER)
_add_option_to_evaluate(point_to_evaluate, min_dataset_config, hyperopt_params, TRAINER)
base_config[HYPEROPT][SEARCH_ALG]["points_to_evaluate"] = [point_to_evaluate]
return base_config
def _get_ratio_numeric_input_features(input_features: list[dict]) -> float:
num_input_features = len(input_features)
num_numeric_input = 0
for input_feature in input_features:
if input_feature[TYPE] == NUMBER:
num_numeric_input = num_numeric_input + 1
return num_numeric_input / num_input_features
# Update point_to_evaluate w/option value from dataset_config for options in hyperopt_params.
# Also, add option value to associated categories list if it is not already included.
def _add_option_to_evaluate(
point_to_evaluate: dict, dataset_config: dict, hyperopt_params: dict, option_type: str
) -> dict:
options = dataset_config[option_type]
for option in options.keys():
option_param = option_type + "." + option
if option_param in hyperopt_params.keys():
option_val = options[option]
point_to_evaluate[option_param] = option_val
if option_val not in hyperopt_params[option_param]["categories"]:
bisect.insort(hyperopt_params[option_param]["categories"], option_val)
return point_to_evaluate
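# Illustrative sketch (not part of the Ludwig source): _add_option_to_evaluate copies a reference value
# into the starting point and, if the search space does not already contain it, inserts it into the sorted
# "categories" list with bisect.insort. The parameter names and values below are made up for illustration.

```python
import bisect

hyperopt_params = {"trainer.learning_rate": {"categories": [0.0001, 0.001, 0.01]}}
reference_trainer = {"learning_rate": 0.005, "epochs": 50}

point_to_evaluate = {}
for option, value in reference_trainer.items():
    key = "trainer." + option
    if key not in hyperopt_params:
        # Options that are not being searched over are ignored.
        continue
    point_to_evaluate[key] = value
    categories = hyperopt_params[key]["categories"]
    if value not in categories:
        # insort keeps the list sorted, assuming it was sorted to begin with.
        bisect.insort(categories, value)
```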
@DeveloperAPI
def set_output_feature_metric(base_config):
"""If single output feature, set trainer and hyperopt metric and goal for that feature if not set."""
if len(base_config["output_features"]) != 1:
# If multiple output features, ludwig uses the goal of minimizing combined loss;
# this could be revisited/refined in the future.
return base_config
output_name = base_config["output_features"][0][NAME]
output_type = base_config["output_features"][0][TYPE]
output_metric = get_output_type_registry()[output_type].get_schema_cls().default_validation_metric
output_goal = get_metric_objective(output_metric)
if "validation_field" not in base_config[TRAINER] and "validation_metric" not in base_config[TRAINER]:
base_config[TRAINER]["validation_field"] = output_name
base_config[TRAINER]["validation_metric"] = output_metric
if (
"output_feature" not in base_config[HYPEROPT]
and "metric" not in base_config[HYPEROPT]
and "goal" not in base_config[HYPEROPT]
):
base_config[HYPEROPT]["output_feature"] = output_name
base_config[HYPEROPT]["metric"] = output_metric
base_config[HYPEROPT]["goal"] = output_goal
return base_config
@DeveloperAPI
def has_imbalanced_output(base_config, features_metadata) -> bool:
"""Check binary and category output feature(s) for imbalance, i.e., low minority/majority instance count
ratio."""
imbalanced_output = False
for output_feature in base_config["output_features"]:
if output_feature[TYPE] == BINARY or output_feature[TYPE] == CATEGORY:
for feature_metadata in features_metadata:
if output_feature[NAME] == feature_metadata.name:
if feature_metadata.imbalance_ratio < IMBALANCE_DETECTION_RATIO:
logger.info(
f"Imbalance in {output_feature[NAME]}: minority/majority={feature_metadata.imbalance_ratio}"
)
imbalanced_output = True
break
return imbalanced_output
================================================
FILE: ludwig/utils/backward_compatibility.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import copy
import logging
import warnings
from collections.abc import Callable
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
AUDIO,
BIAS,
CLASS_WEIGHTS,
COLUMN,
CONV_BIAS,
CONV_USE_BIAS,
DECODER,
DEFAULT_BIAS,
DEFAULT_USE_BIAS,
DEFAULTS,
ENCODER,
EVAL_BATCH_SIZE,
EXECUTOR,
FORCE_SPLIT,
HEIGHT,
HYPEROPT,
IMAGE,
INPUT_FEATURES,
LOSS,
MISSING_VALUE_STRATEGY,
MODEL_ECD,
NAME,
NUM_SAMPLES,
NUMBER,
OUTPUT_FEATURES,
PARAMETERS,
PREPROCESSING,
PROBABILITIES,
RANDOM,
RAY,
SAMPLER,
SCHEDULER,
SEARCH_ALG,
SEQUENCE,
SPLIT,
SPLIT_PROBABILITIES,
STRATIFY,
TEXT,
TIMESERIES,
TRAINER,
TRAINING,
TYPE,
USE_BIAS,
WIDTH,
)
from ludwig.features.feature_registries import get_base_type_registry, get_input_type_registry, get_output_type_registry
from ludwig.globals import LUDWIG_VERSION
from ludwig.schema.encoders.utils import get_encoder_cls
from ludwig.types import (
FeatureConfigDict,
FeatureTypeDefaultsDict,
HyperoptConfigDict,
ModelConfigDict,
PreprocessingConfigDict,
TrainerConfigDict,
TrainingSetMetadataDict,
)
from ludwig.utils.metric_utils import TrainerMetric
from ludwig.utils.misc_utils import get_from_registry, merge_dict
from ludwig.utils.version_transformation import VersionTransformation, VersionTransformationRegistry
config_transformation_registry = VersionTransformationRegistry()
@DeveloperAPI
def register_config_transformation(version: str, prefixes: str | list[str] = []) -> Callable:
"""This decorator registers a transformation function for a config version. Version is the first version which
requires the transform. For example, since "training" is renamed to "trainer" in 0.5, this change should be
registered with 0.5. from_version < version <= to_version.
Args:
version: The version to register this transformation with. The earliest ludwig version which requires this
transformation.
prefixes: A list of keypath prefixes to apply this transformation to. If not specified, transforms the entire
config dict. If a prefix indicates a list, i.e. "input_features", the transformation is applied to
each element of the list (each input feature).
"""
if isinstance(prefixes, str):
prefixes = [prefixes]
def wrap(fn: Callable[[dict], dict]):
config_transformation_registry.register(VersionTransformation(transform=fn, version=version, prefixes=prefixes))
return fn
return wrap
@DeveloperAPI
def upgrade_config_dict_to_latest_version(config: ModelConfigDict) -> ModelConfigDict:
"""Updates config from an older version of Ludwig to the current version. If config does not have a
"ludwig_version" key, all updates are applied.
Args:
config: A config saved by an older version of Ludwig.
Returns:
A new copy of config, upgraded to the current Ludwig version.
"""
return config_transformation_registry.update_config(
config, from_version=config.get("ludwig_version", "0.0"), to_version=LUDWIG_VERSION
)
def upgrade_model_progress(model_progress: dict) -> dict:
"""Updates model progress info to be compatible with latest ProgressTracker implementation.
Notably, we convert epoch-based stats to their step-based equivalents and reformat metrics into `TrainerMetric`
tuples.
"""
ret = copy.deepcopy(model_progress)
if "last_improvement_epoch" in ret:
ret["last_improvement_steps"] = ret["last_improvement_epoch"] * ret["batch_size"]
del ret["last_improvement_epoch"]
if "last_learning_rate_reduction_epoch" in ret:
ret["last_learning_rate_reduction_steps"] = ret["last_learning_rate_reduction_epoch"] * ret["batch_size"]
del ret["last_learning_rate_reduction_epoch"]
if "last_increase_batch_size_epoch" in ret:
ret["last_increase_batch_size_steps"] = ret["last_increase_batch_size_epoch"] * ret["batch_size"]
del ret["last_increase_batch_size_epoch"]
if "vali_metrics" in ret:
ret["validation_metrics"] = ret["vali_metrics"]
del ret["vali_metrics"]
for metric_group in ("train_metrics", "test_metrics", "validation_metrics"):
if metric_group not in ret:
continue
for tgt in ret[metric_group]:
for metric in ret[metric_group][tgt]:
if len(ret[metric_group][tgt][metric]) == 0 or isinstance(
ret[metric_group][tgt][metric][0], (tuple, list)
):
continue
ret[metric_group][tgt][metric] = [
TrainerMetric(i + 1, (i + 1) * ret["batch_size"], val)
for i, val in enumerate(ret[metric_group][tgt][metric])
]
if "tune_checkpoint_num" not in ret:
ret["tune_checkpoint_num"] = 0
# Upgrades related to extending progress tracker with explicit bests.
if "checkpoint_number" not in ret:
ret["checkpoint_number"] = 0
if "best_eval_metric_steps" not in ret:
ret["best_eval_metric_steps"] = 0
if "best_eval_metric_epoch" not in ret:
ret["best_eval_metric_epoch"] = 0
if "best_eval_metric_checkpoint_number" not in ret:
ret["best_eval_metric_checkpoint_number"] = 0
if "best_eval_train_metrics" not in ret:
ret["best_eval_train_metrics"] = {}
if "best_eval_validation_metrics" not in ret:
ret["best_eval_validation_metrics"] = {}
if "best_eval_test_metrics" not in ret:
ret["best_eval_test_metrics"] = {}
if "best_eval_metric" in ret:
ret["best_eval_metric_value"] = ret["best_eval_metric"]
del ret["best_eval_metric"]
if "last_improvement" in ret:
del ret["last_improvement"]
# Delete learning-rate related fields removed in https://github.com/ludwig-ai/ludwig/pull/2877.
if "best_reduce_learning_rate_eval_metric" in ret:
del ret["best_reduce_learning_rate_eval_metric"]
if "last_reduce_learning_rate_eval_metric_improvement" in ret:
del ret["last_reduce_learning_rate_eval_metric_improvement"]
return ret
def _traverse_dicts(config: Any, f: Callable[[dict], None]):
"""Recursively applies function f to every dictionary contained in config.
f should in-place modify the config dict. f will be called on leaves first, root last.
"""
if isinstance(config, dict):
for k, v in config.items():
_traverse_dicts(v, f)
f(config)
elif isinstance(config, list):
for v in config:
_traverse_dicts(v, f)
@register_config_transformation("0.6", "backend")
def _update_backend_cache_credentials(backend: dict[str, Any]) -> dict[str, Any]:
if "cache_credentials" in backend:
credentials = backend.get("credentials", {})
if "cache" in credentials:
warnings.warn("`cache` already found in `backend.credentials`, ignoring `cache_credentials`")
else:
warnings.warn(
"`backend.cache_credentials` has been renamed `backend.credentials.cache`", DeprecationWarning
)
credentials["cache"] = backend.pop("cache_credentials")
backend["credentials"] = credentials
return backend
@register_config_transformation("0.6", ["output_features"])
def update_class_weights_in_features(feature: FeatureConfigDict) -> FeatureConfigDict:
if LOSS in feature:
class_weights = feature[LOSS].get(CLASS_WEIGHTS, None)
if not isinstance(class_weights, (list, dict)):
class_weights = None
feature[LOSS][CLASS_WEIGHTS] = class_weights
return feature
@register_config_transformation("0.4")
def _update_level_metadata(config: ModelConfigDict) -> ModelConfigDict:
# Replace parameters represented as keys with params represented as values.
# Precedence is defined by first in the dictionary order, so if multiple
# provided keys map to the same value, the one that appears earlier in this
# dictionary will take priority.
drop_params = {
"sequence_length_limit": "max_sequence_length",
"word_most_common": "most_common",
"word_sequence_length_limit": "max_sequence_length",
"word_tokenizer": "tokenizer",
"word_vocab_file": "vocab_file",
"char_most_common": "most_common",
"char_sequence_length_limit": "max_sequence_length",
"char_tokenizer": "tokenizer",
"char_vocab_file": "vocab_file",
}
def upgrade_params(params):
for key, value in drop_params.items():
if key in params:
if value in params:
warnings.warn(
f"Removing deprecated config preprocessing parameter {key} as new param {value} already "
f"present in the config",
DeprecationWarning,
)
else:
warnings.warn(
f"Renaming deprecated config preprocessing parameter {key} to {value}",
DeprecationWarning,
)
params[value] = params[key]
del params[key]
sequence_types = [SEQUENCE, TEXT, AUDIO, TIMESERIES]
for dtype in sequence_types:
params = config.get(PREPROCESSING, {}).get(dtype, {})
upgrade_params(params)
for feature in config[INPUT_FEATURES]:
if feature.get(TYPE) not in sequence_types:
continue
params = feature.get(PREPROCESSING, {})
upgrade_params(params)
return config
@register_config_transformation("0.5")
def rename_training_to_trainer(config: ModelConfigDict) -> ModelConfigDict:
if TRAINING in config:
warnings.warn(
'Config section "training" renamed to "trainer" and will be removed in a future version', DeprecationWarning
)
config[TRAINER] = config[TRAINING]
del config[TRAINING]
return config
@register_config_transformation("0.5", ["input_features", "output_features"])
def _upgrade_use_bias_in_features(feature: FeatureConfigDict) -> FeatureConfigDict:
def upgrade_use_bias(config):
if BIAS in config:
warnings.warn(
'Parameter "bias" renamed to "use_bias" and will be removed in a future version', DeprecationWarning
)
config[USE_BIAS] = config[BIAS]
del config[BIAS]
if CONV_BIAS in config:
warnings.warn(
'Parameter "conv_bias" renamed to "conv_use_bias" and will be removed in a future version',
DeprecationWarning,
)
config[CONV_USE_BIAS] = config[CONV_BIAS]
del config[CONV_BIAS]
if DEFAULT_BIAS in config:
warnings.warn(
'Parameter "default_bias" renamed to "default_use_bias" and will be removed in a future version',
DeprecationWarning,
)
config[DEFAULT_USE_BIAS] = config[DEFAULT_BIAS]
del config[DEFAULT_BIAS]
_traverse_dicts(feature, upgrade_use_bias)
return feature
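# Illustrative sketch (not part of the Ludwig source): the rename above relies on _traverse_dicts visiting
# nested dicts before their parents. The stand-alone version below re-implements that traversal under a
# hypothetical name (traverse_dicts) and applies a single bias -> use_bias rename to a made-up feature.

```python
def traverse_dicts(config, f):
    """Apply f to every dict inside config, children before parents."""
    if isinstance(config, dict):
        for value in config.values():
            traverse_dicts(value, f)
        # The parent is modified only after its children have been visited.
        f(config)
    elif isinstance(config, list):
        for value in config:
            traverse_dicts(value, f)


def rename_bias(d: dict) -> None:
    if "bias" in d:
        d["use_bias"] = d.pop("bias")


feature = {"name": "x", "encoder": {"bias": True, "layers": [{"bias": False}]}}
traverse_dicts(feature, rename_bias)
```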
@register_config_transformation("0.5", ["input_features", "output_features"])
def _upgrade_feature(feature: FeatureConfigDict) -> FeatureConfigDict:
"""Upgrades feature config (in-place)"""
if feature.get(TYPE) == "numerical":
warnings.warn(
'Feature type "numerical" renamed to "number" and will be removed in a future version', DeprecationWarning
)
feature[TYPE] = NUMBER
if feature.get(TYPE) == AUDIO:
if PREPROCESSING in feature:
feature[PREPROCESSING] = upgrade_audio_preprocessing(feature[PREPROCESSING])
warnings.warn(
"Parameters specified at the `audio_feature` parameter level have been unnested and should now "
"be specified at the preprocessing level. Support for `audio_feature` will be removed in a future version",
DeprecationWarning,
)
return feature
def upgrade_audio_preprocessing(preproc_dict: PreprocessingConfigDict) -> PreprocessingConfigDict:
if "audio_feature" in preproc_dict:
for k, v in preproc_dict["audio_feature"].items():
preproc_dict[k] = v
del preproc_dict["audio_feature"]
return preproc_dict
@register_config_transformation("0.6", ["input_features"])
def _upgrade_encoder_params(feature: FeatureConfigDict) -> FeatureConfigDict:
return _upgrade_encoder_decoder_params(feature, True)
@register_config_transformation("0.6", ["output_features"])
def _upgrade_decoder_params(feature: FeatureConfigDict) -> FeatureConfigDict:
return _upgrade_encoder_decoder_params(feature, False)
def _upgrade_encoder_decoder_params(feature: FeatureConfigDict, input_feature: bool) -> FeatureConfigDict:
"""
This function nests un-nested encoder/decoder parameters to conform with the new config structure for 0.6
Args:
feature (Dict): Feature to nest encoder/decoder params for.
input_feature (Bool): Whether this feature is an input feature or not.
"""
if TYPE not in feature:
return feature
try:
if input_feature:
module_type = ENCODER
feature_cls = get_from_registry(feature[TYPE], get_input_type_registry())
else:
module_type = DECODER
feature_cls = get_from_registry(feature[TYPE], get_output_type_registry())
except ValueError:
logging.exception("Failed to obtain encoder / decoder from registry")
return feature
feature_schema_cls = feature_cls.get_schema_cls()
feature_keys = feature_schema_cls.get_valid_field_names()
# For decoders, these keys are additionally copied under an `fc_` prefix (e.g. `dropout` -> `fc_dropout`)
fc_layer_keys = [
"fc_layers",
"output_size",
"use_bias",
"weights_initializer",
"bias_initializer",
"norm",
"norm_params",
"activation",
"dropout",
]
module = feature.get(module_type, {})
warn = False
if isinstance(module, str):
module = {TYPE: module}
feature[module_type] = module
warn = True
nested_params = []
for k, v in feature.items():
if k not in feature_keys:
module[k] = v
if k in fc_layer_keys and module_type == DECODER:
module[f"fc_{k}"] = v
nested_params.append(k)
warn = True
if module:
if module_type in feature:
feature[module_type].update(module)
else:
feature[module_type] = module
for k in nested_params:
del feature[k]
if warn:
warnings.warn(
f"{module_type} specific parameters should now be nested within a dictionary under the '{module_type}' "
f"parameter. Support for un-nested {module_type} specific parameters will be removed in a future version",
DeprecationWarning,
)
return feature
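# Illustrative sketch (not part of Ludwig): the nesting performed above, reduced to a
# standalone function for input features only. TOP_LEVEL_KEYS is a hypothetical
# stand-in for the field names that `get_valid_field_names()` would return, and
# `nest_encoder_params` is a made-up name. The decoder-only `fc_` copies are omitted.

```python
import warnings

# Assumption: a hand-written set of keys allowed at the feature's top level.
TOP_LEVEL_KEYS = {"name", "type", "column", "preprocessing", "encoder", "tied"}


def nest_encoder_params(feature: dict) -> dict:
    """Move unknown top-level keys of an input feature under `encoder` (in-place)."""
    module = feature.get("encoder", {})
    if isinstance(module, str):
        # Legacy form `encoder: parallel_cnn` becomes `encoder: {type: parallel_cnn}`.
        module = {"type": module}
    # Collect the keys first so the dict is not mutated while iterating over it.
    moved = [k for k in feature if k not in TOP_LEVEL_KEYS]
    for k in moved:
        module[k] = feature.pop(k)
    if module:
        feature["encoder"] = module
    if moved:
        warnings.warn("encoder parameters should be nested under 'encoder'", DeprecationWarning)
    return feature


legacy = {"name": "review", "type": "text", "encoder": "parallel_cnn", "num_filters": 64}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    upgraded = nest_encoder_params(dict(legacy))
print(upgraded)
```

# The legacy string encoder and the loose `num_filters` key both end up inside one
# `encoder` dictionary, and a single DeprecationWarning is recorded.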
@register_config_transformation("0.5", ["hyperopt"])
def _upgrade_hyperopt(hyperopt: HyperoptConfigDict) -> HyperoptConfigDict:
"""Upgrades hyperopt config (in-place)"""
# check for use of legacy "training" reference, if any found convert to "trainer"
if PARAMETERS in hyperopt:
hparams = hyperopt[PARAMETERS]
for k, v in list(hparams.items()):
substr = "training."
if k.startswith(substr):
warnings.warn(
'Config section "training" renamed to "trainer" and will be removed in a future version',
DeprecationWarning,
)
hparams["trainer." + k[len(substr) :]] = v
del hparams[k]
# check for legacy parameters in "executor"
if EXECUTOR in hyperopt:
hpexecutor = hyperopt[EXECUTOR]
executor_type = hpexecutor.get(TYPE, None)
if executor_type is not None and executor_type != RAY:
warnings.warn(
f'executor type "{executor_type}" not supported, converted to "ray" will be flagged as error '
"in a future version",
DeprecationWarning,
)
hpexecutor[TYPE] = RAY
# if search_alg not at top level and is present in executor, promote to top level
if SEARCH_ALG in hpexecutor:
# promote only if not in top-level, otherwise use current top-level
if SEARCH_ALG not in hyperopt:
hyperopt[SEARCH_ALG] = hpexecutor[SEARCH_ALG]
if isinstance(hyperopt[SEARCH_ALG], str):
hyperopt[SEARCH_ALG] = {TYPE: hyperopt[SEARCH_ALG]}
del hpexecutor[SEARCH_ALG]
else:
warnings.warn(
'Missing "executor" section, adding "ray" executor will be flagged as error in a future version',
DeprecationWarning,
)
hyperopt[EXECUTOR] = {TYPE: RAY}
# check for legacy "sampler" section
if SAMPLER in hyperopt:
warnings.warn(
f'"{SAMPLER}" is no longer supported, converted to "{SEARCH_ALG}". "{SAMPLER}" will be flagged as '
"error in a future version",
DeprecationWarning,
)
if SEARCH_ALG in hyperopt[SAMPLER]:
if SEARCH_ALG not in hyperopt:
hyperopt[SEARCH_ALG] = hyperopt[SAMPLER][SEARCH_ALG]
if isinstance(hyperopt[SEARCH_ALG], str):
hyperopt[SEARCH_ALG] = {TYPE: hyperopt[SEARCH_ALG]}
warnings.warn('Moved "search_alg" to hyperopt config top-level', DeprecationWarning)
# if num_samples or scheduler exist in SAMPLER move to EXECUTOR Section
if NUM_SAMPLES in hyperopt[SAMPLER] and NUM_SAMPLES not in hyperopt[EXECUTOR]:
hyperopt[EXECUTOR][NUM_SAMPLES] = hyperopt[SAMPLER][NUM_SAMPLES]
warnings.warn('Moved "num_samples" from "sampler" to "executor"', DeprecationWarning)
if SCHEDULER in hyperopt[SAMPLER] and SCHEDULER not in hyperopt[EXECUTOR]:
hyperopt[EXECUTOR][SCHEDULER] = hyperopt[SAMPLER][SCHEDULER]
warnings.warn('Moved "scheduler" from "sampler" to "executor"', DeprecationWarning)
if SCHEDULER in hyperopt[EXECUTOR] and len(hyperopt[EXECUTOR][SCHEDULER].keys()) == 0:
del hyperopt[EXECUTOR][SCHEDULER]
# remove legacy section
del hyperopt[SAMPLER]
if SEARCH_ALG not in hyperopt:
# make top-level as search_alg, if missing put in default value
hyperopt[SEARCH_ALG] = {TYPE: "variant_generator"}
warnings.warn(
'Missing "search_alg" at hyperopt top-level, adding in default value, will be flagged as error '
"in a future version",
DeprecationWarning,
)
return hyperopt
@register_config_transformation("0.5", ["trainer"])
def _upgrade_trainer(trainer: TrainerConfigDict) -> TrainerConfigDict:
"""Upgrades trainer config (in-place)"""
eval_batch_size = trainer.get(EVAL_BATCH_SIZE)
if eval_batch_size == 0:
warnings.warn(
"`trainer.eval_batch_size` value `0` changed to `None`, will be unsupported in a future version",
DeprecationWarning,
)
trainer[EVAL_BATCH_SIZE] = None
return trainer
@register_config_transformation("0.5")
def _upgrade_preprocessing_defaults(config: ModelConfigDict) -> ModelConfigDict:
"""Move feature-specific preprocessing parameters into defaults in config (in-place)"""
type_specific_preprocessing_params = dict()
# If preprocessing section specified and it contains feature specific preprocessing parameters,
# make a copy and delete it from the preprocessing section
for parameter in list(config.get(PREPROCESSING, {})):
if parameter in get_base_type_registry():
warnings.warn(
f"Moving preprocessing configuration for `{parameter}` feature type from `preprocessing` section"
" to `defaults` section in Ludwig config. This will be unsupported in a future version.",
DeprecationWarning,
)
type_specific_preprocessing_params[parameter] = config[PREPROCESSING].pop(parameter)
if parameter == "numerical":
warnings.warn(
f"Moving preprocessing configuration for `{parameter}` feature type from `preprocessing` section"
" to `defaults` section in Ludwig config. This will be unsupported in a future version.",
DeprecationWarning,
)
type_specific_preprocessing_params[NUMBER] = config[PREPROCESSING].pop(parameter)
# Delete empty preprocessing section if no other preprocessing parameters specified
if PREPROCESSING in config and not config[PREPROCESSING]:
del config[PREPROCESSING]
# Update defaults with the default feature specific preprocessing parameters
defaults = config.get(DEFAULTS, {})
for feature_type, preprocessing_param in type_specific_preprocessing_params.items():
if PREPROCESSING in preprocessing_param:
preprocessing_param = preprocessing_param[PREPROCESSING]
if feature_type == AUDIO:
preprocessing_param = upgrade_audio_preprocessing(preprocessing_param)
# If defaults was empty, then create a new key with feature type
if feature_type not in defaults:
defaults[feature_type] = {PREPROCESSING: preprocessing_param}
# Feature type exists but preprocessing hasn't been specified
elif PREPROCESSING not in defaults[feature_type]:
defaults[feature_type][PREPROCESSING] = preprocessing_param
# Update default feature specific preprocessing with parameters from config
else:
defaults[feature_type][PREPROCESSING].update(
merge_dict(defaults[feature_type][PREPROCESSING], preprocessing_param)
)
if defaults:
config[DEFAULTS] = defaults
return config
@register_config_transformation("0.5", "preprocessing")
def _upgrade_preprocessing_split(preprocessing: PreprocessingConfigDict) -> PreprocessingConfigDict:
"""Upgrade split related parameters in preprocessing."""
split_params = {}
force_split = preprocessing.pop(FORCE_SPLIT, None)
split_probabilities = preprocessing.pop(SPLIT_PROBABILITIES, None)
stratify = preprocessing.pop(STRATIFY, None)
if split_probabilities is not None:
split_params[PROBABILITIES] = split_probabilities
warnings.warn(
"`preprocessing.split_probabilities` has been replaced by `preprocessing.split.probabilities`, "
"will be flagged as error in a future version",
DeprecationWarning,
)
if stratify is not None:
split_params[TYPE] = STRATIFY
split_params[COLUMN] = stratify
warnings.warn(
"`preprocessing.stratify` has been replaced by `preprocessing.split.column` "
'when setting `preprocessing.split.type` to "stratify", '
"will be flagged as error in a future version",
DeprecationWarning,
)
if force_split is not None:
warnings.warn(
"`preprocessing.force_split` has been replaced by `preprocessing.split.type`, "
"will be flagged as error in a future version",
DeprecationWarning,
)
if TYPE not in split_params:
split_params[TYPE] = RANDOM
if split_params:
preprocessing[SPLIT] = split_params
if AUDIO in preprocessing:
if "audio_feature" in preprocessing[AUDIO]:
for k, v in preprocessing[AUDIO]["audio_feature"].items():
preprocessing[AUDIO][k] = v
del preprocessing[AUDIO]["audio_feature"]
warnings.warn(
"Parameters specified at the `audio_feature` parameter level have been unnested and should now "
"be specified at the preprocessing level. Support for `audio_feature` will be removed in a future version",
DeprecationWarning,
)
return preprocessing
@register_config_transformation("0.5")
def update_training(config: ModelConfigDict) -> ModelConfigDict:
if TRAINING in config:
warnings.warn(
'Config section "training" renamed to "trainer" and will be removed in a future version', DeprecationWarning
)
config[TRAINER] = config[TRAINING]
del config[TRAINING]
return config
@register_config_transformation("0.6")
def upgrade_missing_value_strategy(config: ModelConfigDict) -> ModelConfigDict:
for input_feature in config.get(INPUT_FEATURES, []):
if _is_old_missing_value_strategy(input_feature):
_update_old_missing_value_strategy(input_feature)
for output_feature in config.get(OUTPUT_FEATURES, []):
if _is_old_missing_value_strategy(output_feature):
_update_old_missing_value_strategy(output_feature)
for feature, feature_defaults in config.get(DEFAULTS, {}).items():
if _is_old_missing_value_strategy(feature_defaults):
_update_old_missing_value_strategy(config.get(DEFAULTS).get(feature))
return config
@register_config_transformation("0.6", ["trainer"])
def _upgrade_max_batch_size(trainer: TrainerConfigDict) -> TrainerConfigDict:
if "increase_batch_size_on_plateau_max" in trainer:
warnings.warn(
'Config param "increase_batch_size_on_plateau_max" renamed to "max_batch_size" and will be '
"removed in a future version",
DeprecationWarning,
)
increase_batch_size_on_plateau_max_val = trainer.pop("increase_batch_size_on_plateau_max")
if "max_batch_size" in trainer:
warnings.warn('"max_batch_size" config param already set. Discarding "increase_batch_size_on_plateau_max".')
else:
warnings.warn(
f'Setting "max_batch_size" config param to "increase_batch_size_on_plateau_max" value '
f'({increase_batch_size_on_plateau_max_val}) and discarding "increase_batch_size_on_plateau_max"'
)
trainer["max_batch_size"] = increase_batch_size_on_plateau_max_val
return trainer
@register_config_transformation("0.6")
def remove_trainer_type(config: ModelConfigDict) -> ModelConfigDict:
# LLM Model types support different trainer types
if config.get("model_type", None) == "llm":
return config
if TYPE in config.get("trainer", {}):
warnings.warn(
"Config param `type` has been removed from the trainer. The trainer type is determined by the top level "
" `model_type` parameter. Support for the `type` params in trainer will be removed in a future version",
DeprecationWarning,
)
del config["trainer"][TYPE]
return config
@register_config_transformation("0.7", ["trainer"])
def learning_rate_scheduler(trainer: TrainerConfigDict) -> TrainerConfigDict:
key_mapping = {
"reduce_learning_rate_on_plateau": "reduce_on_plateau",
"reduce_learning_rate_on_plateau_patience": "reduce_on_plateau_patience",
"reduce_learning_rate_on_plateau_rate": "reduce_on_plateau_rate",
"reduce_learning_rate_eval_metric": "reduce_eval_metric",
"reduce_learning_rate_eval_split": "reduce_eval_split",
"decay": "decay",
"decay_steps": "decay_steps",
"decay_rate": "decay_rate",
"staircase": "staircase",
"learning_rate_warmup_epochs": "warmup_evaluations",
}
lr_scheduler = trainer.get("learning_rate_scheduler", {})
for old_key, new_key in key_mapping.items():
if old_key in trainer:
warnings.warn(
f"Config param `trainer.{old_key}` has been moved to `trainer.learning_rate_scheduler.{new_key}`.",
DeprecationWarning,
)
if new_key in lr_scheduler:
warnings.warn(
f"`trainer.learning_rate_scheduler.{new_key}` config param already set. "
f"Discarding `trainer.{old_key}`."
)
else:
value = trainer[old_key]
if old_key == "decay" and isinstance(value, bool):
# Decay has changed from a bool to an optional enum
lr_scheduler[new_key] = "exponential" if value else None
elif old_key == "reduce_learning_rate_on_plateau":
lr_scheduler[new_key] = int(value)
else:
lr_scheduler[new_key] = value
del trainer[old_key]
if lr_scheduler:
trainer["learning_rate_scheduler"] = lr_scheduler
return trainer
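# Illustrative sketch (not part of Ludwig): the key migration above on a plain dict,
# using a shortened mapping. `move_scheduler_keys` and the sample values are
# hypothetical; warnings are left out to keep the example short.

```python
import copy

# Subset of the old -> new key mapping used by the transformation above.
KEY_MAPPING = {
    "reduce_learning_rate_on_plateau": "reduce_on_plateau",
    "decay": "decay",
    "decay_rate": "decay_rate",
    "learning_rate_warmup_epochs": "warmup_evaluations",
}


def move_scheduler_keys(trainer: dict) -> dict:
    """Move legacy trainer keys under `learning_rate_scheduler` (in-place)."""
    lr_scheduler = trainer.get("learning_rate_scheduler", {})
    for old_key, new_key in KEY_MAPPING.items():
        if old_key not in trainer:
            continue
        value = trainer.pop(old_key)
        if new_key in lr_scheduler:
            # A value already set in the new location wins; the legacy one is dropped.
            continue
        if old_key == "decay" and isinstance(value, bool):
            # `decay` changed from a bool to an optional enum.
            value = "exponential" if value else None
        elif old_key == "reduce_learning_rate_on_plateau":
            value = int(value)
        lr_scheduler[new_key] = value
    if lr_scheduler:
        trainer["learning_rate_scheduler"] = lr_scheduler
    return trainer


legacy_trainer = {
    "learning_rate": 0.001,
    "decay": True,
    "decay_rate": 0.9,
    "reduce_learning_rate_on_plateau": 2.0,
    "learning_rate_scheduler": {"decay_rate": 0.5},
}
upgraded_trainer = move_scheduler_keys(copy.deepcopy(legacy_trainer))
print(upgraded_trainer)
```

# Here the legacy `decay_rate` of 0.9 is discarded because the scheduler section
# already sets 0.5, while `decay: True` becomes `decay: "exponential"`.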
@register_config_transformation("0.7", ["input_features"])
def _upgrade_legacy_image_encoders(feature: FeatureConfigDict) -> FeatureConfigDict:
if feature.get(TYPE) != IMAGE:
return feature
encoder_mapping = {
"resnet": "_resnet_legacy",
"vit": "_vit_legacy",
}
encoder = feature.get(ENCODER, {})
encoder_type = encoder.get(TYPE)
if encoder_type not in encoder_mapping:
return feature
# For this version of Ludwig, only ECD supported these encoders.
new_encoder_cls = get_encoder_cls(MODEL_ECD, feature[TYPE], encoder_type)
new_encoder_fields = new_encoder_cls.get_valid_field_names()
legacy_encoder_cls = get_encoder_cls(MODEL_ECD, feature[TYPE], encoder_mapping[encoder_type])
legacy_encoder_fields = legacy_encoder_cls.get_valid_field_names()
user_fields = set(encoder.keys())
user_fields.remove(TYPE)
removed_fields = legacy_encoder_fields.difference(new_encoder_fields)
added_fields = new_encoder_fields.difference(legacy_encoder_fields)
user_legacy_fields = user_fields.intersection(removed_fields)
user_new_fields = user_fields.intersection(added_fields)
if len(user_legacy_fields) > 0:
if len(user_new_fields) > 0:
raise ValueError(
f"Intended encoder type is ambiguous. "
f"Provided encoder fields matching encoder '{encoder_type}' {user_new_fields} and "
f"legacy encoder '{encoder_mapping[encoder_type]}' {user_legacy_fields}. "
f"Please remove features unique to one of these encoder types from your configuration."
)
warnings.warn(
f"Encoder '{encoder_type}' with params '{user_legacy_fields}' has been renamed to "
f"'{encoder_mapping[encoder_type]}'. Please upgrade your config to use the new '{encoder_type}' as "
f"support for '{encoder_mapping[encoder_type]}' is not guaranteed in future versions.",
DeprecationWarning,
)
# User provided legacy fields and no new fields, so we assume they intended to use the legacy encoder
encoder[TYPE] = encoder_mapping[encoder_type]
return feature
@register_config_transformation("0.7")
def upgrade_missing_hyperopt(config: ModelConfigDict) -> ModelConfigDict:
hyperopt = config.get(HYPEROPT)
if hyperopt == {}:
# This is a deprecated form of providing a missing hyperopt section, as it violates the schema definition
warnings.warn(
"Config section `hyperopt: {}` is deprecated, please set `hyperopt: null` to disable hyperopt.",
DeprecationWarning,
)
del config[HYPEROPT]
return config
@register_config_transformation("0.7", "defaults")
def remove_extra_type_param_in_defaults_config(defaults: FeatureTypeDefaultsDict) -> FeatureTypeDefaultsDict:
"""Fixes a bug introduced before 0.7.3.
[1] and subsequent refactors accidentally introduced a bug where a `type` param was added to every feature in the
defaults config. It was removed by [2], but made it into one of the patch releases. This transformation removes that
`type` param from each section of the defaults config if it exists.
[1]: https://github.com/ludwig-ai/ludwig/pull/3223
[2]: https://github.com/ludwig-ai/ludwig/pull/3258
"""
defaults_copy = copy.deepcopy(defaults)
for feature_type, feature_config in defaults.items():
if TYPE in feature_config:
del defaults_copy[feature_type][TYPE]
return defaults_copy
def upgrade_metadata(metadata: TrainingSetMetadataDict) -> TrainingSetMetadataDict:
# TODO(travis): stopgap solution, we should make it so we don't need to do this
# by decoupling config and metadata
metadata = copy.deepcopy(metadata)
_upgrade_metadata_missing_values(metadata)
return metadata
def _upgrade_metadata_missing_values(metadata: TrainingSetMetadataDict):
for k, v in metadata.items():
if isinstance(v, dict) and _is_old_missing_value_strategy(v):
_update_old_missing_value_strategy(v)
elif isinstance(v, dict) and _is_image_feature(v):
_update_old_image_preprocessing(v)
def _update_old_missing_value_strategy(feature_config: FeatureConfigDict):
missing_value_strategy = feature_config.get(PREPROCESSING).get(MISSING_VALUE_STRATEGY)
replacement_strategy = "bfill" if missing_value_strategy == "backfill" else "ffill"
feature_name = feature_config.get(NAME)
warnings.warn(
f"Using `{replacement_strategy}` instead of `{missing_value_strategy}` as the missing value strategy"
f" for `{feature_name}`. These are identical. `{missing_value_strategy}` will be removed in a future version",
DeprecationWarning,
)
feature_config[PREPROCESSING].update({MISSING_VALUE_STRATEGY: replacement_strategy})
def _is_old_missing_value_strategy(feature_config: FeatureConfigDict):
if PREPROCESSING not in feature_config:
return False
missing_value_strategy = feature_config.get(PREPROCESSING).get(MISSING_VALUE_STRATEGY, None)
if not missing_value_strategy or missing_value_strategy not in ("backfill", "pad"):
return False
return True
def _is_image_feature(feature_config: FeatureConfigDict):
preproc = feature_config.get(PREPROCESSING, {})
return HEIGHT in preproc and WIDTH in preproc
def _update_old_image_preprocessing(feature_config: FeatureConfigDict):
preprocessing = feature_config.get(PREPROCESSING)
if not preprocessing:
return
preprocessing["standardize_image"] = preprocessing.get("standardize_image")
================================================
FILE: ludwig/utils/batch_size_tuner.py
================================================
import gc
import logging
import statistics
import time
from abc import ABC
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import MAX_BATCH_SIZE_DATASET_FRACTION, MIN_POSSIBLE_BATCH_SIZE
logger = logging.getLogger(__name__)
TOTAL_STEPS = 5
@DeveloperAPI
class BatchSizeEvaluator(ABC):
def select_best_batch_size(
self,
dataset_len: int,
max_batch_size: int | None = None,
max_trials: int = 20,
is_coordinator: bool | None = True,
global_max_sequence_length: int | None = None,
) -> int:
"""Returns optimal batch size as measured by throughput (samples / sec)."""
logger.info("Tuning batch size...")
max_batch_size = max_batch_size or dataset_len
def _is_valid_batch_size(batch_size):
# make sure that batch size is valid: at most MAX_BATCH_SIZE_DATASET_FRACTION of the dataset size and at most max_batch_size
is_smaller_than_training_set = batch_size <= MAX_BATCH_SIZE_DATASET_FRACTION * dataset_len
is_under_max_batch_size = batch_size <= max_batch_size
is_valid = is_smaller_than_training_set and is_under_max_batch_size
if not is_valid and is_coordinator:
logger.info(
f"Batch size {batch_size} is invalid, must be less than or equal to "
f"{MAX_BATCH_SIZE_DATASET_FRACTION * 100}% dataset size "
f"({int(MAX_BATCH_SIZE_DATASET_FRACTION * dataset_len)} samples "
f"of {dataset_len}) and less than or equal to max batch size {max_batch_size}"
)
return is_valid
batch_size = MIN_POSSIBLE_BATCH_SIZE
best_samples_per_sec = 0
best_batch_size = None
count = 0
while count < max_trials and _is_valid_batch_size(batch_size):
if is_coordinator:
logger.info(f"Exploring batch_size={batch_size}")
gc.collect()
try:
samples_per_sec = self.evaluate(
batch_size, total_steps=TOTAL_STEPS, global_max_sequence_length=global_max_sequence_length
)
if is_coordinator:
logger.info(f"Throughput at batch_size={batch_size}: {samples_per_sec:.5f} samples/s")
if samples_per_sec < best_samples_per_sec:
# We assume that once the throughput starts degrading, it won't go up again
if is_coordinator:
logger.info(f"Throughput decrease at batch_size={batch_size}")
break
best_samples_per_sec = samples_per_sec
best_batch_size = batch_size
count += 1
# double batch size
batch_size *= 2
except RuntimeError as e:
# PyTorch reports CUDA OOM as a RuntimeError (torch.cuda.OutOfMemoryError is a subclass); other RuntimeErrors are re-raised below.
gc.collect()
if "CUDA out of memory" in str(e) or isinstance(e, torch.cuda.OutOfMemoryError):
if is_coordinator:
logger.info(f"OOM at batch_size={batch_size}")
else:
# Not a CUDA error
raise
break
# Ensure that some batch size is found.
# `best_batch_size` can be None if the first batch size is invalid.
if best_batch_size is None:
if is_coordinator:
logger.info(f"Could not tune batch size, using minimum batch size of {MIN_POSSIBLE_BATCH_SIZE}")
best_batch_size = MIN_POSSIBLE_BATCH_SIZE
if is_coordinator:
logger.info(f"Selected batch_size={best_batch_size}")
return best_batch_size
def evaluate(self, batch_size: int, total_steps: int = 5, global_max_sequence_length: int | None = None) -> float:
"""Evaluates throughput of the given batch size.
Return:
Median throughput in samples / sec.
"""
durations = []
for _ in range(total_steps):
self.reset()
start_ts = time.time()
self.step(batch_size, global_max_sequence_length=global_max_sequence_length)
durations.append(time.time() - start_ts)
med_duration_s = statistics.median(durations)
if med_duration_s == 0.0:
return float("inf")
return batch_size / med_duration_s
def reset(self):
"""Called at the beginning of each evaluation step."""
def step(self, batch_size: int, global_max_sequence_length: int | None = None):
"""Called each step to evaluate the given batch size."""
raise NotImplementedError("`step` must be implemented by concrete evaluator.")
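# Illustrative sketch (not part of Ludwig): the doubling search used by
# `select_best_batch_size`, with the timing replaced by a synthetic throughput
# function. `pick_batch_size`, `toy_throughput` and `oom_after_8` are hypothetical
# names. The defaults `min_batch_size=1` and `max_fraction=0.2` are assumptions
# standing in for MIN_POSSIBLE_BATCH_SIZE and MAX_BATCH_SIZE_DATASET_FRACTION, and
# MemoryError stands in for a CUDA out-of-memory error.

```python
def pick_batch_size(throughput, dataset_len, max_batch_size=None, max_trials=20,
                    min_batch_size=1, max_fraction=0.2):
    """Double the batch size until throughput drops, memory runs out, or a limit is hit."""
    max_batch_size = max_batch_size or dataset_len
    batch_size = min_batch_size
    best_size, best_rate = None, 0.0
    trials = 0
    while (trials < max_trials
           and batch_size <= max_fraction * dataset_len
           and batch_size <= max_batch_size):
        try:
            rate = throughput(batch_size)
        except MemoryError:
            # Stand-in for CUDA OOM: keep the last size that worked.
            break
        if rate < best_rate:
            # Assume throughput will not recover once it starts degrading.
            break
        best_size, best_rate = batch_size, rate
        batch_size *= 2
        trials += 1
    return best_size if best_size is not None else min_batch_size


def toy_throughput(batch_size):
    """Synthetic samples/sec: grows linearly up to 64, then degrades."""
    if batch_size <= 64:
        return batch_size * 100.0
    return 6400.0 - 10.0 * batch_size


def oom_after_8(batch_size):
    """Synthetic throughput that runs out of memory above batch size 8."""
    if batch_size > 8:
        raise MemoryError
    return float(batch_size)


print(pick_batch_size(toy_throughput, dataset_len=10_000))
print(pick_batch_size(toy_throughput, dataset_len=100))
print(pick_batch_size(oom_after_8, dataset_len=10_000))
```

# The three calls stop for three different reasons: a throughput drop, the
# dataset-fraction cap, and the simulated out-of-memory error.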
class BaseLLMBatchSizeEvaluator(BatchSizeEvaluator):
"""Base class for batch size evaluators for LLM models."""
def __init__(self, trainer):
self.trainer = trainer
self.input_feature_name, self.input_feature = list(trainer.model.input_features.items())[0]
self.output_feature_name, self.output_feature = list(trainer.model.output_features.items())[0]
# Get the length of the longest input sequence from the training data
self.input_msl = self.input_feature.input_shape[0]
if trainer.model.config_obj.input_features[0].preprocessing.max_sequence_length:
self.input_msl = trainer.model.config_obj.input_features[0].preprocessing.max_sequence_length
# Get the length of the longest output sequence from the training data
self.output_msl = self.output_feature.output_shape[0]
if trainer.model.config_obj.output_features[0].preprocessing.max_sequence_length:
self.output_msl = trainer.model.config_obj.output_features[0].preprocessing.max_sequence_length
# This is useful to create the synthetic input and target data which will be a
# random sequence of integers between 0 and vocab_size
self.vocab_size = len(trainer.model.config_obj.input_features[0].encoder.vocab)
def reset(self):
self.trainer.model.reset_metrics()
self.trainer.optimizer.zero_grad()
def step(self, batch_size: int, global_max_sequence_length: int | None = None):
if global_max_sequence_length and self.input_msl + self.output_msl > global_max_sequence_length:
# In this case, we just need to make sure that the length of the synthetic data exceeds
# max_sequence_length by at most a small amount
self.input_msl = global_max_sequence_length // 2 + 1
self.output_msl = global_max_sequence_length // 2 + 1
inputs = {
self.input_feature_name: torch.randint(0, self.vocab_size, size=(batch_size, self.input_msl))
.to(self.input_feature.input_dtype)
.to(self.trainer.device)
}
targets = {
self.output_feature_name: torch.randint(0, self.vocab_size, size=(batch_size, self.output_msl))
.to(self.output_feature.get_output_dtype())
.to(self.trainer.device)
}
self.perform_step(inputs, targets)
def perform_step(self, inputs, targets):
raise NotImplementedError("perform_step method must be implemented in subclasses")
class LLMFinetuneTrainerBatchSizeEvaluator(BaseLLMBatchSizeEvaluator):
"""Batch size evaluator for training batch size for LLM finetuning."""
def perform_step(self, inputs, targets):
self.trainer.train_step(inputs, targets)
class LLMFinetunePredictBatchSizeEvaluator(BaseLLMBatchSizeEvaluator):
"""Batch size evaluator for prediction/evaluation batch size for LLM finetuning."""
def perform_step(self, inputs, targets):
with torch.no_grad():
self.trainer.dist_model((inputs, targets))
================================================
FILE: ludwig/utils/calibration.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from abc import ABC, abstractmethod
from dataclasses import dataclass
import numpy as np
import torch
import torch.nn as nn
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import BINARY, CATEGORY
from ludwig.utils.registry import DEFAULT_KEYS, Registry
logger = logging.getLogger(__name__)
calibration_registry = Registry()
@DeveloperAPI
def register_calibration(name: str, features: str | list[str], default=False):
"""Registers a calibration implementation for a list of features."""
if isinstance(features, str):
features = [features]
def wrap(cls):
for feature in features:
feature_registry = calibration_registry.get(feature, {})
feature_registry[name] = cls
if default:
for key in DEFAULT_KEYS:
feature_registry[key] = cls
calibration_registry[feature] = feature_registry
return cls
return wrap
@DeveloperAPI
def get_calibration_cls(feature: str, calibration_method: str) -> type["CalibrationModule"] | None:
"""Get calibration class for specified feature type and calibration method."""
if not calibration_method:
return None
if feature in calibration_registry:
if calibration_method in calibration_registry[feature]:
return calibration_registry[feature][calibration_method]
else:
raise ValueError(f"Calibration method {calibration_method} not supported for {feature} output features")
else:
raise ValueError(f"Calibration not yet supported for {feature} output features")
return None
@DeveloperAPI
class ECELoss(nn.Module):
"""Calculates the Expected Calibration Error of a model.
The input to this loss is the logits of a model, NOT the softmax scores.
This divides the confidence outputs into equally-sized interval bins.
In each bin, we compute the confidence gap:
bin_gap = | avg_confidence_in_bin - accuracy_in_bin |
We then return an average of the gaps, weighted by the number of samples in each bin.
References:
Naeini, Mahdi Pakdaman, Gregory F. Cooper, and Milos Hauskrecht
"Obtaining Well Calibrated Probabilities Using Bayesian Binning." AAAI. 2015.
Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger
"On Calibration of Modern Neural Networks." PMLR 2017.
"""
def __init__(self, n_bins: int = 15):
"""n_bins (int): number of confidence interval bins."""
super().__init__()
bin_boundaries = torch.linspace(0, 1, n_bins + 1)
self.bin_lowers = bin_boundaries[:-1]
self.bin_uppers = bin_boundaries[1:]
def forward(self, logits: torch.Tensor, one_hot_labels: torch.Tensor) -> torch.Tensor:
softmaxes = nn.functional.softmax(logits, dim=1)
confidences, predictions = torch.max(softmaxes, 1)
labels = torch.argmax(one_hot_labels, 1)
accuracies = predictions.eq(labels)
ece = torch.zeros(1, device=logits.device)
for bin_lower, bin_upper in zip(self.bin_lowers, self.bin_uppers):
# Calculates |confidence - accuracy| in each bin
in_bin = confidences.gt(bin_lower.item()) * confidences.le(bin_upper.item())
prop_in_bin = in_bin.float().mean()
if prop_in_bin.item() > 0:
accuracy_in_bin = accuracies[in_bin].float().mean()
avg_confidence_in_bin = confidences[in_bin].mean()
ece += torch.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
return ece
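# Illustrative sketch (not part of Ludwig): the same binned computation as `ECELoss`,
# written in plain Python on precomputed confidences instead of logits.
# `expected_calibration_error` and the sample data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted average of |avg confidence - accuracy| over equal-width bins (lower, upper]."""
    n = len(confidences)
    ece = 0.0
    for i in range(n_bins):
        lower, upper = i / n_bins, (i + 1) / n_bins
        in_bin = [j for j, c in enumerate(confidences) if lower < c <= upper]
        if not in_bin:
            continue
        avg_confidence = sum(confidences[j] for j in in_bin) / len(in_bin)
        accuracy = sum(correct[j] for j in in_bin) / len(in_bin)
        # Weight each bin's gap by the share of samples that fall into it.
        ece += abs(avg_confidence - accuracy) * len(in_bin) / n
    return ece


# Two samples at 0.95 confidence (both right), two at 0.55 (one right, one wrong).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct, n_bins=10)
print(round(ece, 6))
```

# Each occupied bin has a gap of about 0.05 and holds half of the samples, so the
# weighted sum is 0.5 * 0.05 + 0.5 * 0.05.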
@DeveloperAPI
@dataclass
class CalibrationResult:
"""Tracks results of probability calibration."""
before_calibration_nll: float
before_calibration_ece: float
after_calibration_nll: float
after_calibration_ece: float
@DeveloperAPI
class CalibrationModule(nn.Module, ABC):
@abstractmethod
def train_calibration(
self, logits: torch.Tensor | np.ndarray, labels: torch.Tensor | np.ndarray
) -> CalibrationResult:
"""Calibrate output probabilities using logits and labels from validation set."""
raise NotImplementedError()
@DeveloperAPI
@register_calibration("temperature_scaling", [BINARY, CATEGORY], default=True)
class TemperatureScaling(CalibrationModule):
"""Implements temperature scaling of logits. Based on results from "On Calibration of Modern Neural Networks":
https://arxiv.org/abs/1706.04599. Temperature scaling scales all logits by the same constant factor. Though it
may modify output probabilities it will never change argmax or categorical top-n predictions. In the case of
binary classification with a threshold, however, calibration may change predictions.
Implementation inspired by https://github.com/gpleiss/temperature_scaling
Args:
num_classes: The number of classes. Must be 2 if binary is True.
binary: If binary is true, logits is expected to be a 1-dimensional array. If false, logits is a 2-dimensional
array of shape (num_examples, num_classes).
"""
def __init__(self, num_classes: int = 2, binary: bool = False):
super().__init__()
self.num_classes = 2 if binary else num_classes
self.binary = binary
self.device = "cuda" if torch.cuda.is_available() and torch.cuda.device_count() > 0 else "cpu"
self.temperature = nn.Parameter(torch.ones(1), requires_grad=False).to(self.device)
def train_calibration(
self, logits: torch.Tensor | np.ndarray, labels: torch.Tensor | np.ndarray
) -> CalibrationResult:
logits = torch.as_tensor(logits, dtype=torch.float32, device=self.device)
labels = torch.as_tensor(labels, dtype=torch.int64, device=self.device)
one_hot_labels = nn.functional.one_hot(labels, self.num_classes).float()
if self.binary:
# Treat binary classification as multi-class with 2 classes to re-use code.
# The math works out the same: softmax([0, a])[1] == sigmoid(a)
logits = torch.stack([torch.zeros_like(logits), logits], axis=-1)
nll_criterion = nn.CrossEntropyLoss().to(self.device)
ece_criterion = ECELoss().to(self.device)
# Saves the original temperature parameter, in case something goes wrong in optimization.
original_temperature = self.temperature.clone().detach()
self.temperature.requires_grad = True
# Calculate NLL and ECE before temperature scaling
before_calibration_nll = nll_criterion(logits, one_hot_labels).item()
before_calibration_ece = ece_criterion(logits, one_hot_labels).item()
logger.info(
"Before temperature scaling:\n"
" Negative log-likelihood: %.3f\n"
" Expected Calibration Error: %.3f" % (before_calibration_nll, before_calibration_ece)
)
# Optimizes the temperature to minimize NLL
optimizer = torch.optim.LBFGS([self.temperature], lr=0.01, max_iter=50, line_search_fn="strong_wolfe")
def eval():
optimizer.zero_grad()
loss = nll_criterion(self.scale_logits(logits), one_hot_labels)
loss.backward()
return loss
optimizer.step(eval)
# Calculate NLL and ECE after temperature scaling
after_calibration_nll = nll_criterion(self.scale_logits(logits), one_hot_labels).item()
after_calibration_ece = ece_criterion(self.scale_logits(logits), one_hot_labels).item()
logger.info("Optimal temperature: %.3f" % self.temperature.item())
logger.info(
"After temperature scaling:\n"
" Negative log-likelihood: %.3f\n"
" Expected Calibration Error: %.3f" % (after_calibration_nll, after_calibration_ece)
)
self.temperature.requires_grad = False
# This should never happen, but if expected calibration error is higher after optimizing temperature, revert.
if after_calibration_ece > before_calibration_ece:
logger.warning(
"Expected calibration error higher after scaling, "
"reverting to temperature=%.3f." % original_temperature.item()
)
with torch.no_grad():
self.temperature.data = original_temperature.data
return CalibrationResult(
before_calibration_nll, before_calibration_ece, after_calibration_nll, after_calibration_ece
)
def scale_logits(self, logits: torch.Tensor) -> torch.Tensor:
return torch.div(logits, self.temperature)
def forward(self, logits: torch.Tensor) -> torch.Tensor:
"""Converts logits to probabilities."""
scaled_logits = self.scale_logits(logits)
if self.binary:
return torch.sigmoid(scaled_logits)
else:
return torch.softmax(scaled_logits, -1)
@DeveloperAPI
@register_calibration("matrix_scaling", CATEGORY, default=False)
class MatrixScaling(CalibrationModule):
"""Implements matrix scaling of logits, as described in Beyond temperature scaling: Obtaining well-calibrated
multiclass probabilities with Dirichlet calibration https://arxiv.org/abs/1910.12656.
Unlike temperature scaling which has only one free parameter, matrix scaling has n_classes x (n_classes + 1)
parameters. Use this only with a large validation set, as matrix scaling has a tendency to overfit small datasets.
Also, unlike temperature scaling, matrix scaling can change the argmax or top-n predictions.
NOTE: Matrix Scaling is not exposed in the UI or config yet, though it may be in a future release after testing.
Args:
num_classes: The number of classes.
off_diagonal_l2: The regularization weight for off-diagonal matrix entries.
mu: The regularization weight for bias vector. Defaults to off_diagonal_l2 if not specified.
"""
def __init__(self, num_classes: int = 2, off_diagonal_l2: float = 0.01, mu: float | None = None):
super().__init__()
self.num_classes = num_classes
self.device = "cuda" if torch.cuda.is_available() and torch.cuda.device_count() > 0 else "cpu"
self.w = nn.Parameter(torch.eye(self.num_classes), requires_grad=False).to(self.device)
self.b = nn.Parameter(torch.zeros(self.num_classes), requires_grad=False).to(self.device)
self.off_diagonal_l2 = off_diagonal_l2
self.mu = off_diagonal_l2 if mu is None else mu
def train_calibration(
self, logits: torch.Tensor | np.ndarray, labels: torch.Tensor | np.ndarray
) -> CalibrationResult:
logits = torch.as_tensor(logits, dtype=torch.float32, device=self.device)
labels = torch.as_tensor(labels, dtype=torch.int64, device=self.device)
one_hot_labels = nn.functional.one_hot(labels, self.num_classes).float()
nll_criterion = nn.CrossEntropyLoss().to(self.device)
ece_criterion = ECELoss().to(self.device)
self.w.requires_grad = True
self.b.requires_grad = True
# Calculate NLL and ECE before matrix scaling
before_calibration_nll = nll_criterion(logits, one_hot_labels).item()
before_calibration_ece = ece_criterion(logits, one_hot_labels).item()
logger.info(
"Before matrix scaling:\n"
" Negative log-likelihood: %.3f\n"
" Expected Calibration Error: %.3f" % (before_calibration_nll, before_calibration_ece)
)
# Optimizes the linear transform to minimize NLL
optimizer = torch.optim.LBFGS([self.w, self.b], lr=0.001, max_iter=200, line_search_fn="strong_wolfe")
def eval():
optimizer.zero_grad()
loss = nll_criterion(self.scale_logits(logits), one_hot_labels) + self.regularization_terms()
loss.backward()
return loss
optimizer.step(eval)
# Calculate NLL and ECE after matrix scaling
after_calibration_nll = nll_criterion(self.scale_logits(logits), one_hot_labels).item()
after_calibration_ece = ece_criterion(self.scale_logits(logits), one_hot_labels).item()
logger.info(
"After matrix scaling:\n"
" Negative log-likelihood: %.3f\n"
" Expected Calibration Error: %.3f" % (after_calibration_nll, after_calibration_ece)
)
self.w.requires_grad = False
self.b.requires_grad = False
# This should never happen, but if expected calibration error is higher after optimizing matrix, revert.
if after_calibration_ece > before_calibration_ece:
logger.warning("Expected calibration error higher after matrix scaling, reverting to identity.")
with torch.no_grad():
self.w.data = torch.eye(self.num_classes, device=self.device)
self.b.data = torch.zeros(self.num_classes, device=self.device)
return CalibrationResult(
before_calibration_nll, before_calibration_ece, after_calibration_nll, after_calibration_ece
)
def regularization_terms(self) -> torch.Tensor:
"""Off-Diagonal and Intercept Regularisation (ODIR).
Described in "Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet
calibration"
https://proceedings.neurips.cc/paper/2019/file/8ca01ea920679a0fe3728441494041b9-Paper.pdf
"""
off_diagonal_entries = torch.masked_select(
self.w, ~torch.eye(self.num_classes, dtype=bool, device=self.w.device)
)
weight_matrix_loss = self.off_diagonal_l2 * torch.linalg.vector_norm(off_diagonal_entries)
bias_vector_loss = self.mu * torch.linalg.vector_norm(self.b, 2)
return bias_vector_loss + weight_matrix_loss
def scale_logits(self, logits: torch.Tensor) -> torch.Tensor:
return torch.matmul(self.w, logits.T).T + self.b
def forward(self, logits: torch.Tensor) -> torch.Tensor:
"""Converts logits to probabilities."""
return torch.softmax(self.scale_logits(logits), -1)
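# Illustrative sketch, not part of Ludwig: the TemperatureScaling docstring states that dividing all
# logits by one constant never changes the argmax, and the binary path relies on
# softmax([0, a])[1] == sigmoid(a). The helper names below (softmax, sigmoid, scale_and_softmax) are
# hypothetical and use only the standard library instead of torch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scale_and_softmax(logits, temperature):
    """Divide every logit by the same temperature, then apply softmax."""
    return softmax([x / temperature for x in logits])


logits = [2.0, 0.5, -1.0]
unscaled = scale_and_softmax(logits, 1.0)
softened = scale_and_softmax(logits, 2.5)  # temperature > 1 flattens the distribution

# The winning class is the same; only the confidence changes.
same_argmax = unscaled.index(max(unscaled)) == softened.index(max(softened))

# Binary case: stacking a zero logit next to `a` reproduces sigmoid(a).
a = 1.3
binary_gap = abs(softmax([0.0, a])[1] - sigmoid(a))
```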
================================================
FILE: ludwig/utils/carton_utils.py
================================================
import asyncio
import importlib.util
import logging
import os
import shutil
import sys
import tempfile
import traceback
from typing import Any
import torch
from ludwig.api import LudwigModel
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import NAME
from ludwig.types import ModelConfigDict
from ludwig.utils.fs_utils import open_file
logger = logging.getLogger(__name__)
INFERENCE_MODULE_TEMPLATE = """
from typing import Any, Dict, List, Tuple, Union
import torch
from ludwig.utils.types import TorchscriptPreprocessingInput
class GeneratedInferenceModule(torch.nn.Module):
def __init__(self, inference_module):
super().__init__()
self.inference_module = inference_module
def forward(self, inputs: Dict[str, Any]):
retyped_inputs: Dict[str, TorchscriptPreprocessingInput] = {{}}
for k, v in inputs.items():
assert isinstance(v, TorchscriptPreprocessingInput)
retyped_inputs[k] = v
results = self.inference_module(retyped_inputs)
return {output_dicts}
"""
def _get_output_dicts(config: ModelConfigDict) -> str:
results = []
for feature in config["output_features"]:
name = feature[NAME]
results.append(f'"{name}": results["{name}"]["predictions"]')
return "{" + ", ".join(results) + "}"
@DeveloperAPI
def generate_carton_torchscript(model: LudwigModel):
config = model.config
inference_module = model.to_torchscript()
with tempfile.TemporaryDirectory() as tmpdir:
ts_path = os.path.join(tmpdir, "generated.py")
with open_file(ts_path, "w") as f:
f.write(
INFERENCE_MODULE_TEMPLATE.format(
output_dicts=_get_output_dicts(config),
)
)
spec = importlib.util.spec_from_file_location("generated.ts", ts_path)
gen_ts = importlib.util.module_from_spec(spec)
spec.loader.exec_module(gen_ts)
gen_module = gen_ts.GeneratedInferenceModule(inference_module)
scripted_module = torch.jit.script(gen_module)
return scripted_module
def _get_input_spec(model: LudwigModel) -> list[dict[str, Any]]:
from cartonml import TensorSpec
spec = []
for feature_name, feature in model.model.input_features.items():
metadata = model.training_set_metadata[feature_name]
spec.append(
TensorSpec(
name=feature.feature_name, dtype=feature.get_preproc_input_dtype(metadata), shape=("batch_size",)
)
)
return spec
def _get_output_spec(model: LudwigModel) -> list[dict[str, Any]]:
from cartonml import TensorSpec
spec = []
for feature_name, feature in model.model.output_features.items():
metadata = model.training_set_metadata[feature_name]
spec.append(
TensorSpec(
name=feature.feature_name, dtype=feature.get_postproc_output_dtype(metadata), shape=("batch_size",)
)
)
return spec
@DeveloperAPI
def export_carton(model: LudwigModel, carton_path: str, carton_model_name="ludwig_model"):
try:
import cartonml as carton
except ImportError:
raise RuntimeError('The "cartonml-nightly" package is not installed in your environment.')
# Generate a torchscript model
model_ts = generate_carton_torchscript(model)
with tempfile.TemporaryDirectory() as tmpdir:
# Save the model to a temp dir
input_model_path: str = os.path.join(tmpdir, "model.pt")
torch.jit.save(model_ts, input_model_path)
# carton.pack is an async function so we run it and wait until it's complete
# See https://pyo3.rs/v0.20.0/ecosystem/async-await#a-note-about-asynciorun for why we wrap it
# in another function
async def pack() -> str:
try:
return await carton.pack(
path=input_model_path,
runner_name="torchscript",
# Any 2.x.x version is okay
# TODO: improve this
required_framework_version="=2",
model_name=carton_model_name,
inputs=_get_input_spec(model),
outputs=_get_output_spec(model),
)
except Exception as e:
exception_message: str = 'An Exception inside "pack()" occurred.\n'
exception_traceback: str = traceback.format_exc()
exception_message += f'{type(e).__name__}: "{str(e)}". Traceback: "{exception_traceback}".'
sys.stderr.write(exception_message)
sys.stderr.flush()
raise ValueError(exception_message) from e # Re-raise error for calling function to handle.
try:
tmp_out_path: str = asyncio.run(pack())
# Move it to the output path
shutil.move(tmp_out_path, carton_path)
except Exception as e:
exception_message: str = 'An Exception inside "export_carton()" occurred.\n'
exception_traceback: str = traceback.format_exc()
exception_message += f'{type(e).__name__}: "{str(e)}". Traceback: "{exception_traceback}".'
sys.stderr.write(exception_message)
sys.stderr.flush()
raise SystemExit(exception_message) from e # Make sure error is fatal.
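# Illustrative sketch, not part of Ludwig: how INFERENCE_MODULE_TEMPLATE and _get_output_dicts combine.
# In str.format, a doubled brace `{{}}` renders as a literal `{}`, while `{output_dicts}` is replaced
# by the generated dictionary literal. TEMPLATE, get_output_dicts, and the feature names are
# hypothetical stand-ins for the real template and config.

```python
TEMPLATE = """
def forward(inputs):
    retyped = {{}}
    results = run(inputs)
    return {output_dicts}
"""


def get_output_dicts(feature_names):
    """Build the source text of a dict literal mapping each output name to its predictions."""
    entries = [f'"{name}": results["{name}"]["predictions"]' for name in feature_names]
    return "{" + ", ".join(entries) + "}"


# Braces inside the substituted value are inserted verbatim; they are not re-parsed by format().
rendered = TEMPLATE.format(output_dicts=get_output_dicts(["label", "score"]))
```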
================================================
FILE: ludwig/utils/checkpoint_utils.py
================================================
"""Implements similar functionality as tf.train.Checkpoint and tf.train.CheckpointManager.
https://gist.github.com/kevinzakka/5d345421f7abefd5dbaf6a77f829e70a.
"""
import errno
import logging
import os
import re
import shutil
import signal
import tempfile
import uuid
from abc import ABC, abstractmethod
from collections.abc import Mapping
from glob import glob
from typing import Any, TYPE_CHECKING
import torch
from torch.optim import Optimizer
from ludwig.api_annotations import DeveloperAPI
from ludwig.globals import MODEL_WEIGHTS_FILE_NAME
from ludwig.modules.lr_scheduler import LRScheduler
if TYPE_CHECKING:
from ludwig.distributed.base import DistributedStrategy
from ludwig.models.base import BaseModel
logger = logging.getLogger(__name__)
LATEST = "latest"
BEST = "best"
@DeveloperAPI
def mkdir(s):
"""Create a directory if it doesn't already exist."""
if not os.path.exists(s):
os.makedirs(s)
@DeveloperAPI
def get_files(d, pattern, sort=True):
"""Return a list of files in a given directory.
Args:
d (str): The path to the directory.
pattern (str): The wildcard to filter files with.
sort (bool): Whether to sort the returned list. Assumes filenames contain a number value to sort by (tmp-001).
"""
files = glob(os.path.join(d, pattern))
files = [f for f in files if os.path.isfile(f)]
if sort:
def filter_numeric(s):
return re.sub("[^0-9]", "", s)
files.sort(key=lambda x: int(filter_numeric(os.path.basename(x).split(".")[0])))
return files
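# Illustrative sketch, not part of Ludwig: the sort key in get_files strips every non-digit from the
# filename stem and compares the remaining integer, so step 10 sorts after step 9. The names
# numeric_key and the example filenames are hypothetical. A stem with no digits yields int(""),
# which raises ValueError.

```python
import os
import re


def numeric_key(path):
    """Return the digits in the filename stem as an integer."""
    stem = os.path.basename(path).split(".")[0]
    return int(re.sub("[^0-9]", "", stem))


names = ["ckpt-9.ckpt", "ckpt-10.ckpt", "ckpt-2.ckpt"]
by_number = sorted(names, key=numeric_key)
by_text = sorted(names)  # plain string sort puts "ckpt-10" before "ckpt-2"
```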
@DeveloperAPI
def get_latest_checkpoint_path(directory: str) -> str | None:
latest_path = os.path.join(directory, f"{LATEST}.ckpt")
if os.path.exists(latest_path):
return latest_path
# Legacy codepath for checkpoints saved by global step number
ckpts = get_files(directory, "*.ckpt")
if ckpts:
return ckpts[-1]
return None
@DeveloperAPI
class Checkpoint(ABC):
"""Save and restore model and optimizer states."""
def __init__(
self,
distributed: "DistributedStrategy",
model: "BaseModel",
optimizer: Optimizer | None = None,
scheduler: LRScheduler | None = None,
):
"""Constructor."""
self.distributed = distributed
self.model = model
self.optimizer = optimizer
self.scheduler = scheduler
self.global_step = 0
def prepare(self, directory: str):
# create checkpoint directory if it doesn't
# already exist
mkdir(directory)
@abstractmethod
def load(self, save_path: str, device: torch.device | None = None) -> bool:
pass
@abstractmethod
def get_state_for_inference(self, save_path: str, device: torch.device | None = None) -> Mapping[str, Any]:
pass
@abstractmethod
def save(self, save_path: str, global_step: int):
pass
def _get_global_step(self, state: dict[str, Any], save_path: str) -> int:
global_step = state.get("global_step")
if global_step is None:
# Legacy step detection for older checkpoint format which encoded the
# step number in the checkpoint filename.
return int(os.path.basename(save_path).split(".")[0])
return global_step
@DeveloperAPI
class MultiNodeCheckpoint(Checkpoint):
def prepare(self, directory: str):
if self.is_local_rank_0():
super().prepare(directory)
self.distributed.barrier()
def load(self, save_path: str, device: torch.device | None = None) -> bool:
"""Load state from a saved checkpoint.
Args:
save_path (str): The filepath to the saved checkpoint.
device (torch.device): The device on which to
load the state.
Returns:
True if the checkpoint was successfully loaded, False if the checkpoint file
could not be found.
"""
try:
state = torch.load(save_path, map_location=device)
try:
self.global_step = self._get_global_step(state, save_path)
_, unexpected_keys = self.model.load_state_dict(state[MODEL_WEIGHTS_FILE_NAME], strict=False)
assert unexpected_keys == [], f"Unexpected keys found in state dict: {unexpected_keys}"
if self.optimizer is not None:
self.optimizer.load_state_dict(state["optim_state"])
if self.scheduler is not None and "scheduler_state" in state:
self.scheduler.load_state_dict(state["scheduler_state"])
logger.info(f"Successfully loaded model weights from {save_path}.")
return True
except Exception as e:
# there was an issue loading the state which means
# either the model definition and saved weights
# do not agree or they were not saved in the first
# place.
# since this is a severe issue, we raise an error
# rather than allowing the program to proceed.
raise e
except FileNotFoundError as e:
logger.error(e)
return False
def get_state_for_inference(self, save_path: str, device: torch.device | None = None) -> Mapping[str, Any]:
state = torch.load(save_path, map_location=device)
return state[MODEL_WEIGHTS_FILE_NAME]
def save(self, save_path: str, global_step: int):
"""Save a state to disk.
Modified from brentyi/fannypack.
Args:
save_path (str): The name of the checkpoint to save.
global_step (int): The iteration number which will be used
to name the checkpoint.
"""
if self.is_local_rank_0():
state = {
"global_step": global_step,
MODEL_WEIGHTS_FILE_NAME: self.get_model_state_dict(),
}
if self.optimizer is not None:
state["optim_state"] = self.optimizer.state_dict()
if self.scheduler is not None:
state["scheduler_state"] = self.scheduler.state_dict()
# ignore ctrl+c while saving
try:
orig_handler = signal.getsignal(signal.SIGINT)
signal.signal(signal.SIGINT, lambda _sig, _frame: None)
except ValueError:
# signal throws a ValueError if we're not in the main thread
orig_handler = None
try:
# atomic save
with tempfile.TemporaryDirectory() as tmpdir:
# Save to a temporary directory outside of the checkpoint dir so
# async processes do not try and copy a partially-written checkpoint.
# See Ray Tune and MLFlow for examples of background processes that
# are affected by this.
tmp_path = os.path.join(tmpdir, "temp.ckpt")
torch.save(state, tmp_path)
self.safe_move_file(tmp_path, save_path)
logger.debug(f"Saved checkpoint at {save_path}.")
finally:
# restore SIGINT handler
if orig_handler is not None:
signal.signal(signal.SIGINT, orig_handler)
self.distributed.barrier()
def get_model_state_dict(self) -> dict[str, Any]:
state = self.model.state_dict()
# Remove frozen parameter weights from state_dict for adapters and pretrained models
for n, p in self.model.named_parameters():
if n in state and not p.requires_grad:
del state[n]
return state
def is_local_rank_0(self) -> bool:
return self.distributed.local_rank() == 0
def safe_move_file(self, src: str, dst: str):
"""Move a file from one directory to another, possibly across filesystems.
This implementation specifically addresses the following issue with distributed training:
1. The `save_path` is a directory local to the node, in which case every node should write
checkpoints separately.
2. The `save_path` is a remote / global filesystem like NFS, in which case only the coordinator
should write checkpoints.
"""
try:
os.replace(src, dst)
except OSError as err:
if err.errno == errno.EXDEV:
# Tried to move to an external filesystem. This means we should only run this on the coordinator
if not self.distributed.is_coordinator():
logger.info(
f"Skipping writing checkpoint from rank {self.distributed.rank()} as it is not the coordinator "
f"and the destination filesystem is remote."
)
return
# Generate a unique ID, and copy `src` to the target directory with a temporary name `{dst}.{copy_id}.tmp`.
# Because we're copying across a filesystem boundary, this initial copy may not be atomic. We insert a
# random UUID so if different processes are copying into `dst`, they don't overlap in their tmp
# copies.
copy_id = uuid.uuid4()
tmp_dst = f"{dst}.{copy_id}.tmp"
shutil.copyfile(src, tmp_dst)
# Atomic replace file onto the new name, and clean up original source file.
os.replace(tmp_dst, dst)
os.unlink(src)
else:
raise
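# Illustrative sketch, not part of Ludwig: the write-then-rename pattern used by save() and
# safe_move_file(), without the distributed coordinator check. The function name atomic_move and the
# file names are hypothetical. os.replace is atomic within one filesystem; on errno.EXDEV the file is
# first copied next to the destination under a unique temporary name and then renamed into place.

```python
import errno
import os
import shutil
import tempfile
import uuid


def atomic_move(src, dst):
    """Move src onto dst, falling back to copy-then-rename across filesystems."""
    try:
        os.replace(src, dst)
    except OSError as err:
        if err.errno != errno.EXDEV:
            raise
        # Cross-filesystem move: copy under a unique name, then rename atomically.
        tmp_dst = f"{dst}.{uuid.uuid4()}.tmp"
        shutil.copyfile(src, tmp_dst)
        os.replace(tmp_dst, dst)
        os.unlink(src)


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "temp.ckpt")
    dst = os.path.join(workdir, "latest.ckpt")
    with open(src, "wb") as f:
        f.write(b"state")
    atomic_move(src, dst)
    moved_ok = os.path.exists(dst) and not os.path.exists(src)
    with open(dst, "rb") as f:
        payload = f.read()
```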
@DeveloperAPI
class CheckpointManager:
"""A model and optimizer checkpoint manager."""
def __init__(self, checkpoint: Checkpoint, directory: str, device: torch.device):
"""Constructor.
Args:
checkpoint (Checkpoint): An instance of `Checkpoint`.
directory (str): The directory in which checkpoints will be saved.
device (torch.device): The computing device on which to restore
checkpoints.
"""
self.checkpoint = checkpoint
self.directory = directory
self.device = device
self.latest_checkpoint = None
self.checkpoint.prepare(self.directory)
def restore_or_initialize(self) -> int:
"""Restore items in checkpoint from the latest checkpoint file.
Returns:
The global iteration step. This is parsed from the latest
checkpoint file if one is found, else 0 is returned.
"""
last_ckpt = get_latest_checkpoint_path(self.directory)
if last_ckpt:
status = self.checkpoint.load(last_ckpt, self.device)
if not status:
logger.warning("Could not restore latest checkpoint file.")
return 0
self.latest_checkpoint = last_ckpt
return self.checkpoint.global_step
return 0
def save(self, global_step: int, tag: str = LATEST):
"""Create a new checkpoint.
Args:
global_step (int): The iteration number stored in the checkpoint.
tag (str): The tag used to name the checkpoint file, e.g. `latest` or `best`.
"""
save_path = os.path.join(self.directory, f"{tag}.ckpt")
self.checkpoint.save(save_path, global_step)
self.latest_checkpoint = save_path
def save_best(self, global_step: int):
self.save(global_step, BEST)
def load(self, tag: str = LATEST):
"""Load a checkpoint.
Args:
tag (str): The tag of the checkpoint to load.
"""
save_path = os.path.join(self.directory, f"{tag}.ckpt")
self.checkpoint.load(save_path, self.device)
def get_best_checkpoint_state_for_inference(self, device: torch.device) -> Mapping[str, Any] | None:
save_path = os.path.join(self.directory, f"{BEST}.ckpt")
try:
return self.checkpoint.get_state_for_inference(save_path, device)
except Exception:
# This exception may be hit if the best checkpoint does not exist. This can happen if the model runs into
# NaN loss because of NaN or inf values in the weights before the first checkpoint is saved. In this case,
# log the error and return None.
logger.error(f"Could not load best checkpoint state from {save_path}. Best checkpoint may not exist.")
return None
def close(self):
pass
@staticmethod
def load_latest_checkpoint(checkpoint: Checkpoint, directory: str, device: torch.device):
last_ckpt = get_latest_checkpoint_path(directory)
if last_ckpt:
checkpoint.load(last_ckpt, device)
else:
raise FileNotFoundError(f"No checkpoints found in {directory}.")
================================================
FILE: ludwig/utils/config_utils.py
================================================
from typing import Any
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import (
DECODER,
ENCODER,
IMAGE,
INPUT_FEATURES,
MODEL_ECD,
MODEL_LLM,
MODEL_TYPE,
PREPROCESSING,
SEQUENCE,
TEXT,
TIMESERIES,
TYPE,
)
from ludwig.features.feature_registries import get_input_type_registry
from ludwig.schema.model_config import ModelConfig
from ludwig.types import FeatureConfigDict, FeatureTypeDefaultsDict, PreprocessingConfigDict
@DeveloperAPI
def get_feature_type_parameter_values_from_section(
config: ModelConfig, features_section: str, feature_type: str, parameter_name: str
) -> set:
"""Returns the set of all parameter values used for the given features_section, feature_type, and
parameter_name."""
parameter_values = set()
for feature in config[features_section]:
if feature[TYPE] == feature_type:
if parameter_name in feature:
parameter_values.add(feature[parameter_name])
elif parameter_name in feature[ENCODER]:
parameter_values.add(feature[ENCODER][parameter_name])
elif parameter_name in feature[DECODER]:
parameter_values.add(feature[DECODER][parameter_name])
return parameter_values
@DeveloperAPI
def get_defaults_section_for_feature_type(
feature_type: str,
config_defaults: FeatureTypeDefaultsDict,
config_defaults_section: str,
) -> FeatureConfigDict:
"""Returns a dictionary of all default parameter values specified in the global defaults section for the
config_defaults_section of the feature_type."""
if feature_type not in config_defaults:
return {}
if config_defaults_section not in config_defaults[feature_type]:
return {}
return config_defaults[feature_type][config_defaults_section]
def _to_dict(obj) -> dict:
"""Convert a config object or dict to a plain dict."""
if isinstance(obj, dict):
return obj
return obj.to_dict()
def get_preprocessing_params(config_obj: ModelConfig) -> PreprocessingConfigDict:
"""Returns a new dictionary that merges preprocessing section of config with type-specific preprocessing
parameters from config defaults."""
preprocessing_params = {}
preprocessing_params.update(_to_dict(config_obj.preprocessing))
for feat_type in get_input_type_registry().keys():
if hasattr(config_obj.defaults, feat_type):
feat_defaults = getattr(config_obj.defaults, feat_type)
preprocessing = (
feat_defaults.preprocessing
if not isinstance(feat_defaults, dict)
else feat_defaults.get("preprocessing", {})
)
preprocessing_params[feat_type] = _to_dict(preprocessing)
return preprocessing_params
@DeveloperAPI
def merge_config_preprocessing_with_feature_specific_defaults(
config_preprocessing: PreprocessingConfigDict, config_defaults: FeatureTypeDefaultsDict
) -> PreprocessingConfigDict:
"""Returns a new dictionary that merges preprocessing section of config with type-specific preprocessing
parameters from config defaults."""
preprocessing_params = {}
preprocessing_params.update(config_preprocessing)
for feature_type in config_defaults:
preprocessing_params[feature_type] = config_defaults[feature_type].get(PREPROCESSING, {})
return preprocessing_params
def has_trainable_encoder(config: ModelConfig) -> bool:
for feature in config.input_features.to_list():
encoder = feature.get("encoder", {})
if encoder.get("trainable", False):
# TODO(travis): we assume here that False is always the default, which may not be true. We should derive
# this from the schema.
return True
return False
def has_unstructured_input_feature(config: ModelConfig) -> bool:
for feature in config.input_features.to_list():
if feature.get("type", None) in {TEXT, IMAGE, SEQUENCE, TIMESERIES}:
return True
return False
def has_pretrained_encoder(config: ModelConfig) -> bool:
for feature in config.input_features:
if feature.encoder.is_pretrained():
return True
return False
def config_uses_llm(config: dict[str, Any] | ModelConfig) -> bool:
"""Determine if a config uses an LLM.
Args:
config: Ludwig config object or dictionary
Returns:
True if the model type is LLM or if the model uses an LLM encoder, otherwise False.
"""
uses_llm = False
# For a valid config, model_type LLM is automatically True
# ECD models need to be checked for at least one LLM text encoder
if isinstance(config, ModelConfig):
if config.model_type == MODEL_LLM:
uses_llm = True
else:
for feature in config.input_features:
if feature.encoder and feature.encoder.type == MODEL_LLM:
uses_llm = True
break
elif isinstance(config, dict) and config:
if config.get(MODEL_TYPE, MODEL_ECD) == MODEL_LLM:
uses_llm = True
elif INPUT_FEATURES in config:
for feature in config.get(INPUT_FEATURES, []):
if feature.get(ENCODER, {}).get(TYPE) == MODEL_LLM:
uses_llm = True
break
else:
raise ValueError(
"Invalid config cannot be checked for LLM usage because it has no input features. " f"Config: {config}"
)
else:
raise ValueError(f"Invalid config cannot be checked for LLM usage. Config: {config}")
return uses_llm
def get_quantization(config: dict[str, Any] | ModelConfig) -> list[int | None]:
"""Get the quantization specified in a config at any level.
Args:
config: Ludwig config object or dictionary
Returns:
For LLM models, the value of quantization.bits or None if it is not specified.
For ECD models, the list of values of quantization.bits for each encoder. If the encoder does not
support quantization or no quantization config is specified, the list entry is None.
"""
if isinstance(config, ModelConfig):
if config.model_type == MODEL_LLM:
return [config.quantization.bits] if config.quantization else [None]
else:
quantization_bits = []
for feature in config.input_features:
try:
quantization = feature.encoder.quantization.bits
except AttributeError:
quantization = None
quantization_bits.append(quantization)
return quantization_bits
elif isinstance(config, dict) and config:
if config.get(MODEL_TYPE, MODEL_ECD) == MODEL_LLM:
return [config.get("quantization", {}).get("bits")]
elif INPUT_FEATURES in config:
quantization_bits = []
for feature in config.get(INPUT_FEATURES, []):
quantization_bits.append(feature.get(ENCODER, {}).get("quantization", {}).get("bits"))
return quantization_bits
else:
raise ValueError(
"Invalid config cannot be checked for quantization because it has no input features. "
f"Config: {config}"
)
else:
raise ValueError(f"Invalid config cannot be checked for quantization. Config: {config}")
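# Illustrative sketch, not part of Ludwig: the dictionary branch of get_quantization, reduced to plain
# dict lookups. The string keys "model_type", "ecd", "llm", "input_features", and "encoder" are
# assumed values of the Ludwig constants, and quantization_bits_from_dict is a hypothetical name.
# Unlike the original, this sketch does not raise when input features are missing.

```python
def quantization_bits_from_dict(config):
    """Return a list of quantization bit widths: one entry for LLMs, one per encoder otherwise."""
    if config.get("model_type", "ecd") == "llm":
        return [config.get("quantization", {}).get("bits")]
    return [
        feature.get("encoder", {}).get("quantization", {}).get("bits")
        for feature in config.get("input_features", [])
    ]


llm_bits = quantization_bits_from_dict({"model_type": "llm", "quantization": {"bits": 4}})
ecd_bits = quantization_bits_from_dict(
    {
        "input_features": [
            {"name": "title", "encoder": {"type": "parallel_cnn"}},
            {"name": "body", "encoder": {"type": "llm", "quantization": {"bits": 8}}},
        ]
    }
)
```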
================================================
FILE: ludwig/utils/data_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import base64
import collections.abc
import contextlib
import csv
import dataclasses
import functools
import hashlib
import json
import logging
import os
import os.path
import pickle
import random
import re
import tempfile
import threading
from itertools import islice
from typing import Any
import numpy as np
import pandas as pd
import pyarrow as pa
import yaml
from fsspec.config import conf, set_conf_files
from pandas.errors import ParserError
from sklearn.model_selection import KFold
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import PREPROCESSING, SPLIT
from ludwig.data.cache.types import CacheableDataset
from ludwig.globals import MODEL_HYPERPARAMETERS_FILE_NAME, MODEL_WEIGHTS_FILE_NAME, TRAIN_SET_METADATA_FILE_NAME
from ludwig.utils.dataframe_utils import from_numpy_dataset, is_dask_lib, to_numpy_dataset
from ludwig.utils.fs_utils import download_h5, has_remote_protocol, open_file, upload_h5
from ludwig.utils.math_utils import cumsum
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.types import DataFrame
try:
import dask
import dask.dataframe as dd
DASK_DF_FORMATS = {dd.DataFrame}
except ImportError:
DASK_DF_FORMATS = set()
dd = None
logger = logging.getLogger(__name__)
DATASET_SPLIT_URL = "dataset_{}_fp"
DATA_PROCESSED_CACHE_DIR = "data_processed_cache_dir"
DATA_TRAIN_HDF5_FP = "data_train_hdf5_fp"
DATA_TRAIN_PARQUET_FP = "data_train_parquet_fp"
DATA_VALIDATION_PARQUET_FP = "data_validation_parquet_fp"
DATA_TEST_PARQUET_FP = "data_test_parquet_fp"
HDF5_COLUMNS_KEY = "columns"
DICT_FORMATS = {"dict", "dictionary", dict}
DATAFRAME_FORMATS = {"dataframe", "df", pd.DataFrame} | DASK_DF_FORMATS
CSV_FORMATS = {"csv"}
TSV_FORMATS = {"tsv"}
JSON_FORMATS = {"json"}
JSONL_FORMATS = {"jsonl"}
EXCEL_FORMATS = {"excel"}
PARQUET_FORMATS = {"parquet"}
PICKLE_FORMATS = {"pickle"}
FEATHER_FORMATS = {"feather"}
FWF_FORMATS = {"fwf"}
HTML_FORMATS = {"html"}
ORC_FORMATS = {"orc"}
SAS_FORMATS = {"sas"}
SPSS_FORMATS = {"spss"}
STATA_FORMATS = {"stata"}
HDF5_FORMATS = {"hdf5", "h5"}
CACHEABLE_FORMATS = set.union(
*(
CSV_FORMATS,
TSV_FORMATS,
JSON_FORMATS,
JSONL_FORMATS,
EXCEL_FORMATS,
PARQUET_FORMATS,
PICKLE_FORMATS,
FEATHER_FORMATS,
FWF_FORMATS,
HTML_FORMATS,
ORC_FORMATS,
SAS_FORMATS,
SPSS_FORMATS,
STATA_FORMATS,
DATAFRAME_FORMATS,
)
)
PANDAS_DF = pd
# Lock over the entire interpreter as we can only have one set
# of credentials scoped to the interpreter at once.
GLOBAL_CRED_LOCK = threading.Lock()
@DeveloperAPI
def get_parquet_filename(n: int):
"""Left pads the partition number with zeros to preserve order in downstream reads.
Downstream reads use the filename to determine the lexical order of the partitions.
"""
return f"part.{str(n).zfill(8)}.parquet"
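# Illustrative sketch, not part of Ludwig: why get_parquet_filename zero-pads the partition number.
# With padding, lexical order of the filenames matches numeric order of the partitions; without it,
# partition 10 sorts before partition 2. The name parquet_filename is a hypothetical copy of the
# function above.

```python
def parquet_filename(n):
    """Left-pad the partition number to 8 digits."""
    return f"part.{str(n).zfill(8)}.parquet"


partitions = (2, 10, 1)
padded = sorted(parquet_filename(n) for n in partitions)
unpadded = sorted(f"part.{n}.parquet" for n in partitions)
```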
@DeveloperAPI
def get_split_path(dataset_fp):
return os.path.splitext(dataset_fp)[0] + ".split.parquet"
@DeveloperAPI
def get_abs_path(src_path, file_path):
if has_remote_protocol(file_path):
return file_path
elif src_path is not None:
return os.path.join(src_path, file_path)
else:
return file_path
@DeveloperAPI
def load_csv(data_fp):
with open_file(data_fp, "r", encoding="utf8") as f:
data = list(csv.reader(f))
return data
# Decorator used to encourage Dask on Ray to spread out data loading across workers
@DeveloperAPI
def spread(fn):
def wrapped_fn(*args, **kwargs):
if dd is None or not hasattr(dask, "annotate"):
return fn(*args, **kwargs)
with dask.annotate(ray_remote_args=dict(scheduling_strategy="SPREAD")):
return fn(*args, **kwargs)
return wrapped_fn
def read_nrows_via_chunksize(fp, read_fn, **kwargs):
chunksize = kwargs.pop("nrows", None)
ret = read_fn(fp, chunksize=chunksize, **kwargs)
if isinstance(ret, collections.abc.Iterator):
return next(ret)
return ret
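# Illustrative sketch (not part of Ludwig): how read_nrows_via_chunksize turns `nrows` into `chunksize`
# and keeps only the first chunk. `fake_reader` is a hypothetical stand-in for a pandas reader such as
# pd.read_sas; it returns an iterator of chunks when chunksize is set, and the full data otherwise.

```python
import collections.abc


def read_nrows_via_chunksize(fp, read_fn, **kwargs):
    # Readers without an `nrows` argument can still be limited through `chunksize`.
    chunksize = kwargs.pop("nrows", None)
    ret = read_fn(fp, chunksize=chunksize, **kwargs)
    if isinstance(ret, collections.abc.Iterator):
        # Only the first chunk is consumed; the rest of the file is never read.
        return next(ret)
    return ret


def fake_reader(fp, chunksize=None):
    rows = list(range(10))
    if chunksize is None:
        return rows
    return (rows[i : i + chunksize] for i in range(0, len(rows), chunksize))


first_three = read_nrows_via_chunksize("ignored.sas", fake_reader, nrows=3)
everything = read_nrows_via_chunksize("ignored.sas", fake_reader)
```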
@DeveloperAPI
@spread
def read_xsv(data_fp, df_lib=PANDAS_DF, separator=",", header=0, nrows=None, skiprows=None, dtype=object, **kwargs):
"""Helper method to read a csv file. Wraps around pd.read_csv to handle some exceptions. Can extend to cover
cases as necessary.
:param data_fp: path to the xsv file
:param df_lib: DataFrame library used to read in the CSV
:param separator: separator to fall back on when the delimiter cannot be sniffed from the file
:param header: header argument for pandas to read the csv
:param nrows: number of rows to read from the csv, None means all
:param skiprows: number of rows to skip from the csv, None means no skips
:param dtype: dtype to use for columns. Defaults to object to disable type inference.
:return: Pandas dataframe with the data
"""
with open_file(data_fp, "r", encoding="utf8") as csvfile:
try:
dialect = csv.Sniffer().sniff(csvfile.read(1024 * 100), delimiters=[",", "\t", "|"])
separator = dialect.delimiter
except csv.Error:
# Could not conclude the delimiter, defaulting to user provided
pass
# NOTE: by default we read all XSV columns in as dtype=object, bypassing all type inference. This is to avoid silent
# issues related to incorrect type inference (e.g. NaNs in bool columns). Convert data to correct types after
# reading in.
kwargs = dict(sep=separator, header=header, skiprows=skiprows, dtype=dtype, **kwargs)
if nrows is not None:
kwargs["nrows"] = nrows
try:
df = df_lib.read_csv(data_fp, **kwargs)
except ParserError:
logger.warning("Failed to parse the CSV with pandas default way, trying \\ as escape character.")
df = df_lib.read_csv(data_fp, escapechar="\\", **kwargs)
return df
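# Illustrative sketch (not part of Ludwig): the delimiter sniffing step used by read_xsv, isolated from
# file handling. `detect_separator` is a hypothetical helper name; the candidate delimiters mirror the
# ones passed to csv.Sniffer above, and the default is returned when sniffing raises csv.Error.

```python
import csv


def detect_separator(sample: str, default: str = ",") -> str:
    try:
        # Restrict the sniffer to the delimiters read_xsv considers.
        return csv.Sniffer().sniff(sample, delimiters=[",", "\t", "|"]).delimiter
    except csv.Error:
        # No candidate delimiter was found consistently; keep the caller's choice.
        return default


pipe_sep = detect_separator("a|b|c\n1|2|3\n4|5|6\n")
tab_sep = detect_separator("x\ty\n1\t2\n")
fallback_sep = detect_separator("hello\nworld\n", default="\t")
```

# Note that a successfully sniffed delimiter takes precedence over the `separator` argument.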
read_csv = functools.partial(read_xsv, separator=",")
read_tsv = functools.partial(read_xsv, separator="\t")
@DeveloperAPI
@spread
def read_json(data_fp, df_lib, normalize=False, **kwargs):
# Not supported unless lines=True
kwargs.pop("nrows", None)
if normalize:
return df_lib.json_normalize(load_json(data_fp))
else:
return df_lib.read_json(data_fp, **kwargs)
@DeveloperAPI
@spread
def read_jsonl(data_fp, df_lib, **kwargs):
return df_lib.read_json(data_fp, lines=True, **kwargs)
@DeveloperAPI
@spread
def read_excel(data_fp, df_lib, **kwargs):
fp_split = os.path.splitext(data_fp)
if fp_split[1] == ".xls":
excel_engine = "xlrd"
else:
excel_engine = "openpyxl"
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_excel() since dask backend does not support it")
return dd.from_pandas(pd.read_excel(data_fp, engine=excel_engine, **kwargs), npartitions=1)
return df_lib.read_excel(data_fp, engine=excel_engine, **kwargs)
@DeveloperAPI
@spread
def read_parquet(data_fp, df_lib, nrows=None, **kwargs):
if nrows is not None:
import pyarrow.parquet as pq
from ludwig.utils.fs_utils import get_fs_and_path
fs, _ = get_fs_and_path(data_fp)
dataset = pq.ParquetDataset(data_fp, filesystem=fs).fragments[0]
preview = dataset.head(nrows).to_pandas()
if is_dask_lib(df_lib):
return df_lib.from_pandas(preview, npartitions=1)
return preview
return df_lib.read_parquet(data_fp, **kwargs)
@DeveloperAPI
@spread
def read_pickle(data_fp, df_lib, **kwargs):
# Chunking is not supported for pickle files:
kwargs.pop("nrows", None)
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_pickle() since dask backend does not support it")
return dd.from_pandas(pd.read_pickle(data_fp), npartitions=1)
return df_lib.read_pickle(data_fp)
@DeveloperAPI
@spread
def read_fwf(data_fp, df_lib, **kwargs):
return df_lib.read_fwf(data_fp, **kwargs)
@DeveloperAPI
@spread
def read_feather(data_fp, df_lib, **kwargs):
# Chunking is not supported for feather files:
kwargs.pop("nrows", None)
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_feather() since dask backend does not support it")
return dd.from_pandas(pd.read_feather(data_fp), npartitions=1)
return df_lib.read_feather(data_fp)
@DeveloperAPI
@spread
def read_html(data_fp, df_lib, **kwargs):
# Chunking is not supported for html files:
kwargs.pop("nrows", None)
# Wrap literal HTML strings in StringIO to avoid pandas FutureWarning
from io import StringIO
if isinstance(data_fp, str) and not os.path.isfile(data_fp):
data_fp = StringIO(data_fp)
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_html() since dask backend does not support it")
return dd.from_pandas(pd.read_html(data_fp)[0], npartitions=1)
return df_lib.read_html(data_fp)[0]
@DeveloperAPI
@spread
def read_orc(data_fp, df_lib, **kwargs):
# Chunking is not supported for orc files:
kwargs.pop("nrows", None)
return df_lib.read_orc(data_fp, **kwargs)
@DeveloperAPI
@spread
def read_sas(data_fp, df_lib, **kwargs):
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_sas() since dask backend does not support it")
return dd.from_pandas(read_nrows_via_chunksize(data_fp, pd.read_sas, **kwargs), npartitions=1)
return read_nrows_via_chunksize(data_fp, df_lib.read_sas, **kwargs)
@DeveloperAPI
@spread
def read_spss(data_fp, df_lib, **kwargs):
# Chunking is not supported for spss files:
kwargs.pop("nrows", None)
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_spss() since dask backend does not support it")
return dd.from_pandas(pd.read_spss(data_fp), npartitions=1)
return df_lib.read_spss(data_fp)
@DeveloperAPI
@spread
def read_stata(data_fp, df_lib, **kwargs):
# https://github.com/dask/dask/issues/9055
if is_dask_lib(df_lib):
logger.warning("Falling back to pd.read_stata() since dask backend does not support it")
return dd.from_pandas(read_nrows_via_chunksize(data_fp, pd.read_stata, **kwargs), npartitions=1)
return read_nrows_via_chunksize(data_fp, df_lib.read_stata, **kwargs)
@DeveloperAPI
@spread
def read_hdf5(data_fp, **_kwargs):
return load_hdf5(data_fp, clean_cols=True)
@DeveloperAPI
@spread
def read_buffer(buf, fname):
"""Reads data in from a binary buffer by first writing the data to a temporary file, and then processes it
based on its format (hdf5, csv, tsv etc).
Useful if object is a binary buffer coming from streaming data.
"""
data_format = figure_data_format_dataset(fname)
reader_fn = data_reader_registry[data_format]
with tempfile.TemporaryDirectory() as tmpdir:
temp_name = os.path.join(tmpdir, "dataset")
with open(temp_name, "wb") as f:
f.write(buf.read())
return reader_fn(temp_name, pd)
@DeveloperAPI
@spread
def read_fname(fname, data_format=None, df_lib=pd, **kwargs):
"""This function reads data from fname using the df_lib data processing library (defaults to pandas).
Useful if you don't know the file type extension in advance.
"""
data_format = data_format or figure_data_format_dataset(fname)
reader_fn = data_reader_registry[data_format]
return reader_fn(fname, df_lib, **kwargs)
@DeveloperAPI
def save_csv(data_fp, data):
with open_file(data_fp, "w", encoding="utf-8") as csv_file:
writer = csv.writer(csv_file)
for row in data:
if not isinstance(row, collections.abc.Iterable) or isinstance(row, str):
row = [row]
writer.writerow(row)
@DeveloperAPI
def csv_contains_column(data_fp, column_name):
return column_name in read_csv(data_fp, nrows=0) # only loads header
@DeveloperAPI
def load_yaml(yaml_fp):
with open_file(yaml_fp, "r") as f:
return yaml.safe_load(f)
@DeveloperAPI
def load_config_from_str(config):
"""Load the config as either a serialized string or a path to a YAML file."""
config = yaml.safe_load(config)
if isinstance(config, str):
# Assume the caller provided a path name
with open(config, encoding="utf-8") as f:
config = yaml.safe_load(f)
return config
@DeveloperAPI
def load_json(data_fp):
with open_file(data_fp, "r") as input_file:
data = json.load(input_file)
return data
@DeveloperAPI
def save_json(data_fp, data, sort_keys=True, indent=4):
with open_file(data_fp, "w") as output_file:
json.dump(data, output_file, cls=NumpyEncoder, sort_keys=sort_keys, indent=indent)
@DeveloperAPI
def hash_dict(d: dict, max_length: int | None = 6) -> bytes:
"""Function that maps a dictionary into a unique hash.
Known limitation: All values and keys of the dict must have an ordering. If not, there's no guarantee to obtain the
same hash. For instance, values that are sets will potentially lead to different hashes when run on different
machines or in different python sessions. Replacing them with sorted lists is suggested.
"""
s = json.dumps(d, cls=NumpyEncoder, sort_keys=True, ensure_ascii=True)
h = hashlib.md5(s.encode())
d = h.digest()
b = base64.b64encode(d, altchars=b"__")
return b[:max_length]
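# Illustrative sketch (not part of Ludwig): the hashing scheme above with the plain json encoder in
# place of NumpyEncoder, so it runs with the standard library only. Because keys are sorted before
# serialization, insertion order does not affect the result.

```python
import base64
import hashlib
import json


def hash_dict(d: dict, max_length=6) -> bytes:
    # sort_keys=True makes the serialized form independent of key insertion order.
    s = json.dumps(d, sort_keys=True, ensure_ascii=True)
    digest = hashlib.md5(s.encode()).digest()
    # altchars replaces "+" and "/" with "_" so the hash is safe to use in file names.
    return base64.b64encode(digest, altchars=b"__")[:max_length]


h1 = hash_dict({"a": 1, "b": [1, 2]})
h2 = hash_dict({"b": [1, 2], "a": 1})
```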
@DeveloperAPI
def to_json_dict(d):
"""Converts Python dict to pure JSON ready format."""
return json.loads(json.dumps(d, cls=NumpyEncoder))
@DeveloperAPI
def chunk_dict(data, chunk_size=100):
"""Split large dictionary into chunks.
Source: https://stackoverflow.com/a/22878842
"""
it = iter(data)
for _ in range(0, len(data), chunk_size):
yield {k: data[k] for k in islice(it, chunk_size)}
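# Illustrative sketch (not part of Ludwig): chunk_dict applied to a small made-up dictionary. The
# function body is copied from above so the example is self-contained.

```python
from itertools import islice


def chunk_dict(data, chunk_size=100):
    it = iter(data)
    for _ in range(0, len(data), chunk_size):
        # Each pass pulls the next chunk_size keys from the shared iterator.
        yield {k: data[k] for k in islice(it, chunk_size)}


chunks = list(chunk_dict({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}, chunk_size=2))
```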
@DeveloperAPI
def flatten_dict(d, parent_key="", sep="."):
"""Based on https://www.geeksforgeeks.org/python-convert-nested-dictionary-into-flattened-dictionary/"""
items = []
for k, v in d.items():
new_key = parent_key + sep + k if parent_key else k
if isinstance(v, collections.abc.MutableMapping):
items.extend(flatten_dict(v, new_key, sep=sep).items())
elif isinstance(v, list):
list_mapping = {str(i): item for i, item in enumerate(v)}
items.extend(flatten_dict(list_mapping, new_key, sep=sep).items())
else:
items.append((new_key, v))
return dict(items)
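# Illustrative sketch (not part of Ludwig): flatten_dict on a made-up nested config. Lists are
# flattened by position, so list indices become path components in the resulting keys.

```python
import collections.abc


def flatten_dict(d, parent_key="", sep="."):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            # Treat a list as a mapping from stringified index to element.
            list_mapping = {str(i): item for i, item in enumerate(v)}
            items.extend(flatten_dict(list_mapping, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


flat = flatten_dict({"a": {"b": 1, "c": [10, {"d": 2}]}, "e": 3})
```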
@DeveloperAPI
def save_hdf5(data_fp, data):
numpy_dataset = to_numpy_dataset(data)
with upload_h5(data_fp) as h5_file:
h5_file.create_dataset(HDF5_COLUMNS_KEY, data=np.array(data.columns.values, dtype="S"))
for column in data.columns:
h5_file.create_dataset(column, data=numpy_dataset[column])
@DeveloperAPI
def load_hdf5(data_fp, clean_cols: bool = False):
with download_h5(data_fp) as hdf5_data:
columns = [s.decode("utf-8") for s in hdf5_data[HDF5_COLUMNS_KEY][()].tolist()]
numpy_dataset = {}
for column in columns:
# Column names from training hdf5 will be in the form 'Survived_a2fv4'
np_col = column.rsplit("_", 1)[0] if clean_cols else column
numpy_dataset[np_col] = hdf5_data[column][()]
return from_numpy_dataset(numpy_dataset)
@DeveloperAPI
def load_object(object_fp):
with open_file(object_fp, "rb") as f:
return pickle.load(f)
@DeveloperAPI
def save_object(object_fp, obj):
with open_file(object_fp, "wb") as f:
pickle.dump(obj, f)
@DeveloperAPI
def load_array(data_fp, dtype=float):
list_num = []
with open_file(data_fp, "r") as input_file:
for x in input_file:
list_num.append(dtype(x.strip()))
return np.array(list_num)
@DeveloperAPI
def load_matrix(data_fp, dtype=float):
list_num = []
with open_file(data_fp, "r") as input_file:
for row in input_file:
list_num.append([dtype(elem) for elem in row.strip().split()])
return np.squeeze(np.array(list_num))
@DeveloperAPI
def save_array(data_fp, array):
with open_file(data_fp, "w") as output_file:
for x in np.nditer(array):
output_file.write(str(x) + "\n")
# TODO(shreya): Confirm types of args
@DeveloperAPI
def load_pretrained_embeddings(embeddings_path: str, vocab: list[str]) -> np.ndarray:
"""Create an embedding matrix of all words in vocab."""
embeddings, embeddings_size = load_glove(embeddings_path, return_embedding_size=True)
# calculate an average embedding, to use for initializing missing words
avg_embedding = [embeddings[w] for w in vocab if w in embeddings]
avg_embedding = sum(avg_embedding) / len(avg_embedding)
# create the embedding matrix
embeddings_vectors = []
for word in vocab:
if word in embeddings:
embeddings_vectors.append(embeddings[word])
else:
embeddings_vectors.append(avg_embedding + np.random.uniform(-0.01, 0.01, embeddings_size))
embeddings_matrix = np.stack(embeddings_vectors)
# let's help the garbage collector free some memory
embeddings = None
return embeddings_matrix
@DeveloperAPI
@functools.lru_cache(1)
def load_glove(file_path: str, return_embedding_size: bool = False) -> dict[str, np.ndarray]:
"""Loads Glove embeddings for each word.
Returns:
Mapping between word and numpy array of size embedding_size as set by
first line of file.
"""
logger.info(f" Loading Glove format file {file_path}")
embeddings = {}
embedding_size = 0
# collect embeddings size assuming the first line is correct
with open_file(file_path, "r", encoding="utf-8") as f:
found_line = False
while not found_line:
line = f.readline()
if line:
embedding_size = len(line.split()) - 1
found_line = True
# collect embeddings
with open_file(file_path, "r", encoding="utf-8") as f:
for line_number, line in enumerate(f):
if line:
try:
split = line.split()
if len(split) != embedding_size + 1:
raise ValueError(
f"Line {line_number} is of length {len(split)}, "
f"while expected length is {embedding_size + 1}."
)
word = split[0]
embedding = np.array([float(val) for val in split[-embedding_size:]])
embeddings[word] = embedding
except ValueError:
logger.warning(f"Line {line_number} in the GloVe file {file_path} is malformed, skipping it")
logger.info(f" {len(embeddings)} embeddings loaded")
if return_embedding_size:
return embeddings, embedding_size
return embeddings
@DeveloperAPI
def split_data(split: float, data: list) -> tuple[list, list]:
split_length = int(round(split * len(data)))
random.shuffle(data)
return data[:split_length], data[split_length:]
@DeveloperAPI
def split_by_slices(slices: list[Any], n: int, probabilities: list[float]) -> list[Any]:
splits = []
indices = cumsum([int(x * n) for x in probabilities])
start = 0
for end in indices:
splits.append(slices[start:end])
start = end
return splits
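# Illustrative sketch (not part of Ludwig): split_by_slices with itertools.accumulate standing in for
# the `cumsum` helper, which is assumed here to return running totals. Because int() truncates each
# split size, the sizes may sum to less than n, in which case trailing items land in no split.

```python
from itertools import accumulate


def split_by_slices(slices, n, probabilities):
    splits = []
    start = 0
    # Running totals of the truncated split sizes give the end index of each split.
    for end in accumulate(int(p * n) for p in probabilities):
        splits.append(slices[start:end])
        start = end
    return splits


# Sizes 4 + 2 + 2 cover all 8 items.
exact = split_by_slices(list(range(8)), 8, [0.5, 0.25, 0.25])

# Sizes 5 + 2 + 2 cover only 9 of 10 items; the last item is left out.
truncated = split_by_slices(list(range(10)), 10, [0.5, 0.25, 0.25])
```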
@DeveloperAPI
def shuffle_unison_inplace(list_of_lists, random_state=None):
if list_of_lists:
assert all(len(single_list) == len(list_of_lists[0]) for single_list in list_of_lists)
if random_state is not None:
p = random_state.permutation(len(list_of_lists[0]))
else:
p = np.random.permutation(len(list_of_lists[0]))
return [single_list[p] for single_list in list_of_lists]
return None
@DeveloperAPI
def shuffle_dict_unison_inplace(np_dict, random_state=None):
keys = list(np_dict.keys())
list_of_lists = list(np_dict.values())
# Shuffle all value arrays with the same permutation via shuffle_unison_inplace
shuffled_list = shuffle_unison_inplace(list_of_lists, random_state)
recon = {}
for ii, dkey in enumerate(keys):
recon[dkey] = shuffled_list[ii]
# Despite the function name, the input dict is not mutated; a new dict of shuffled arrays is returned
return recon
@DeveloperAPI
def split_dataset_ttv(dataset, split):
# Obtain distinct splits from the split column. If
# a split is not present in this set, then we can skip generating
# the dataframe for that split.
if dataset[split].dtype != int:
dataset[split] = dataset[split].astype(int)
distinct_values = dataset[split].drop_duplicates()
if hasattr(distinct_values, "compute"):
distinct_values = distinct_values.compute()
distinct_values = set(distinct_values.values.tolist())
training_set = split_dataset(dataset, split, 0) if 0 in distinct_values else None
validation_set = split_dataset(dataset, split, 1) if 1 in distinct_values else None
test_set = split_dataset(dataset, split, 2) if 2 in distinct_values else None
return training_set, test_set, validation_set
@DeveloperAPI
def split_dataset(dataset, split, value_to_split=0):
split_df = dataset[dataset[split] == value_to_split]
return split_df
@DeveloperAPI
def collapse_rare_labels(labels, labels_limit):
if labels_limit > 0:
labels[labels >= labels_limit] = labels_limit
return labels
@DeveloperAPI
def class_counts(dataset, labels_field):
return np.bincount(dataset[labels_field].flatten()).tolist()
@DeveloperAPI
def load_from_file(file_name, field=None, dtype=int, ground_truth_split=2):
"""Load experiment data from supported file formats.
Experiment data can be test/train statistics, model predictions, probability, ground truth, ground truth metadata.
:param file_name: Path to file to be loaded
:param field: Target Prediction field.
:param dtype: dtype used to parse values when the file is loaded as a plain-text matrix
:param ground_truth_split: Ground truth split filter, where 0 is train, 1 is validation, and 2 is test. By
default, the test split is used when loading ground truth from hdf5.
:return: Experiment data as array
"""
if file_name.endswith(".hdf5") and field is not None:
dataset = pd.read_hdf(file_name, key=HDF5_COLUMNS_KEY)
column = dataset[field]
array = column[dataset[SPLIT] == ground_truth_split].values # ground truth
elif file_name.endswith(".npy"):
array = np.load(file_name)
elif file_name.endswith(".csv"):
array = read_csv(file_name, header=None).values
else:
array = load_matrix(file_name, dtype)
return array
@DeveloperAPI
def replace_file_extension(file_path, extension):
"""Return a file path for a file with the same name but a different format.
For example, ("a.csv", "json") -> "a.json" and ("a.csv", "hdf5") -> "a.hdf5".
:param file_path: original file path
:param extension: file extension
:return: file path with same name but different format
"""
if file_path is None:
return None
extension = extension.strip()
if extension.startswith("."):
# Handle the case if the user calls with '.hdf5' instead of 'hdf5'
extension = extension[1:]
return os.path.splitext(file_path)[0] + "." + extension
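# Illustrative sketch (not part of Ludwig): replace_file_extension on made-up paths. Only the last
# extension is replaced, and a leading dot or surrounding whitespace in `extension` is tolerated.

```python
import os


def replace_file_extension(file_path, extension):
    if file_path is None:
        return None
    extension = extension.strip()
    if extension.startswith("."):
        extension = extension[1:]
    return os.path.splitext(file_path)[0] + "." + extension


plain = replace_file_extension("data/train.csv", "json")
dotted = replace_file_extension("data/train.csv", " .hdf5 ")
double_ext = replace_file_extension("archive.tar.gz", "zip")
missing = replace_file_extension(None, "json")
```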
@DeveloperAPI
def file_exists_with_diff_extension(file_path, extension):
return file_path is None or os.path.isfile(replace_file_extension(file_path, extension))
@DeveloperAPI
def add_sequence_feature_column(df, col_name, seq_length):
"""Adds a new column to the dataframe computed from an existing column. Values in the new column are
space-delimited strings composed of the preceding values of the same column, up to seq_length of them. For
example, the value in the i-th row of the new column is a space-delimited string of
df[col_name][i-seq_length:i]. The first seq_length rows have no full window and are backfilled from the
first complete one.
:param df: input dataframe
:param col_name: column name containing sequential data
:param seq_length: number of preceding column values to use
"""
if col_name not in df.columns.values:
logger.error(f"{col_name} column does not exist")
return
new_col_name = col_name + "_feature"
if new_col_name in df.columns.values:
logger.warning(f"{new_col_name} column already exists, values will be overridden")
new_data = [None] * seq_length
old_data = np.array(df[col_name])
for i in range(seq_length, len(df)):
new_data.append(" ".join(str(j) for j in old_data[i - seq_length : i]))
df[new_col_name] = new_data
df[new_col_name] = df[new_col_name].bfill()
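# Illustrative sketch (not part of Ludwig): a pure-Python restatement of the windowing and backfill
# performed by add_sequence_feature_column, using a list instead of a dataframe column.
# `sequence_windows` is a hypothetical helper name and the values are made up.

```python
def sequence_windows(values, seq_length):
    # The first seq_length rows have no complete window of preceding values.
    windows = [None] * seq_length
    for i in range(seq_length, len(values)):
        windows.append(" ".join(str(v) for v in values[i - seq_length : i]))

    # Backfill: replace each missing entry with the next available window.
    next_valid = None
    for i in range(len(windows) - 1, -1, -1):
        if windows[i] is None:
            windows[i] = next_valid
        else:
            next_valid = windows[i]
    return windows


feature = sequence_windows([1, 2, 3, 4, 5], seq_length=2)
```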
@DeveloperAPI
def override_in_memory_flag(input_features, override_value):
num_overrides = 0
for feature in input_features:
if PREPROCESSING in feature:
if "in_memory" in feature[PREPROCESSING]:
feature[PREPROCESSING]["in_memory"] = override_value
num_overrides += 1
return num_overrides
@DeveloperAPI
def normalize_numpy(obj):
if isinstance(obj, np.integer):
return int(obj)
elif isinstance(obj, np.floating):
return float(obj)
elif isinstance(obj, np.ndarray):
return normalize_numpy(obj.tolist())
elif isinstance(obj, list):
return [normalize_numpy(v) for v in obj]
else:
return obj
@DeveloperAPI
class NumpyEncoder(json.JSONEncoder):
"""Custom JSON encoder for handling NumPy objects.
This encoder extends the `json.JSONEncoder` class and provides
custom serialization for NumPy objects. It converts NumPy arrays,
sets, tuples, integers, floating-point numbers, booleans, and
dataclasses to their JSON serializable equivalents.
Attributes:
None
Methods:
default: Overrides the default method of `json.JSONEncoder`
to provide custom serialization for NumPy objects.
Usage:
Use this encoder when serializing objects that contain NumPy
arrays or other NumPy objects to JSON.
Example:
encoder = NumpyEncoder()
json_data = encoder.encode(data)
"""
def default(self, o):
if isinstance(o, (set, tuple)):
return list(o)
elif isinstance(o, np.bool_):
return bool(o)
elif isinstance(o, np.integer):
return int(o)
elif isinstance(o, np.floating):
return float(o)
elif isinstance(o, np.ndarray):
return o.tolist()
elif dataclasses.is_dataclass(o):
return dataclasses.asdict(o)
elif hasattr(o, "to_dict"):
return o.to_dict()
else:
return json.JSONEncoder.default(self, o)
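# Illustrative sketch (not part of Ludwig): the `default` hook pattern used by NumpyEncoder, reduced to
# standard-library types so it runs without NumPy. `FallbackEncoder` and `Point` are hypothetical names.
# json only calls `default` for objects it cannot serialize natively, so tuples never reach it.

```python
import dataclasses
import json


@dataclasses.dataclass
class Point:
    x: int
    y: int


class FallbackEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, set):
            return list(o)
        if dataclasses.is_dataclass(o) and not isinstance(o, type):
            return dataclasses.asdict(o)
        # Anything else raises TypeError, as in the base class.
        return super().default(o)


encoded = json.dumps({"tags": {"a"}, "origin": Point(0, 0)}, cls=FallbackEncoder, sort_keys=True)
decoded = json.loads(encoded)
```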
@DeveloperAPI
def generate_kfold_splits(data_df, num_folds, random_state):
kf = KFold(n_splits=num_folds, shuffle=True, random_state=random_state)
fold_num = 0
for train_indices, test_indices in kf.split(data_df):
fold_num += 1
yield train_indices, test_indices, fold_num
@DeveloperAPI
def get_path_size(start_path, regex_accept=None, regex_reject=None):
total_size = 0
pattern_accept = re.compile(regex_accept) if regex_accept else None
pattern_reject = re.compile(regex_reject) if regex_reject else None
for dirpath, _, filenames in os.walk(start_path):
for filename in filenames:
filepath = os.path.join(dirpath, filename)
if not os.path.islink(filepath):
accepted = True
if pattern_accept:
accepted = accepted and pattern_accept.match(filename)
if pattern_reject:
accepted = accepted and not pattern_reject.match(filename)
if accepted:
total_size += os.path.getsize(filepath)
return total_size
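# Illustrative sketch (not part of Ludwig): get_path_size exercised on a throwaway directory. The
# function is restated so the example is self-contained; file names and contents are made up. The
# patterns are applied with re.match, so they are anchored at the start of the file name only.

```python
import os
import re
import tempfile


def get_path_size(start_path, regex_accept=None, regex_reject=None):
    total_size = 0
    pattern_accept = re.compile(regex_accept) if regex_accept else None
    pattern_reject = re.compile(regex_reject) if regex_reject else None
    for dirpath, _, filenames in os.walk(start_path):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if os.path.islink(filepath):
                continue
            accepted = True
            if pattern_accept:
                accepted = accepted and bool(pattern_accept.match(filename))
            if pattern_reject:
                accepted = accepted and not pattern_reject.match(filename)
            if accepted:
                total_size += os.path.getsize(filepath)
    return total_size


with tempfile.TemporaryDirectory() as tmpdir:
    os.makedirs(os.path.join(tmpdir, "sub"))
    with open(os.path.join(tmpdir, "a.csv"), "wb") as f:
        f.write(b"abc")  # 3 bytes
    with open(os.path.join(tmpdir, "sub", "b.txt"), "wb") as f:
        f.write(b"hello")  # 5 bytes

    total = get_path_size(tmpdir)
    csv_only = get_path_size(tmpdir, regex_accept=r".*\.csv$")
    no_csv = get_path_size(tmpdir, regex_reject=r".*\.csv$")
```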
@DeveloperAPI
def clear_data_cache():
"""Clears any cached data objects (e.g., embeddings)"""
load_glove.cache_clear()
@DeveloperAPI
def figure_data_format_dataset(dataset):
if isinstance(dataset, CacheableDataset):
return figure_data_format_dataset(dataset.unwrap())
elif isinstance(dataset, pd.DataFrame):
return pd.DataFrame
elif dd and isinstance(dataset, dd.DataFrame):
return dd.DataFrame
elif isinstance(dataset, dict):
return dict
elif isinstance(dataset, str):
dataset = dataset.strip()
if dataset.startswith("ludwig://"):
return "ludwig"
if dataset.startswith("hf://"):
return "hf"
dataset = dataset.lower()
if dataset.endswith(".csv"):
return "csv"
elif dataset.endswith(".tsv"):
return "tsv"
elif dataset.endswith(".json"):
return "json"
elif dataset.endswith(".jsonl"):
return "jsonl"
elif (
dataset.endswith(".xls")
or dataset.endswith(".xlsx")
or dataset.endswith(".xlsm")
or dataset.endswith(".xlsb")
or dataset.endswith(".odf")
or dataset.endswith(".ods")
or dataset.endswith(".odt")
):
return "excel"
elif dataset.endswith(".parquet"):
return "parquet"
elif dataset.endswith(".pickle") or dataset.endswith(".p"):
return "pickle"
elif dataset.endswith(".feather"):
return "feather"
elif dataset.endswith(".fwf"):
return "fwf"
elif dataset.endswith(".html"):
return "html"
elif dataset.endswith(".orc"):
return "orc"
elif dataset.endswith(".sas"):
return "sas"
elif dataset.endswith(".spss"):
return "spss"
elif dataset.endswith(".dta") or dataset.endswith(".stata"):
return "stata"
elif dataset.endswith(".h5") or dataset.endswith(".hdf5"):
return "hdf5"
else:
raise ValueError(f"Dataset path string {dataset} does not contain a valid extension")
else:
raise ValueError(f"Cannot figure out the format of dataset {dataset}")
@DeveloperAPI
def figure_data_format(dataset=None, training_set=None, validation_set=None, test_set=None):
if dataset is not None:
data_format = figure_data_format_dataset(dataset)
elif training_set is not None:
data_formats = [figure_data_format_dataset(training_set)]
if validation_set is not None:
data_formats.append(figure_data_format_dataset(validation_set))
if test_set is not None:
data_formats.append(figure_data_format_dataset(test_set))
data_formats_set = set(data_formats)
if len(data_formats_set) > 1:
error_message = "Datasets have different formats. Training: "
error_message += str(data_formats[0])
if validation_set is not None:
error_message += ", Validation: "
error_message += str(data_formats[1])
if test_set is not None:
error_message += ", Test: "
error_message += str(data_formats[-1])
raise ValueError(error_message)
else:
data_format = next(iter(data_formats_set))
else:
raise ValueError("At least one between dataset and training_set must be not None")
return data_format
@DeveloperAPI
def is_model_dir(path: str) -> bool:
hyperparameters_fn = os.path.join(path, MODEL_HYPERPARAMETERS_FILE_NAME)
ts_metadata_fn = os.path.join(path, TRAIN_SET_METADATA_FILE_NAME)
is_a_model_dir = False
if os.path.isdir(path) and os.path.isfile(hyperparameters_fn) and os.path.isfile(ts_metadata_fn):
weights_files_count = 0
for file_name in os.listdir(path):
if file_name.startswith(MODEL_WEIGHTS_FILE_NAME):
weights_files_count += 1
if weights_files_count >= 2:
is_a_model_dir = True
return is_a_model_dir
@DeveloperAPI
def ndarray2string(parm_array):
# convert numpy.ndarray to ludwig custom string format
if isinstance(parm_array, np.ndarray):
return f"__ndarray__{json.dumps(parm_array.tolist())}"
else:
raise ValueError(f"Argument must be numpy.ndarray. Instead argument found to be {type(parm_array)}")
@DeveloperAPI
def string2ndarray(parm_string):
# convert ludwig custom ndarray string to numpy.ndarray
if isinstance(parm_string, str) and parm_string[:11] == "__ndarray__":
return np.array(json.loads(parm_string[11:]))
else:
raise ValueError("Argument must be Ludwig custom string format for numpy.ndarray")
@DeveloperAPI
def is_ludwig_ndarray_string(parm_string):
# tests if parameter is a Ludwig custom ndarray string
return isinstance(parm_string, str) and parm_string[:11] == "__ndarray__"
@DeveloperAPI
def get_pa_dtype(obj: Any):
if np.isscalar(obj):
return pa.from_numpy_dtype(np.array(obj).dtype)
elif isinstance(obj, np.ndarray) or isinstance(obj, list) or isinstance(obj, tuple):
return pa.list_(get_pa_dtype(obj[0]))
else:
raise ValueError(f"Unsupported type for pyarrow dtype: {type(obj)}")
@DeveloperAPI
def get_pa_schema(df: DataFrame):
"""Gets the pyarrow schema associated with a given DataFrame.
This will fail in very specific conditions worth enumerating:
1. If the DataFrame is a Dask DataFrame which has a partition of size 1 and its only sample is a NaN, then the
`schema` dict will not contain the associated key. The value in this case will be inferred (likely incorrectly)
as a float64 downstream.
2. If the DataFrame contains NaNs in some column and the presence of NaNs changes the overall dtype of the column.
For example, if a number feature column contains some NaN-like value, then its dtype will be changed by the
below `fillna` call from float32 to float64. This will cause `to_parquet` to fail downstream.
"""
head = df.head(100)
schema = {}
for k, v in head.items():
if sum(v.isna()) > 0:
v = v.fillna(np.nan).replace([np.nan], [None]) # Only fill NaNs if they are present
v = v.values
for val in v:
if val is not None and k not in schema:
schema[k] = get_pa_dtype(val)
break
return pa.schema(list(schema.items()))
data_reader_registry = {
**{fmt: read_csv for fmt in CSV_FORMATS},
**{fmt: read_tsv for fmt in TSV_FORMATS},
**{fmt: read_json for fmt in JSON_FORMATS},
**{fmt: read_jsonl for fmt in JSONL_FORMATS},
**{fmt: read_excel for fmt in EXCEL_FORMATS},
**{fmt: read_parquet for fmt in PARQUET_FORMATS},
**{fmt: read_pickle for fmt in PICKLE_FORMATS},
**{fmt: read_fwf for fmt in FWF_FORMATS},
**{fmt: read_feather for fmt in FEATHER_FORMATS},
**{fmt: read_html for fmt in HTML_FORMATS},
**{fmt: read_orc for fmt in ORC_FORMATS},
**{fmt: read_sas for fmt in SAS_FORMATS},
**{fmt: read_spss for fmt in SPSS_FORMATS},
**{fmt: read_stata for fmt in STATA_FORMATS},
**{fmt: read_hdf5 for fmt in HDF5_FORMATS},
}
@DeveloperAPI
def load_dataset(dataset, data_format=None, df_lib=PANDAS_DF):
if not data_format or data_format == "auto":
data_format = figure_data_format(dataset)
# use appropriate reader to create dataframe
if data_format in DATAFRAME_FORMATS:
return dataset
elif data_format in DICT_FORMATS:
return pd.DataFrame(dataset)
elif data_format in CACHEABLE_FORMATS:
data_reader = get_from_registry(data_format, data_reader_registry)
return data_reader(dataset, df_lib)
else:
raise ValueError(f"{data_format} format is not supported")
@DeveloperAPI
@contextlib.contextmanager
def use_credentials(creds):
if creds is None:
with contextlib.nullcontext():
yield
return
# https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration
# This allows us to avoid having to plumb the `storage_options` kwargs through
# every remote FS call in Ludwig. This implementation is restricted to one thread
# in the process acquiring the lock at once.
with GLOBAL_CRED_LOCK:
with tempfile.TemporaryDirectory() as tmpdir:
fname = os.path.join(tmpdir, "conf.json")
with open(fname, "w", encoding="utf-8") as f:
json.dump(creds, f)
# Backup any existing credentials
old_conf = dict(**conf)
conf.clear()
set_conf_files(tmpdir, conf)
try:
yield
finally:
# Restore previous credentials
with open(fname, "w", encoding="utf-8") as f:
json.dump(old_conf, f)
conf.clear()
set_conf_files(tmpdir, conf)
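# Illustrative sketch (not part of Ludwig): the lock, backup, and restore pattern behind
# use_credentials, with a plain module-level dict standing in for the fsspec `conf` object.
# `_CONF`, `_LOCK`, and `use_conf` are hypothetical names; no files or fsspec calls are involved.

```python
import contextlib
import threading

_CONF = {"token": "default"}
_LOCK = threading.Lock()


@contextlib.contextmanager
def use_conf(overrides):
    if overrides is None:
        # Nothing to override: behave like a no-op context manager.
        yield
        return

    # Only one thread may swap the process-wide configuration at a time.
    with _LOCK:
        old_conf = dict(_CONF)
        _CONF.clear()
        _CONF.update(overrides)
        try:
            yield
        finally:
            # Restore the previous configuration even if the body raised.
            _CONF.clear()
            _CONF.update(old_conf)


with use_conf({"token": "temporary"}):
    inside = dict(_CONF)
after = dict(_CONF)

try:
    with use_conf({"token": "other"}):
        raise RuntimeError("boom")
except RuntimeError:
    pass
after_error = dict(_CONF)
```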
def get_sanitized_feature_name(feature_name: str) -> str:
"""Replaces parentheses, braces, brackets, periods, colons, and quotes with _.
Used in model config initialization and sanitize_column_names(), which is called during dataset building.
"""
return re.sub(r"[(){}.:\"'\[\]]", "_", feature_name)
def sanitize_column_names(df: DataFrame) -> DataFrame:
"""Renames df columns, replacing the characters handled by get_sanitized_feature_name() with _."""
safe_column_names = [get_sanitized_feature_name(col) for col in df.columns]
return df.rename(columns=dict(zip(df.columns, safe_column_names)))
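# Illustrative sketch (not part of Ludwig): the substitution in get_sanitized_feature_name on made-up
# feature names. Only the listed punctuation characters are replaced; spaces and other symbols are kept.

```python
import re


def get_sanitized_feature_name(feature_name: str) -> str:
    # Character class: ( ) { } . : " ' [ ]
    return re.sub(r"[(){}.:\"'\[\]]", "_", feature_name)


with_parens = get_sanitized_feature_name("price (usd).v1")
with_brackets = get_sanitized_feature_name("a[0]:b")
untouched = get_sanitized_feature_name("already_safe-name")
```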
================================================
FILE: ludwig/utils/dataframe_utils.py
================================================
from typing import Optional
import numpy as np
import pandas as pd
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import DASK_MODULE_NAME
from ludwig.data.dataframe.base import DataFrameEngine
from ludwig.utils.types import DataFrame
@DeveloperAPI
def is_dask_lib(df_lib) -> bool:
"""Returns whether the dataframe library is dask."""
return df_lib.__name__ == DASK_MODULE_NAME
@DeveloperAPI
def is_dask_backend(backend: Optional["Backend"]) -> bool: # noqa: F821
"""Returns whether the backend's dataframe is dask."""
return backend is not None and is_dask_lib(backend.df_engine.df_lib)
@DeveloperAPI
def is_dask_series_or_df(df: DataFrame, backend: Optional["Backend"]) -> bool: # noqa: F821
if is_dask_backend(backend):
import dask.dataframe as dd
return isinstance(df, dd.Series) or isinstance(df, dd.DataFrame)
return False
@DeveloperAPI
def flatten_df(df: DataFrame, df_engine: DataFrameEngine) -> tuple[DataFrame, dict[str, tuple]]: # noqa: F821
"""Returns a flattened dataframe with a dictionary of the original shapes, keyed by dataframe columns."""
# Workaround for: https://issues.apache.org/jira/browse/ARROW-5645
column_shapes = {}
for c in df.columns:
df = df_engine.persist(df)
shape = df_engine.compute(
df_engine.map_objects(
df[c],
lambda x: np.array(x).shape,
).max()
)
if len(shape) > 1:
column_shapes[c] = shape
df[c] = df_engine.map_objects(df[c], lambda x: np.array(x).reshape(-1))
return df, column_shapes
@DeveloperAPI
def unflatten_df(df: DataFrame, column_shapes: dict[str, tuple], df_engine: DataFrameEngine) -> DataFrame: # noqa: F821
"""Returns an unflattened dataframe, the reverse of flatten_df."""
for c in df.columns:
shape = column_shapes.get(c)
if shape:
df[c] = df_engine.map_objects(df[c], lambda x: np.array(x).reshape(shape))
return df
@DeveloperAPI
def to_numpy_dataset(df: DataFrame, backend: Optional["Backend"] = None) -> dict[str, np.ndarray]: # noqa: F821
"""Returns a dictionary of numpy arrays, keyed by the columns of the given dataframe."""
# Compute Dask DataFrames to pandas first to avoid issues with extension dtypes
# (e.g. TensorDtype) that Dask-expr's metadata system cannot handle.
if backend and is_dask_backend(backend):
df = backend.df_engine.compute(df)
dataset = {}
for col in df.columns:
if len(df.index) != 0:
dataset[col] = np.stack(df[col].to_numpy())
else:
# Dataframe is empty.
# Use to_list() directly, as np.stack() requires at least one array to stack.
dataset[col] = df[col].to_list()
return dataset
@DeveloperAPI
def from_numpy_dataset(dataset) -> pd.DataFrame:
"""Returns a pandas dataframe from the dataset."""
col_mapping = {}
for k, v in dataset.items():
if len(v.shape) > 1:
# unstacking, needed for ndarrays of dimension 2 and more
(*vals,) = v
else:
# not unstacking. Needed because otherwise pandas casts types
# the way it wants, like converting a list of float32 scalars
# to a column of float64
vals = v
col_mapping[k] = vals
return pd.DataFrame.from_dict(col_mapping)
@DeveloperAPI
def set_index_name(pd_df: pd.DataFrame, name: str) -> pd.DataFrame:
pd_df.index.name = name
return pd_df
@DeveloperAPI
def to_batches(df: pd.DataFrame, batch_size: int) -> list[pd.DataFrame]:
n_rows = len(df)
return [df[i : i + batch_size].copy() for i in range(0, n_rows, batch_size)]
@DeveloperAPI
def from_batches(batches: list[pd.DataFrame]) -> pd.DataFrame:
return pd.concat(batches)
@DeveloperAPI
def to_scalar_df(df: pd.DataFrame) -> pd.DataFrame:
"""Converts all columns in a pd.DataFrame to be scalar types.
For object columns of lists, each element of the list is expanded into its own column named {column}_{index}. We
assume all object columns are lists of the same length (i.e., tensor format output from preprocessing). It's also
important that the relative order of the columns is preserved, to maintain consistency with other conversions like
the one for Hummingbird.
"""
scalar_df = df
column_ordering = []
for c, s in df.items():
if s.dtype == "object":
s_list = s.to_list()
try:
ncols = s_list[0].shape[0]
split_cols = [f"{c}_{k}" for k in range(ncols)]
sdf = pd.DataFrame(s_list, columns=split_cols)
scalar_df = pd.concat([scalar_df, sdf], axis=1)
column_ordering += split_cols
except AttributeError as e:
raise ValueError(f"Expected series of lists, but found {s_list[0]}") from e
else:
column_ordering.append(c)
return scalar_df[column_ordering]
================================================
FILE: ludwig/utils/dataset_utils.py
================================================
import pandas as pd
from sklearn.model_selection import train_test_split
from ludwig.api_annotations import PublicAPI
from ludwig.constants import TEST_SPLIT, TRAIN_SPLIT, VALIDATION_SPLIT
from ludwig.data.dataset.base import Dataset
from ludwig.utils.defaults import default_random_seed
@PublicAPI
def get_repeatable_train_val_test_split(
df_input, stratify_colname="", random_seed=default_random_seed, frac_train=0.7, frac_val=0.1, frac_test=0.2
):
"""Return df_input with split column containing (if possible) non-zero rows in the train, validation, and test
data subset categories.
If the input dataframe does not contain an existing split column or if the
number of rows in both the validation and test split is 0 and non-empty
stratify_colname specified, return df_input with split column set according
to frac_ and stratify_colname.
Else stratify_colname is ignored, and:
If the input dataframe contains an existing split column and non-zero row
counts for all three split types, return df_input.
If the input dataframe contains an existing split column but only one of
validation and test split has non-zero row counts, return df_input with
missing split getting rows from train split as per frac_.
Parameters
----------
df_input : Pandas dataframe
Input dataframe to be split.
stratify_colname : str
The column used for stratification (if desired); usually the label column.
random_seed : int
Seed used to get repeatable split.
frac_train : float
frac_val : float
frac_test : float
The ratios with which to split the dataframe into train, val, and test data;
should sum to 1.0.
Returns
-------
df_split :
Dataframe containing the three splits.
"""
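# Compare with a tolerance: exact equality on floating-point sums can reject valid fractions.
if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9: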
raise ValueError(f"fractions {frac_train:f}, {frac_val:f}, {frac_test:f} do not add up to 1.0")
if stratify_colname:
do_stratify_split = True
if stratify_colname not in df_input.columns:
raise ValueError("%s is not a column in the dataframe" % (stratify_colname))
else:
do_stratify_split = False
if "split" not in df_input.columns:
df_input["split"] = 0 # set up for non-stratified split path
if "split" in df_input.columns:
df_train = df_input[df_input["split"] == TRAIN_SPLIT].copy()
df_val = df_input[df_input["split"] == VALIDATION_SPLIT].copy()
df_test = df_input[df_input["split"] == TEST_SPLIT].copy()
if not do_stratify_split or len(df_val) != 0 or len(df_test) != 0:
if len(df_val) == 0:
df_val = df_train.sample(frac=frac_val, replace=False, random_state=random_seed)
df_train = df_train.drop(df_val.index)
if len(df_test) == 0:
df_test = df_train.sample(frac=frac_test, replace=False, random_state=random_seed)
df_train = df_train.drop(df_test.index)
do_stratify_split = False
if do_stratify_split:
# Make sure the `stratify_colname` doesn't have any NaNs.
df_input = df_input[df_input[stratify_colname].notna()]
# Split original dataframe into train and temp dataframes.
y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.
df_train, df_temp, y_train, y_temp = train_test_split(
df_input, y, stratify=y, test_size=(1.0 - frac_train), random_state=random_seed
)
# Split the temp dataframe into val and test dataframes.
relative_frac_test = frac_test / (frac_val + frac_test)
df_val, df_test, y_val, y_test = train_test_split(
df_temp, y_temp, stratify=y_temp, test_size=relative_frac_test, random_state=random_seed
)
assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
df_train["split"] = TRAIN_SPLIT
df_val["split"] = VALIDATION_SPLIT
df_test["split"] = TEST_SPLIT
df_split = pd.concat([df_train, df_val, df_test], ignore_index=True)
return df_split
def generate_dataset_statistics(
training_set: Dataset,
validation_set: str | dict | pd.DataFrame | Dataset | None,
test_set: str | dict | pd.DataFrame | Dataset | None,
) -> list[tuple[str, int, int]]:
from ludwig.benchmarking.utils import format_memory
dataset_statistics = [
["Dataset", "Size (Rows)", "Size (In Memory)"],
["Training", len(training_set), format_memory(training_set.in_memory_size_bytes)],
]
if validation_set is not None:
dataset_statistics.append(
["Validation", len(validation_set), format_memory(validation_set.in_memory_size_bytes)]
)
if test_set is not None:
dataset_statistics.append(["Test", len(test_set), format_memory(test_set.in_memory_size_bytes)])
return dataset_statistics
================================================
FILE: ludwig/utils/date_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import time
from datetime import date, datetime, timezone
import numpy as np
from dateutil.parser import parse, ParserError
from ludwig.api_annotations import DeveloperAPI
SCALE_S = np.floor(np.log10(time.time()))
@DeveloperAPI
def create_vector_from_datetime_obj(datetime_obj):
yearday = datetime_obj.toordinal() - date(datetime_obj.year, 1, 1).toordinal() + 1
midnight = datetime_obj.replace(hour=0, minute=0, second=0, microsecond=0)
second_of_day = (datetime_obj - midnight).seconds
return [
datetime_obj.year,
datetime_obj.month,
datetime_obj.day,
datetime_obj.weekday(),
yearday,
datetime_obj.hour,
datetime_obj.minute,
datetime_obj.second,
second_of_day,
]
@DeveloperAPI
def parse_datetime(timestamp: float | int | str) -> datetime:
"""Parse a datetime from a string or a numeric timestamp.
Args:
timestamp: A datetime string or numeric timestamp.
Returns:
A datetime representation of `timestamp`.
"""
try:
dt = parse(timestamp)
except (OverflowError, ParserError, TypeError):
dt = convert_number_to_datetime(timestamp)
return dt
@DeveloperAPI
def convert_number_to_datetime(timestamp: float | int | str) -> datetime:
"""Convert a numeric timestamp to a datetime object.
`datetime` objects can be created from POSIX timestamps like those returned by `time.time()`.
Args:
timestamp: A numeric timestamp.
Returns:
A datetime representation of `timestamp`.
Raises:
ValueError: Raised if `timestamp` is not a number or not a valid datetime.
"""
try:
timestamp = float(timestamp)
except (TypeError, ValueError) as e:
raise ValueError(f"Provided value {timestamp} is not a valid numeric timestamp") from e
# Determine the unit of the timestamp
ts_scale = np.floor(np.log10(timestamp))
# `datetime.datetime.fromtimestamp` expects a timestamp in seconds. Rescale the timestamp if it is not in seconds.
if SCALE_S < ts_scale:
delta = ts_scale - SCALE_S
timestamp = timestamp / np.power(10, delta)
# Convert the timestamp to a datetime object. If it is not a valid timestamp, `ValueError` is raised.
dt = datetime.fromtimestamp(timestamp, tz=timezone.utc).replace(tzinfo=None)
return dt
================================================
FILE: ludwig/utils/defaults.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import copy
import logging
import yaml
from ludwig.api_annotations import DeveloperAPI
from ludwig.contrib import add_contrib_callback_args
from ludwig.features.feature_registries import get_input_type_registry
from ludwig.globals import LUDWIG_VERSION
from ludwig.schema.model_config import ModelConfig
from ludwig.schema.preprocessing import PreprocessingConfig
from ludwig.utils.backward_compatibility import upgrade_config_dict_to_latest_version
from ludwig.utils.data_utils import load_config_from_str, load_yaml
from ludwig.utils.fs_utils import open_file
from ludwig.utils.print_utils import print_ludwig
logger = logging.getLogger(__name__)
default_random_seed = 42
# Still needed for preprocessing TODO(Connor): Refactor ludwig/data/preprocessing to use schema
# TODO(travis): remove this, make type a protected string for each subclass
default_feature_specific_preprocessing_parameters = {
name: preproc_sect.get_schema_cls()(name="__tmp__", type=name).preprocessing.to_dict()
for name, preproc_sect in get_input_type_registry().items()
}
default_training_preprocessing_parameters = copy.deepcopy(default_feature_specific_preprocessing_parameters)
default_training_preprocessing_parameters.update(PreprocessingConfig().to_dict())
default_prediction_preprocessing_parameters = copy.deepcopy(default_feature_specific_preprocessing_parameters)
@DeveloperAPI
def render_config(config=None, output=None, **kwargs):
upgraded_config = upgrade_config_dict_to_latest_version(config)
output_config = ModelConfig.from_dict(upgraded_config).to_dict()
if output is None:
print(yaml.safe_dump(output_config, None, sort_keys=False))
else:
with open_file(output, "w") as f:
yaml.safe_dump(output_config, f, sort_keys=False)
@DeveloperAPI
def cli_render_config(sys_argv):
parser = argparse.ArgumentParser(
description="This script renders the full config from a user config.",
prog="ludwig render_config",
usage="%(prog)s [options]",
)
parser.add_argument(
"-c",
"--config",
type=load_yaml,
help="Path to the YAML file containing the model configuration",
)
parser.add_argument(
"-cs",
"--config_str",
dest="config",
type=load_config_from_str,
help="JSON or YAML serialized string of the model configuration",
)
parser.add_argument(
"-o",
"--output",
type=str,
help="output rendered YAML config path",
required=False,
)
add_contrib_callback_args(parser)
args = parser.parse_args(sys_argv)
args.callbacks = args.callbacks or []
for callback in args.callbacks:
callback.on_cmdline("render_config", *sys_argv)
print_ludwig("Render Config", LUDWIG_VERSION)
render_config(**vars(args))
================================================
FILE: ludwig/utils/entmax/LICENSE
================================================
MIT License
Copyright (c) 2019 DeepSPIN
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: ludwig/utils/entmax/README.md
================================================
# entmax
______________________________________________________________________
This package provides a pytorch implementation of entmax and entmax losses:
a sparse family of probability mappings and corresponding loss functions,
generalizing softmax / cross-entropy.
*Features:*
- Exact partial-sort algorithms for 1.5-entmax and 2-entmax (sparsemax).
- A bisection-based algorithm for generic alpha-entmax.
- Gradients w.r.t. alpha for adaptive, learned sparsity!
*Requirements:* python 3, pytorch >= 1.0 (and pytest for unit tests)
## Example
```python
import torch
from torch.nn.functional import softmax
from entmax import sparsemax, entmax15
x = torch.tensor([-2, 0, 0.5])
print(softmax(x, dim=0))
# tensor([0.0486, 0.3592, 0.5922])
print(sparsemax(x, dim=0))
# tensor([0.0000, 0.2500, 0.7500])
print(entmax15(x, dim=0))
# tensor([0.0000, 0.3260, 0.6740])
```
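To make the sparsemax numbers above easier to check by hand, here is a minimal pure-Python sketch of the sort-based threshold computation described by Martins and Astudillo (2016). It is illustrative only and is not this package's implementation; the name `sparsemax_reference` is hypothetical.

```python
def sparsemax_reference(scores):
    """Project a list of scores onto the probability simplex (sparsemax)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for rho, z_j in enumerate(z, start=1):
        cumsum += z_j
        # z_j belongs to the support while 1 + rho * z_j exceeds the sum of the top-rho scores.
        if 1 + rho * z_j > cumsum:
            tau = (cumsum - 1) / rho
    # Shift by the threshold and clip negative values to zero.
    return [max(s - tau, 0.0) for s in scores]


print(sparsemax_reference([-2, 0, 0.5]))
# [0.0, 0.25, 0.75]
```

For `[-2, 0, 0.5]` the support is the two largest scores and the threshold is `-0.25`, which reproduces the `sparsemax` output shown above.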
Gradients w.r.t. alpha (continued):
```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect
x = torch.tensor([[-1, 0, 0.5], [1, 2, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)
p = entmax_bisect(x, alpha)
print(p)
# tensor([[0.0460, 0.3276, 0.6264],
# [0.0026, 0.1012, 0.8963]], grad_fn=<...>)
print(grad(p[0, 0], alpha))
# (tensor(-0.2562),)
```
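The bisection idea behind generic alpha-entmax can be sketched in a few lines of pure Python. This is a simplified illustration under the assumption that the solution has the form `p_i = max((alpha - 1) * x_i - tau, 0) ** (1 / (alpha - 1))`, with `tau` chosen so the entries sum to one; it is not the code behind `entmax_bisect`, and the name `entmax_bisect_reference` is hypothetical.

```python
def entmax_bisect_reference(scores, alpha=1.5, n_iter=50):
    """Approximate alpha-entmax (alpha > 1) of a list of scores by bisection on tau."""
    if alpha <= 1:
        raise ValueError("this sketch requires alpha > 1")
    scaled = [(alpha - 1) * s for s in scores]
    exponent = 1.0 / (alpha - 1)

    def probs(tau):
        return [max(s - tau, 0.0) ** exponent for s in scaled]

    # At tau_lo the largest entry alone has probability 1, so the sum is >= 1.
    # At tau_hi every entry is at most 1/d, so the sum is <= 1.
    d = len(scaled)
    tau_lo = max(scaled) - 1.0
    tau_hi = max(scaled) - (1.0 / d) ** (alpha - 1)
    for _ in range(n_iter):
        tau_mid = (tau_lo + tau_hi) / 2
        if sum(probs(tau_mid)) >= 1.0:
            tau_lo = tau_mid
        else:
            tau_hi = tau_mid

    # Renormalize to remove the small residual error left by the finite number of iterations.
    p = probs(tau_lo)
    total = sum(p)
    return [p_i / total for p_i in p]


print([round(p_i, 4) for p_i in entmax_bisect_reference([-2, 0, 0.5], alpha=1.5)])
```

With `alpha=1.5` the result should agree with the `entmax15` output in the first example to about four decimal places, and with `alpha=2` it should approach the sparsemax result.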
## Installation
```
pip install entmax
```
## Citations
[Sparse Sequence-to-Sequence Models](https://www.aclweb.org/anthology/P19-1146)
```
@inproceedings{entmax,
author = {Peters, Ben and Niculae, Vlad and Martins, Andr{\'e} FT},
title = {Sparse Sequence-to-Sequence Models},
booktitle = {Proc. ACL},
year = {2019},
url = {https://www.aclweb.org/anthology/P19-1146}
}
```
[Adaptively Sparse Transformers](https://arxiv.org/pdf/1909.00015.pdf)
```
@inproceedings{correia19adaptively,
author = {Correia, Gon\c{c}alo M and Niculae, Vlad and Martins, Andr{\'e} FT},
title = {Adaptively Sparse Transformers},
booktitle = {Proc. EMNLP-IJCNLP (to appear)},
year = {2019},
}
```
Further reading:
- Blondel, Martins, and Niculae, 2019. [Learning with Fenchel-Young Losses](https://arxiv.org/abs/1901.02324).
- Martins and Astudillo, 2016. [From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification](https://arxiv.org/abs/1602.02068).
- Peters and Martins, 2019. [IT-IST at the SIGMORPHON 2019 Shared Task: Sparse Two-headed Models for Inflection](https://www.aclweb.org/anthology/W19-4207).
================================================
FILE: ludwig/utils/entmax/__init__.py
================================================
__version__ = "1.1.dev0"
from ludwig.utils.entmax.activations import entmax15, Entmax15, sparsemax, Sparsemax
from ludwig.utils.entmax.losses import (
entmax15_loss,
Entmax15Loss,
entmax_bisect_loss,
EntmaxBisectLoss,
sparsemax_bisect_loss,
sparsemax_loss,
SparsemaxBisectLoss,
SparsemaxLoss,
)
from ludwig.utils.entmax.root_finding import entmax_bisect, EntmaxBisect, sparsemax_bisect, SparsemaxBisect
__all__ = [
"entmax15",
"Entmax15",
"sparsemax",
"Sparsemax",
"entmax15_loss",
"Entmax15Loss",
"entmax_bisect_loss",
"EntmaxBisectLoss",
"sparsemax_bisect_loss",
"sparsemax_loss",
"SparsemaxBisectLoss",
"SparsemaxLoss",
"entmax_bisect",
"EntmaxBisect",
"sparsemax_bisect",
"SparsemaxBisect",
]
================================================
FILE: ludwig/utils/entmax/activations.py
================================================
"""An implementation of entmax (Peters et al., 2019). See https://arxiv.org/pdf/1905.05702 for detailed
description.
This builds on previous work with sparsemax (Martins & Astudillo, 2016). See https://arxiv.org/pdf/1602.02068.
"""
# Author: Ben Peters
# Author: Vlad Niculae
# License: MIT
import torch
import torch.nn as nn
from torch.autograd import Function
def _make_ix_like(X, dim):
d = X.size(dim)
rho = torch.arange(1, d + 1, device=X.device, dtype=X.dtype)
view = [1] * X.dim()
view[0] = -1
return rho.view(view).transpose(0, dim)
def _roll_last(X, dim):
if dim == -1:
return X
elif dim < 0:
dim = X.dim() + dim  # convert a negative dim into its non-negative equivalent
perm = [i for i in range(X.dim()) if i != dim] + [dim]
return X.permute(perm)
def _sparsemax_threshold_and_support(X, dim=-1, k=None):
"""Core computation for sparsemax: optimal threshold and support size.
Parameters
----------
X : torch.Tensor
The input tensor to compute thresholds over.
dim : int
The dimension along which to apply sparsemax.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
Returns
-------
tau : torch.Tensor like `X`, with all but the `dim` dimension intact
the threshold value for each vector
support_size : torch LongTensor, shape like `tau`
the number of nonzeros in each vector.
"""
if k is None or k >= X.shape[dim]: # do full sort
topk, _ = torch.sort(X, dim=dim, descending=True)
else:
topk, _ = torch.topk(X, k=k, dim=dim)
topk_cumsum = topk.cumsum(dim) - 1
rhos = _make_ix_like(topk, dim)
support = rhos * topk > topk_cumsum
support_size = support.sum(dim=dim).unsqueeze(dim)
tau = topk_cumsum.gather(dim, support_size - 1)
tau /= support_size.to(X.dtype)
if k is not None and k < X.shape[dim]:
unsolved = (support_size == k).squeeze(dim)
if torch.any(unsolved):
in_ = _roll_last(X, dim)[unsolved]
tau_, ss_ = _sparsemax_threshold_and_support(in_, dim=-1, k=2 * k)
_roll_last(tau, dim)[unsolved] = tau_
_roll_last(support_size, dim)[unsolved] = ss_
return tau, support_size
def _entmax_threshold_and_support(X, dim=-1, k=None):
"""Core computation for 1.5-entmax: optimal threshold and support size.
Parameters
----------
X : torch.Tensor
The input tensor to compute thresholds over.
dim : int
The dimension along which to apply 1.5-entmax.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
Returns
-------
tau : torch.Tensor like `X`, with all but the `dim` dimension intact
the threshold value for each vector
support_size : torch LongTensor, shape like `tau`
the number of nonzeros in each vector.
"""
if k is None or k >= X.shape[dim]: # do full sort
Xsrt, _ = torch.sort(X, dim=dim, descending=True)
else:
Xsrt, _ = torch.topk(X, k=k, dim=dim)
rho = _make_ix_like(Xsrt, dim)
mean = Xsrt.cumsum(dim) / rho
mean_sq = (Xsrt**2).cumsum(dim) / rho
ss = rho * (mean_sq - mean**2)
delta = (1 - ss) / rho
# NOTE this is not exactly the same as in reference algo
# Fortunately it seems the clamped values never wrongly
# get selected by tau <= sorted_z. Prove this!
delta_nz = torch.clamp(delta, 0)
tau = mean - torch.sqrt(delta_nz)
support_size = (tau <= Xsrt).sum(dim).unsqueeze(dim)
tau_star = tau.gather(dim, support_size - 1)
if k is not None and k < X.shape[dim]:
unsolved = (support_size == k).squeeze(dim)
if torch.any(unsolved):
X_ = _roll_last(X, dim)[unsolved]
tau_, ss_ = _entmax_threshold_and_support(X_, dim=-1, k=2 * k)
_roll_last(tau_star, dim)[unsolved] = tau_
_roll_last(support_size, dim)[unsolved] = ss_
return tau_star, support_size
class SparsemaxFunction(Function):
@classmethod
def forward(cls, ctx, X, dim=-1, k=None):
ctx.dim = dim
output, backwards_kwargs = _sparsemax_forward(X, dim, k)
ctx.save_for_backward(backwards_kwargs["supp_size"], output)
return output
@classmethod
def backward(cls, ctx, grad_output):
supp_size, output = ctx.saved_tensors
dim = ctx.dim
grad_input = grad_output.clone()
grad_input[output == 0] = 0
v_hat = grad_input.sum(dim=dim) / supp_size.to(output.dtype).squeeze(dim)
v_hat = v_hat.unsqueeze(dim)
grad_input = torch.where(output != 0, grad_input - v_hat, grad_input)
return grad_input, None, None
def _sparsemax_forward(X, dim, k):
max_val, _ = X.max(dim=dim, keepdim=True)
X = X - max_val # same numerical stability trick as softmax
tau, supp_size = _sparsemax_threshold_and_support(X, dim=dim, k=k)
output = torch.clamp(X - tau, min=0)
return output, {"supp_size": supp_size}
class Entmax15Function(Function):
@classmethod
def forward(cls, ctx, X, dim=0, k=None):
ctx.dim = dim
Y, _ = _entmax15_forward(X, dim, k)
ctx.save_for_backward(Y)
return Y
@classmethod
def backward(cls, ctx, dY):
(Y,) = ctx.saved_tensors
gppr = Y.sqrt() # = 1 / g'' (Y)
dX = dY * gppr
q = dX.sum(ctx.dim) / gppr.sum(ctx.dim)
q = q.unsqueeze(ctx.dim)
dX -= q * gppr
return dX, None, None
def _entmax15_forward(X, dim, k):
max_val, _ = X.max(dim=dim, keepdim=True)
X = X - max_val # same numerical stability trick as for softmax
X = X / 2 # divide by 2 to solve actual Entmax
tau_star, _ = _entmax_threshold_and_support(X, dim=dim, k=k)
Y = torch.clamp(X - tau_star, min=0) ** 2
return Y, {}
def sparsemax(X, dim=-1, k=None, training=True):
"""sparsemax: normalizing sparse transform (a la softmax).
Solves the projection:
min_p ||x - p||_2 s.t. p >= 0, sum(p) == 1.
Parameters
----------
X : torch.Tensor
The input tensor.
dim : int
The dimension along which to apply sparsemax.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
Returns
-------
P : torch tensor, same shape as X
The projection result, such that P.sum(dim=dim) == 1 elementwise.
"""
# Avoids call to custom autograd.Function during eval to ensure torchscript compatibility
# custom autograd.Function is not scriptable: https://github.com/pytorch/pytorch/issues/22329#issuecomment-506608053
if not training:
output, _ = _sparsemax_forward(X, dim, k)
return output
return SparsemaxFunction.apply(X, dim, k)
def entmax15(X, dim=-1, k=None, training=True):
"""1.5-entmax: normalizing sparse transform (a la softmax).
Solves the optimization problem:
max_p p^T x + H_1.5(p) s.t. p >= 0, sum(p) == 1.
where H_1.5(p) is the Tsallis alpha-entropy with alpha=1.5.
Parameters
----------
X : torch.Tensor
The input tensor.
dim : int
The dimension along which to apply 1.5-entmax.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
Returns
-------
P : torch tensor, same shape as X
The projection result, such that P.sum(dim=dim) == 1 elementwise.
"""
# Avoids call to custom autograd.Function during eval to ensure torchscript compatibility
# custom autograd.Function is not scriptable: https://github.com/pytorch/pytorch/issues/22329#issuecomment-506608053
if not training:
output, _ = _entmax15_forward(X, dim, k)
return output
return Entmax15Function.apply(X, dim, k)
class Sparsemax(nn.Module):
def __init__(self, dim=-1, k=None):
"""sparsemax: normalizing sparse transform (a la softmax).
Solves the projection:
min_p ||x - p||_2 s.t. p >= 0, sum(p) == 1.
Parameters
----------
dim : int
The dimension along which to apply sparsemax.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
"""
self.dim = dim
self.k = k
super().__init__()
def forward(self, X):
return sparsemax(X, dim=self.dim, k=self.k, training=self.training)
class Entmax15(nn.Module):
def __init__(self, dim=-1, k=None):
"""1.5-entmax: normalizing sparse transform (a la softmax).
Solves the optimization problem:
max_p p^T x + H_1.5(p) s.t. p >= 0, sum(p) == 1.
where H_1.5(p) is the Tsallis alpha-entropy with alpha=1.5.
Parameters
----------
dim : int
The dimension along which to apply 1.5-entmax.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
"""
self.dim = dim
self.k = k
super().__init__()
def forward(self, X):
return entmax15(X, dim=self.dim, k=self.k, training=self.training)
================================================
FILE: ludwig/utils/entmax/losses.py
================================================
import torch
import torch.nn as nn
from torch.autograd import Function
from ludwig.constants import IGNORE_INDEX_TOKEN_ID
from ludwig.utils.entmax.activations import entmax15, sparsemax
from ludwig.utils.entmax.root_finding import entmax_bisect, sparsemax_bisect
class _GenericLoss(nn.Module):
def __init__(self, ignore_index=IGNORE_INDEX_TOKEN_ID, reduction="elementwise_mean"):
assert reduction in ["elementwise_mean", "sum", "none"]
self.reduction = reduction
self.ignore_index = ignore_index
super().__init__()
def forward(self, X, target):
loss = self.loss(X, target)
if self.ignore_index >= 0:
ignored_positions = target == self.ignore_index
size = (target.size(0) - ignored_positions.sum()).item()
loss.masked_fill_(ignored_positions, 0.0)
else:
size = target.size(0)
if self.reduction == "sum":
loss = loss.sum()
elif self.reduction == "elementwise_mean":
if size == 0:
# Returns zero loss and zero gradient in the rare case that all row targets are ignored.
loss = loss.sum() * 0.0
else:
loss = loss.sum() / float(size)
return loss
class _GenericLossFunction(Function):
@classmethod
def forward(cls, ctx, X, target, alpha, proj_args):
"""X (FloatTensor): n x num_classes target (LongTensor): n, the indices of the target classes."""
assert X.shape[0] == target.shape[0]
p_star = cls.project(X, alpha, **proj_args)
loss = cls.omega(p_star, alpha)
p_star.scatter_add_(1, target.unsqueeze(1), torch.full_like(p_star, -1))
loss += torch.einsum("ij,ij->i", p_star, X)
ctx.save_for_backward(p_star)
return loss
@classmethod
def backward(cls, ctx, grad_output):
(p_star,) = ctx.saved_tensors
grad = grad_output.unsqueeze(1) * p_star
ret = (grad,)
# pad with as many Nones as needed
return ret + (None,) * (1 + cls.n_fwd_args)
class SparsemaxLossFunction(_GenericLossFunction):
n_fwd_args = 1
@classmethod
def project(cls, X, alpha, k):
return sparsemax(X, dim=-1, k=k)
@classmethod
def omega(cls, p_star, alpha):
return (1 - (p_star**2).sum(dim=1)) / 2
@classmethod
def forward(cls, ctx, X, target, k=None):
return super().forward(ctx, X, target, alpha=2, proj_args=dict(k=k))
class SparsemaxBisectLossFunction(_GenericLossFunction):
n_fwd_args = 1
@classmethod
def project(cls, X, alpha, n_iter):
return sparsemax_bisect(X, n_iter=n_iter)
@classmethod
def omega(cls, p_star, alpha):
return (1 - (p_star**2).sum(dim=1)) / 2
@classmethod
def forward(cls, ctx, X, target, n_iter=50):
return super().forward(ctx, X, target, alpha=2, proj_args=dict(n_iter=n_iter))
class Entmax15LossFunction(_GenericLossFunction):
n_fwd_args = 1
@classmethod
def project(cls, X, alpha, k=None):
return entmax15(X, dim=-1, k=k)
@classmethod
def omega(cls, p_star, alpha):
return (1 - (p_star * torch.sqrt(p_star)).sum(dim=1)) / 0.75
@classmethod
def forward(cls, ctx, X, target, k=None):
return super().forward(ctx, X, target, alpha=1.5, proj_args=dict(k=k))
class EntmaxBisectLossFunction(_GenericLossFunction):
n_fwd_args = 2
@classmethod
def project(cls, X, alpha, n_iter):
return entmax_bisect(X, alpha=alpha, n_iter=n_iter, ensure_sum_one=True)
@classmethod
def omega(cls, p_star, alpha):
return (1 - (p_star**alpha).sum(dim=1)) / (alpha * (alpha - 1))
@classmethod
def forward(cls, ctx, X, target, alpha=1.5, n_iter=50):
return super().forward(ctx, X, target, alpha, proj_args=dict(n_iter=n_iter))
def sparsemax_loss(X, target, k=None):
"""sparsemax loss: sparse alternative to cross-entropy.
Computed using a partial sorting strategy.
Parameters
----------
X : torch.Tensor, shape=(n_samples, n_classes)
The input 2D tensor of predicted scores
target : torch.LongTensor, shape=(n_samples,)
The ground truth labels, 0 <= target < n_classes.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
Returns
-------
losses, torch.Tensor, shape=(n_samples,)
The loss incurred at each sample.
"""
return SparsemaxLossFunction.apply(X, target, k)
def sparsemax_bisect_loss(X, target, n_iter=50):
"""sparsemax loss: sparse alternative to cross-entropy.
Computed using bisection.
Parameters
----------
X : torch.Tensor, shape=(n_samples, n_classes)
The input 2D tensor of predicted scores
target : torch.LongTensor, shape=(n_samples,)
The ground truth labels, 0 <= target < n_classes.
n_iter : int
Number of bisection iterations. For float32, 24 iterations should
suffice for machine precision.
Returns
-------
losses, torch.Tensor, shape=(n_samples,)
The loss incurred at each sample.
"""
return SparsemaxBisectLossFunction.apply(X, target, n_iter)
def entmax15_loss(X, target, k=None):
"""1.5-entmax loss: sparse alternative to cross-entropy
Computed using a partial sorting strategy.
Parameters
----------
X : torch.Tensor, shape=(n_samples, n_classes)
The input 2D tensor of predicted scores
target : torch.LongTensor, shape=(n_samples,)
The ground truth labels, 0 <= target < n_classes.
k : int or None
number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
nonzeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule.
If `None`, full sorting is performed from the beginning.
Returns
-------
losses, torch.Tensor, shape=(n_samples,)
The loss incurred at each sample.
"""
return Entmax15LossFunction.apply(X, target, k)
def entmax_bisect_loss(X, target, alpha=1.5, n_iter=50):
"""alpha-entmax loss: sparse alternative to cross-entropy.
Computed using bisection, supporting arbitrary alpha > 1.
Parameters
----------
X : torch.Tensor, shape=(n_samples, n_classes)
The input 2D tensor of predicted scores
target : torch.LongTensor, shape=(n_samples,)
The ground truth labels, 0 <= target < n_classes.
alpha : float or torch.Tensor
Tensor of alpha parameters (> 1) to use for each row of X. If scalar
or python float, the same value is used for all rows. A value of
alpha=2 corresponds to sparsemax, and alpha=1 would in theory recover
softmax. For numeric reasons, this algorithm does not work with `alpha=1`:
if you want softmax, we recommend `torch.nn.functional.softmax`.
n_iter : int
Number of bisection iterations. For float32, 24 iterations should
suffice for machine precision.
Returns
-------
losses : torch.Tensor, shape=(n_samples,)
The loss incurred at each sample.
"""
return EntmaxBisectLossFunction.apply(X, target, alpha, n_iter)
class SparsemaxBisectLoss(_GenericLoss):
def __init__(self, n_iter=50, ignore_index=IGNORE_INDEX_TOKEN_ID, reduction="elementwise_mean"):
self.n_iter = n_iter
super().__init__(ignore_index, reduction)
def loss(self, X, target):
return sparsemax_bisect_loss(X, target, self.n_iter)
class SparsemaxLoss(_GenericLoss):
def __init__(self, k=None, ignore_index=IGNORE_INDEX_TOKEN_ID, reduction="elementwise_mean"):
self.k = k
super().__init__(ignore_index, reduction)
def loss(self, X, target):
return sparsemax_loss(X, target, self.k)
class EntmaxBisectLoss(_GenericLoss):
def __init__(
self,
alpha=1.5,
n_iter=50,
ignore_index=IGNORE_INDEX_TOKEN_ID,
reduction="elementwise_mean",
):
self.alpha = alpha
self.n_iter = n_iter
super().__init__(ignore_index, reduction)
def loss(self, X, target):
return entmax_bisect_loss(X, target, self.alpha, self.n_iter)
class Entmax15Loss(_GenericLoss):
def __init__(self, k=100, ignore_index=IGNORE_INDEX_TOKEN_ID, reduction="elementwise_mean"):
self.k = k
super().__init__(ignore_index, reduction)
def loss(self, X, target):
return entmax15_loss(X, target, self.k)
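# Illustrative sketch (not part of Ludwig): the sparsemax loss for a single sample in plain
# Python, assuming the standard Fenchel-Young form <z, p - e_target> + 0.5 * (1 - ||p||^2).
# The helper names `sparsemax_1d` and `sparsemax_loss_1d` are hypothetical; the autograd
# functions above work on batched tensors and may differ in detail.

```python
def sparsemax_1d(z):
    """Sparsemax of a 1-D list via full sorting (illustrative helper, not Ludwig API)."""
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(sorted(z, reverse=True), start=1):
        cumsum += v
        if 1 + k * v > cumsum:  # the k-th largest score is still in the support
            tau = (cumsum - 1) / k
    return [max(v - tau, 0.0) for v in z]


def sparsemax_loss_1d(z, target):
    """Loss for one sample: <z, p - e_target> + 0.5 * (1 - ||p||^2)."""
    p = sparsemax_1d(z)
    dot = sum(zi * (pi - (1.0 if i == target else 0.0)) for i, (zi, pi) in enumerate(zip(z, p)))
    return dot + 0.5 * (1.0 - sum(pi * pi for pi in p))
```

# The loss is zero when all probability mass already sits on the target class, which is the
# property that makes it a sparse counterpart to cross-entropy.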
================================================
FILE: ludwig/utils/entmax/root_finding.py
================================================
"""Bisection implementation of alpha-entmax (Peters et al., 2019).
Backward pass wrt alpha per (Correia et al., 2019). See https://arxiv.org/pdf/1905.05702 for detailed description.
"""
# Author: Goncalo M Correia
# Author: Ben Peters
# Author: Vlad Niculae
import torch
import torch.nn as nn
from torch.autograd import Function
class EntmaxBisectFunction(Function):
@classmethod
def _gp(cls, x, alpha):
return x ** (alpha - 1)
@classmethod
def _gp_inv(cls, y, alpha):
return y ** (1 / (alpha - 1))
@classmethod
def _p(cls, X, alpha):
return cls._gp_inv(torch.clamp(X, min=0), alpha)
@classmethod
def forward(cls, ctx, X, alpha=1.5, dim=-1, n_iter=50, ensure_sum_one=True):
p_m, backward_kwargs = _entmax_bisect_forward(X, alpha, dim, n_iter, ensure_sum_one, cls)
ctx.alpha = backward_kwargs["alpha"]
ctx.dim = backward_kwargs["dim"]
ctx.save_for_backward(p_m)
return p_m
@classmethod
def backward(cls, ctx, dY):
(Y,) = ctx.saved_tensors
gppr = torch.where(Y > 0, Y ** (2 - ctx.alpha), Y.new_zeros(1))
dX = dY * gppr
q = dX.sum(ctx.dim) / gppr.sum(ctx.dim)
q = q.unsqueeze(ctx.dim)
dX -= q * gppr
d_alpha = None
if ctx.needs_input_grad[1]:
# alpha gradient computation
# d_alpha = (partial_y / partial_alpha) * dY
# NOTE: ensure alpha is not close to 1
# since there is an indetermination
# batch_size, _ = dY.shape
# shannon terms
S = torch.where(Y > 0, Y * torch.log(Y), Y.new_zeros(1))
# shannon entropy
ent = S.sum(ctx.dim).unsqueeze(ctx.dim)
Y_skewed = gppr / gppr.sum(ctx.dim).unsqueeze(ctx.dim)
d_alpha = dY * (Y - Y_skewed) / ((ctx.alpha - 1) ** 2)
d_alpha -= dY * (S - Y_skewed * ent) / (ctx.alpha - 1)
d_alpha = d_alpha.sum(ctx.dim).unsqueeze(ctx.dim)
return dX, d_alpha, None, None, None
def _entmax_bisect_forward(X, alpha, dim, n_iter, ensure_sum_one, cls=EntmaxBisectFunction):
if not isinstance(alpha, torch.Tensor):
alpha = torch.tensor(alpha, dtype=X.dtype, device=X.device)
alpha_shape = list(X.shape)
alpha_shape[dim] = 1
alpha = alpha.expand(*alpha_shape)
d = X.shape[dim]
max_val, _ = X.max(dim=dim, keepdim=True)
X = X * (alpha - 1)
max_val = max_val * (alpha - 1)
# Note: when alpha < 1, tau_lo > tau_hi. This still works since dm < 0.
tau_lo = max_val - cls._gp(1, alpha)
tau_hi = max_val - cls._gp(1 / d, alpha)
f_lo = cls._p(X - tau_lo, alpha).sum(dim) - 1
dm = tau_hi - tau_lo
for it in range(n_iter):
dm /= 2
tau_m = tau_lo + dm
p_m = cls._p(X - tau_m, alpha)
f_m = p_m.sum(dim) - 1
mask = (f_m * f_lo >= 0).unsqueeze(dim)
tau_lo = torch.where(mask, tau_m, tau_lo)
if ensure_sum_one:
p_m /= p_m.sum(dim=dim).unsqueeze(dim=dim)
return p_m, {"alpha": alpha, "dim": dim}
# slightly more efficient special case for sparsemax
class SparsemaxBisectFunction(EntmaxBisectFunction):
@classmethod
def _gp(cls, x, alpha):
return x
@classmethod
def _gp_inv(cls, y, alpha):
return y
@classmethod
def _p(cls, x, alpha):
return torch.clamp(x, min=0)
@classmethod
def forward(cls, ctx, X, dim=-1, n_iter=50, ensure_sum_one=True):
p_m, backward_kwargs = _sparsemax_bisect_forward(X, dim, n_iter, ensure_sum_one)
ctx.alpha = backward_kwargs["alpha"]
ctx.dim = backward_kwargs["dim"]
ctx.save_for_backward(p_m)
return p_m
@classmethod
def backward(cls, ctx, dY):
(Y,) = ctx.saved_tensors
gppr = (Y > 0).to(dtype=dY.dtype)
dX = dY * gppr
q = dX.sum(ctx.dim) / gppr.sum(ctx.dim)
q = q.unsqueeze(ctx.dim)
dX -= q * gppr
return dX, None, None, None
def _sparsemax_bisect_forward(X, dim, n_iter, ensure_sum_one):
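# Forward the caller's settings instead of hard-coded values; fall back to 50 iterations
# when n_iter is None (the default of the SparsemaxBisect module).
n_iter = 50 if n_iter is None else n_iter
return _entmax_bisect_forward(
X, alpha=2, dim=dim, n_iter=n_iter, ensure_sum_one=ensure_sum_one, cls=SparsemaxBisectFunction
)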
def entmax_bisect(X, alpha=1.5, dim=-1, n_iter=50, ensure_sum_one=True, training=True):
"""alpha-entmax: normalizing sparse transform (a la softmax).
Solves the optimization problem:
max_p dot(x, p) - H_a(p) s.t. p >= 0, sum(p) == 1.
where H_a(p) is the Tsallis alpha-entropy with custom alpha >= 1,
using a bisection (root finding, binary search) algorithm.
This function is differentiable with respect to both X and alpha.
Parameters
----------
X : torch.Tensor
The input tensor.
alpha : float or torch.Tensor
Tensor of alpha parameters (> 1) to use. If scalar
or python float, the same value is used for all rows, otherwise,
it must have shape (or be expandable to)
alpha.shape[j] == (X.shape[j] if j != dim else 1)
A value of alpha=2 corresponds to sparsemax, and alpha=1 would in theory recover
softmax. For numeric reasons, this algorithm does not work with `alpha=1`: if you
want softmax, we recommend `torch.nn.functional.softmax`.
dim : int
The dimension along which to apply alpha-entmax.
n_iter : int
Number of bisection iterations. For float32, 24 iterations should
suffice for machine precision.
ensure_sum_one : bool
Whether to divide the result by its sum. If false, the result might
sum to close but not exactly 1, which might cause downstream problems.
Returns
-------
P : torch tensor, same shape as X
The projection result, such that P.sum(dim=dim) == 1 elementwise.
"""
# Avoids call to custom autograd.Function during eval to ensure torchscript compatibility
# custom autograd.Function is not scriptable: https://github.com/pytorch/pytorch/issues/22329#issuecomment-506608053
if not training:
output, _ = _entmax_bisect_forward(X, alpha, dim, n_iter, ensure_sum_one)
return output
return EntmaxBisectFunction.apply(X, alpha, dim, n_iter, ensure_sum_one)
def sparsemax_bisect(X, dim=-1, n_iter=50, ensure_sum_one=True, training=True):
"""sparsemax: normalizing sparse transform (a la softmax), via bisection.
Solves the projection:
min_p ||x - p||_2 s.t. p >= 0, sum(p) == 1.
Parameters
----------
X : torch.Tensor
The input tensor.
dim : int
The dimension along which to apply sparsemax.
n_iter : int
Number of bisection iterations. For float32, 24 iterations should
suffice for machine precision.
ensure_sum_one : bool
Whether to divide the result by its sum. If false, the result might
sum to close but not exactly 1, which might cause downstream problems.
Note: This function does not yet support normalizing along anything except
the last dimension. Please use transposing and views to achieve more
general behavior.
Returns
-------
P : torch tensor, same shape as X
The projection result, such that P.sum(dim=dim) == 1 elementwise.
"""
# Avoids call to custom autograd.Function during eval to ensure torchscript compatibility
# custom autograd.Function is not scriptable: https://github.com/pytorch/pytorch/issues/22329#issuecomment-506608053
if not training:
output, _ = _sparsemax_bisect_forward(X, dim, n_iter, ensure_sum_one)
return output
return SparsemaxBisectFunction.apply(X, dim, n_iter, ensure_sum_one)
class SparsemaxBisect(nn.Module):
def __init__(self, dim=-1, n_iter=None):
"""sparsemax: normalizing sparse transform (a la softmax) via bisection
Solves the projection:
min_p ||x - p||_2 s.t. p >= 0, sum(p) == 1.
Parameters
----------
dim : int
The dimension along which to apply sparsemax.
n_iter : int
Number of bisection iterations. For float32, 24 iterations should
suffice for machine precision.
"""
self.dim = dim
self.n_iter = n_iter
super().__init__()
def forward(self, X):
return sparsemax_bisect(X, dim=self.dim, n_iter=self.n_iter, training=self.training)
class EntmaxBisect(nn.Module):
def __init__(self, alpha=1.5, dim=-1, n_iter=50):
"""alpha-entmax: normalizing sparse map (a la softmax) via bisection.
Solves the optimization problem:
max_p dot(x, p) - H_a(p) s.t. p >= 0, sum(p) == 1.
where H_a(p) is the Tsallis alpha-entropy with custom alpha >= 1,
using a bisection (root finding, binary search) algorithm.
Parameters
----------
alpha : float or torch.Tensor
Tensor of alpha parameters (> 1) to use. If scalar
or python float, the same value is used for all rows, otherwise,
it must have shape (or be expandable to)
alpha.shape[j] == (X.shape[j] if j != dim else 1)
A value of alpha=2 corresponds to sparsemax; and alpha=1 would in theory recover
softmax. For numeric reasons, this algorithm does not work with `alpha=1`; if you
want softmax, we recommend `torch.nn.functional.softmax`.
dim : int
The dimension along which to apply alpha-entmax.
n_iter : int
Number of bisection iterations. For float32, 24 iterations should
suffice for machine precision.
"""
super().__init__()
self.dim = dim
self.n_iter = n_iter
if isinstance(alpha, torch.Tensor):
self.register_buffer("alpha", alpha)
else:
self.alpha = alpha
def forward(self, X):
return entmax_bisect(X, alpha=self.alpha, dim=self.dim, n_iter=self.n_iter, training=self.training)
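# Illustrative sketch (not part of Ludwig): the bisection idea behind alpha-entmax on a plain
# 1-D list, using only the standard library. It uses a textbook lo/hi bracket rather than the
# `dm` halving scheme above, and `entmax_bisect_1d` is a hypothetical name. It assumes alpha > 1.

```python
def entmax_bisect_1d(scores, alpha=1.5, n_iter=50):
    """Plain-Python alpha-entmax of a 1-D list via bisection (illustrative only)."""
    if alpha <= 1:
        raise ValueError("alpha must be > 1")
    d = len(scores)
    scaled = [s * (alpha - 1) for s in scores]
    max_val = max(scaled)

    def probs(tau):
        # p_i = max(x_i - tau, 0) ** (1 / (alpha - 1))
        return [max(x - tau, 0.0) ** (1.0 / (alpha - 1)) for x in scaled]

    tau_lo = max_val - 1.0                       # here sum(p) >= 1
    tau_hi = max_val - (1.0 / d) ** (alpha - 1)  # here sum(p) <= 1
    for _ in range(n_iter):
        tau_m = (tau_lo + tau_hi) / 2
        if sum(probs(tau_m)) >= 1.0:
            tau_lo = tau_m
        else:
            tau_hi = tau_m

    p = probs(tau_lo)
    total = sum(p)
    return [v / total for v in p]  # same role as ensure_sum_one=True
```

# With alpha=2 this reduces to sparsemax: entmax_bisect_1d([1.0, 0.0], alpha=2.0) puts all mass
# on the first entry, while closer scores such as [0.5, 0.0] keep both entries nonzero.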
================================================
FILE: ludwig/utils/error_handling_utils.py
================================================
import logging
from functools import partial
from retry.api import retry, retry_call
import ludwig.constants as const
logger = logging.getLogger(__name__)
default_retry_call = partial(
retry_call, tries=const.TRIES, backoff=const.BACKOFF, delay=const.DELAY, jitter=const.JITTER, logger=logger
)
default_retry = partial(
retry, tries=const.TRIES, backoff=const.BACKOFF, delay=const.DELAY, jitter=const.JITTER, logger=logger
)
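# Illustrative sketch (not part of Ludwig): what the `tries`, `delay`, `backoff` and `jitter`
# arguments bound above are generally understood to mean in the `retry` package, written with
# the standard library only. `retry_call_sketch` and its defaults are hypothetical; the actual
# values of const.TRIES, const.DELAY, const.BACKOFF and const.JITTER are not shown here.

```python
import time


def retry_call_sketch(fn, tries=3, delay=0.0, backoff=2.0, jitter=0.0, sleep=time.sleep):
    """Call fn() until it succeeds or `tries` attempts have failed."""
    attempt_delay = delay
    while True:
        try:
            return fn()
        except Exception:
            tries -= 1
            if tries <= 0:
                raise  # out of attempts: surface the last error
            sleep(attempt_delay)
            # Multiplicative backoff plus a constant jitter term before the next attempt
            attempt_delay = attempt_delay * backoff + jitter
```

# Passing `sleep` as a parameter is a design choice for this sketch only: it lets a caller
# substitute a fake sleeper and inspect the delay schedule without waiting.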
================================================
FILE: ludwig/utils/eval_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from collections import OrderedDict
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
logger = logging.getLogger(__name__)
class ConfusionMatrix:
def __init__(self, conditions, predictions, labels=None, sample_weight=None):
# assert (len(predictions) == len(conditions))
min_length = min(len(predictions), len(conditions))
self.predictions = predictions[:min_length]
self.conditions = conditions[:min_length]
if labels is not None:
self.label2idx = {label: idx for idx, label in enumerate(labels)}
self.idx2label = {idx: label for idx, label in enumerate(labels)}
labels = list(range(len(labels)))
else:
self.label2idx = {
str(label): idx for idx, label in enumerate(np.unique([self.predictions, self.conditions]))
}
self.idx2label = {
idx: str(label) for idx, label in enumerate(np.unique([self.predictions, self.conditions]))
}
self.cm = confusion_matrix(self.conditions, self.predictions, labels=labels, sample_weight=sample_weight)
# if labels is not None:
# self.labels_dict = {label: idx for idx, label in enumerate(labels)}
# else:
# if conditions.dtype.char == 'S': # it's an array of strings
# self.labels_dict = {str(label): idx for idx, label in
# enumerate(np.unique([predictions, conditions]))}
# else: # number
# max_label = np.concatenate([predictions, conditions]).max()
# self.labels_dict = {str(i): i for i in range(max_label + 1)}
# labels = [str(i) for i in range(max_label + 1)]
# self.cm = confusion_matrix(conditions, predictions, labels, sample_weight)
self.sum_predictions = np.sum(self.cm, axis=0)
self.sum_conditions = np.sum(self.cm, axis=1)
self.all = np.sum(self.cm)
def label_to_idx(self, label):
return self.label2idx[label]
def true_positives(self, idx):
return self.cm[idx, idx]
def true_negatives(self, idx):
return self.all - self.sum_predictions[idx] - self.sum_conditions[idx] + self.true_positives(idx)
def false_positives(self, idx):
return self.sum_predictions[idx] - self.true_positives(idx)
def false_negatives(self, idx):
return self.sum_conditions[idx] - self.true_positives(idx)
def true_positive_rate(self, idx):
nom = self.true_positives(idx)
den = self.sum_conditions[idx]
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def true_negative_rate(self, idx):
nom = tn = self.true_negatives(idx)
den = tn + self.false_positives(idx)
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def positive_predictive_value(self, idx):
nom = self.true_positives(idx)
den = self.sum_predictions[idx]
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def negative_predictive_value(self, idx):
nom = tn = self.true_negatives(idx)
den = tn + self.false_negatives(idx)
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def false_negative_rate(self, idx):
return 1.0 - self.true_positive_rate(idx)
def false_positive_rate(self, idx):
return 1.0 - self.true_negative_rate(idx)
def false_discovery_rate(self, idx):
return 1.0 - self.positive_predictive_value(idx)
def false_omission_rate(self, idx):
return 1.0 - self.negative_predictive_value(idx)
def accuracy(self, idx):
nom = self.true_positives(idx) + self.true_negatives(idx)
den = self.all
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def precision(self, idx):
return self.positive_predictive_value(idx)
def recall(self, idx):
return self.true_positive_rate(idx)
def fbeta_score(self, beta, idx):
beta_2 = np.power(beta, 2)
precision = self.precision(idx)
recall = self.recall(idx)
nom = (1 + beta_2) * precision * recall
den = (beta_2 * precision) + recall
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def f1_score(self, idx):
return self.fbeta_score(1, idx)
def sensitivity(self, idx):
return self.true_positive_rate(idx)
def specificity(self, idx):
return self.true_negative_rate(idx)
def hit_rate(self, idx):
return self.true_positive_rate(idx)
def miss_rate(self, idx):
return self.false_negative_rate(idx)
def fall_out(self, idx):
return self.false_positive_rate(idx)
def matthews_correlation_coefficient(self, idx):
tp = self.true_positives(idx)
tn = self.true_negatives(idx)
fp = self.false_positives(idx)
fn = self.false_negatives(idx)
nom = tp * tn - fp * fn
den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
if den == 0 or np.isnan(den):
return 0
else:
return nom / den
def informedness(self, idx):
return self.true_positive_rate(idx) + self.true_negative_rate(idx) - 1
def markedness(self, idx):
return self.positive_predictive_value(idx) + self.negative_predictive_value(idx) - 1
def token_accuracy(self):
return metrics.accuracy_score(self.conditions, self.predictions)
def avg_precision(self, average="macro"):
return metrics.precision_score(self.conditions, self.predictions, average=average)
def avg_recall(self, average="macro"):
return metrics.recall_score(self.conditions, self.predictions, average=average)
def avg_f1_score(self, average="macro"):
return metrics.f1_score(self.conditions, self.predictions, average=average)
def avg_fbeta_score(self, beta, average="macro"):
return metrics.fbeta_score(self.conditions, self.predictions, beta=beta, average=average)
def kappa_score(self):
return metrics.cohen_kappa_score(self.conditions, self.predictions)
def class_stats(self, idx):
return {
"true_positives": self.true_positives(idx),
"true_negatives": self.true_negatives(idx),
"false_positives": self.false_positives(idx),
"false_negatives": self.false_negatives(idx),
"true_positive_rate": self.true_positive_rate(idx),
"true_negative_rate": self.true_negative_rate(idx),
"positive_predictive_value": self.positive_predictive_value(idx),
"negative_predictive_value": self.negative_predictive_value(idx),
"false_negative_rate": self.false_negative_rate(idx),
"false_positive_rate": self.false_positive_rate(idx),
"false_discovery_rate": self.false_discovery_rate(idx),
"false_omission_rate": self.false_omission_rate(idx),
"accuracy": self.accuracy(idx),
"precision": self.precision(idx),
"recall": self.recall(idx),
"f1_score": self.f1_score(idx),
"sensitivity": self.sensitivity(idx),
"specificity": self.specificity(idx),
"hit_rate": self.hit_rate(idx),
"miss_rate": self.miss_rate(idx),
"fall_out": self.fall_out(idx),
"matthews_correlation_coefficient": self.matthews_correlation_coefficient(idx),
"informedness": self.informedness(idx),
"markedness": self.markedness(idx),
}
def per_class_stats(self):
stats = OrderedDict()
for idx in sorted(self.idx2label.keys()):
stats[self.idx2label[idx]] = self.class_stats(idx)
return stats
def stats(self):
return {
"token_accuracy": self.token_accuracy(),
"avg_precision_macro": self.avg_precision(average="macro"),
"avg_recall_macro": self.avg_recall(average="macro"),
"avg_f1_score_macro": self.avg_f1_score(average="macro"),
"avg_precision_micro": self.avg_precision(average="micro"),
"avg_recall_micro": self.avg_recall(average="micro"),
"avg_f1_score_micro": self.avg_f1_score(average="micro"),
"avg_precision_weighted": self.avg_precision(average="weighted"),
"avg_recall_weighted": self.avg_recall(average="weighted"),
"avg_f1_score_weighted": self.avg_f1_score(average="weighted"),
"kappa_score": self.kappa_score(),
}
def roc_curve(conditions, prediction_scores, pos_label=None, sample_weight=None):
return metrics.roc_curve(conditions, prediction_scores, pos_label=pos_label, sample_weight=sample_weight)
def roc_auc_score(conditions, prediction_scores, average="micro", sample_weight=None):
try:
return metrics.roc_auc_score(conditions, prediction_scores, average=average, sample_weight=sample_weight)
except ValueError as ve:
logger.info(ve)
def precision_recall_curve(conditions, prediction_scores, pos_label=None, sample_weight=None):
return metrics.precision_recall_curve(
conditions, prediction_scores, pos_label=pos_label, sample_weight=sample_weight
)
def average_precision_score(conditions, prediction_scores, average="micro", sample_weight=None):
# average is one of: "micro", "macro", "samples", "weighted"
return metrics.average_precision_score(conditions, prediction_scores, average=average, sample_weight=sample_weight)
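# Illustrative sketch (not part of Ludwig): how the per-class counts used by ConfusionMatrix
# follow from row and column sums, and why a NaN guard needs an isnan check rather than `==`.
# Plain nested lists stand in for the NumPy matrix; `per_class_counts` and `safe_ratio` are
# hypothetical names. Rows are true classes and columns are predicted classes.

```python
import math


def per_class_counts(cm, idx):
    """Return (tp, tn, fp, fn) for class `idx` of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    sum_conditions = sum(cm[idx])                  # row sum: samples whose true class is idx
    sum_predictions = sum(row[idx] for row in cm)  # column sum: samples predicted as idx
    tp = cm[idx][idx]
    fp = sum_predictions - tp
    fn = sum_conditions - tp
    tn = total - sum_predictions - sum_conditions + tp  # tp was subtracted twice, add it back
    return tp, tn, fp, fn


def safe_ratio(nom, den):
    """Return nom / den, or 0 when the denominator is zero or NaN."""
    # NaN never compares equal to anything, including itself, so `den == nan` is always False.
    if den == 0 or math.isnan(den):
        return 0
    return nom / den
```

# For cm = [[5, 1, 0], [2, 3, 1], [0, 2, 6]] and idx = 0 the counts are tp=5, tn=12, fp=2, fn=1,
# so recall is safe_ratio(5, 5 + 1) and precision is safe_ratio(5, 5 + 2).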
================================================
FILE: ludwig/utils/fs_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2021 Linux Foundation.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import contextlib
import errno
import functools
import logging
import os
import pathlib
import shutil
import tempfile
import uuid
from urllib.parse import unquote, urlparse
import certifi
import fsspec
import h5py
import pyarrow.fs
import urllib3
from filelock import FileLock
from fsspec.core import split_protocol
from ludwig.api_annotations import DeveloperAPI
logger = logging.getLogger(__name__)
@DeveloperAPI
def get_default_cache_location() -> str:
"""Returns a path to the default LUDWIG_CACHE location, or $HOME/.ludwig_cache."""
cache_path = None
if "LUDWIG_CACHE" in os.environ and os.environ["LUDWIG_CACHE"]:
cache_path = os.environ["LUDWIG_CACHE"]
else:
cache_path = str(pathlib.Path.home().joinpath(".ludwig_cache"))
# Create the cache directory if it does not exist; exist_ok avoids a race between processes
os.makedirs(cache_path, exist_ok=True)
return cache_path
@DeveloperAPI
def get_fs_and_path(url):
protocol, path = split_protocol(url)
# Parse the url to get only the escaped url path
path = unquote(urlparse(path).path)
# Create a windows compatible path from url path
path = os.fspath(pathlib.PurePosixPath(path))
fs = fsspec.filesystem(protocol)
return fs, path
@DeveloperAPI
def has_remote_protocol(url):
protocol, _ = split_protocol(url)
return protocol and protocol != "file"
@DeveloperAPI
def is_http(urlpath):
protocol, _ = split_protocol(urlpath)
return protocol == "http" or protocol == "https"
@DeveloperAPI
def upgrade_http(urlpath):
protocol, url = split_protocol(urlpath)
if protocol == "http":
return "https://" + url
return None
@DeveloperAPI
@functools.lru_cache(maxsize=32)
def get_bytes_obj_from_path(path: str) -> bytes | None:
if is_http(path):
try:
return get_bytes_obj_from_http_path(path)
except Exception as e:
logger.warning(e)
return None
else:
try:
with open_file(path) as f:
return f.read()
except OSError as e:
logger.warning(e)
return None
@DeveloperAPI
def stream_http_get_request(path: str) -> urllib3.response.HTTPResponse:
if upgrade_http(path):
http = urllib3.PoolManager()
else:
http = urllib3.PoolManager(ca_certs=certifi.where())
resp = http.request("GET", path, preload_content=False)
return resp
@DeveloperAPI
@functools.lru_cache(maxsize=32)
def get_bytes_obj_from_http_path(path: str) -> bytes:
resp = stream_http_get_request(path)
if resp.status == 404:
upgraded = upgrade_http(path)
if upgraded:
logger.info(f"reading url {path} failed. upgrading to https and retrying")
return get_bytes_obj_from_http_path(upgraded)
else:
raise urllib3.exceptions.HTTPError(f"reading url {path} failed and cannot be upgraded to https")
# Stream the body in 1 KiB chunks and join once instead of repeatedly concatenating bytes
return b"".join(resp.stream(1024))
@DeveloperAPI
def find_non_existing_dir_by_adding_suffix(directory_name):
fs, _ = get_fs_and_path(directory_name)
suffix = 0
curr_directory_name = directory_name
while fs.exists(curr_directory_name):
curr_directory_name = directory_name + "_" + str(suffix)
suffix += 1
return curr_directory_name
@DeveloperAPI
def abspath(url):
protocol, _ = split_protocol(url)
if protocol is not None:
# we assume any path containing an explicit protocol is fully qualified
return url
return os.path.abspath(url)
@DeveloperAPI
def path_exists(url):
fs, path = get_fs_and_path(url)
return fs.exists(path)
@DeveloperAPI
def listdir(url):
fs, path = get_fs_and_path(url)
return fs.listdir(path)
@DeveloperAPI
def safe_move_file(src, dst):
"""Rename a file from `src` to `dst`.
Inspired by: https://alexwlchan.net/2019/03/atomic-cross-filesystem-moves-in-python/
* Moves must be atomic. `shutil.move()` is not atomic.
* Moves must work across filesystems. Sometimes temp directories and the
model directories live on different filesystems. `os.replace()` will
throw errors if run across filesystems.
So we try `os.replace()`, but if we detect a cross-filesystem copy, we
switch to `shutil.move()` with some wrappers to make it atomic.
"""
try:
os.replace(src, dst)
except OSError as err:
if err.errno == errno.EXDEV:
# Generate a unique ID, and copy `src` to the target directory with a temporary name
# `{dst}.{copy_id}.tmp`. Because we're copying across a filesystem boundary, this initial copy may not
# be atomic. We insert a random UUID so if different processes are copying into `dst`, they don't
# overlap in their tmp copies.
copy_id = uuid.uuid4()
tmp_dst = f"{dst}.{copy_id}.tmp"
shutil.copyfile(src, tmp_dst)
# Atomic replace file onto the new name, and clean up original source file.
os.replace(tmp_dst, dst)
os.unlink(src)
else:
raise
@DeveloperAPI
def safe_move_directory(src, dst):
"""Recursively moves files from src directory to dst directory and removes src directory.
If dst directory does not exist, it will be created.
"""
try:
os.replace(src, dst)
except OSError as err:
if err.errno == errno.EXDEV:
# Generate a unique ID, and copy `src` to the target directory with a temporary name
# `{dst}.{copy_id}.tmp`. Because we're copying across a filesystem boundary, this initial copy may not
# be atomic. We insert a random UUID so if different processes are copying into `dst`, they don't
# overlap in their tmp copies.
copy_id = uuid.uuid4()
tmp_dst = f"{dst}.{copy_id}.tmp"
shutil.copytree(src, tmp_dst)
# Atomic replace directory name onto the new name, and clean up original source directory.
os.replace(tmp_dst, dst)
# src is a directory here, so remove the whole tree (os.unlink only works on files)
shutil.rmtree(src)
else:
raise
@DeveloperAPI
def rename(src, tgt):
protocol, _ = split_protocol(tgt)
if protocol is not None:
fs = fsspec.filesystem(protocol)
fs.mv(src, tgt, recursive=True)
else:
safe_move_file(src, tgt)
@DeveloperAPI
def upload_file(src, tgt):
protocol, _ = split_protocol(tgt)
fs = fsspec.filesystem(protocol)
fs.put(src, tgt)
@DeveloperAPI
def copy(src, tgt, recursive=False):
protocol, _ = split_protocol(tgt)
fs = fsspec.filesystem(protocol)
fs.copy(src, tgt, recursive=recursive)
@DeveloperAPI
def makedirs(url, exist_ok=False):
fs, path = get_fs_and_path(url)
fs.makedirs(path, exist_ok=exist_ok)
@DeveloperAPI
def delete(url, recursive=False):
fs, path = get_fs_and_path(url)
return fs.delete(path, recursive=recursive)
@DeveloperAPI
def upload(lpath, rpath):
fs, path = get_fs_and_path(rpath)
pyarrow.fs.copy_files(lpath, path, destination_filesystem=pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs)))
@DeveloperAPI
def download(rpath, lpath):
fs, path = get_fs_and_path(rpath)
pyarrow.fs.copy_files(path, lpath, source_filesystem=pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs)))
@DeveloperAPI
def checksum(url):
fs, path = get_fs_and_path(url)
return fs.checksum(path)
@DeveloperAPI
def to_url(path):
protocol, _ = split_protocol(path)
if protocol is not None:
return path
return pathlib.Path(os.path.abspath(path)).as_uri()
@DeveloperAPI
@contextlib.contextmanager
def upload_output_directory(url):
if url is None:
yield None, None
return
if has_remote_protocol(url):
# To avoid extra network load, write all output files locally at runtime,
# then upload to the remote fs at the end.
with tempfile.TemporaryDirectory() as tmpdir:
fs, remote_path = get_fs_and_path(url)
# In cases where we are resuming from a previous run, we first need to download
# the artifacts from the remote filesystem
if path_exists(url):
fs.get(url, tmpdir + "/", recursive=True)
def put_fn():
# Use pyarrow API here as fs.put() is inconsistent in where it uploads the file
# See: https://github.com/fsspec/filesystem_spec/issues/1062
pyarrow.fs.copy_files(
tmpdir, remote_path, destination_filesystem=pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs))
)
# Write to temp directory locally
yield tmpdir, put_fn
# Upload to remote when finished
put_fn()
else:
# For local paths (including file:// URIs), use the path directly.
_, local_path = get_fs_and_path(url)
makedirs(local_path, exist_ok=True)
yield local_path, None
@DeveloperAPI
@contextlib.contextmanager
def open_file(url, *args, **kwargs):
fs, path = get_fs_and_path(url)
with fs.open(path, *args, **kwargs) as f:
yield f
@DeveloperAPI
@contextlib.contextmanager
def download_h5(url):
with tempfile.TemporaryDirectory() as tmpdir:
local_path = os.path.join(tmpdir, os.path.basename(url))
fs, path = get_fs_and_path(url)
fs.get(path, local_path)
with h5py.File(local_path, "r") as f:
yield f
@DeveloperAPI
@contextlib.contextmanager
def upload_h5(url):
with upload_output_file(url) as local_fname:
mode = "w"
if url == local_fname and path_exists(url):
mode = "r+"
with h5py.File(local_fname, mode) as f:
yield f
@DeveloperAPI
@contextlib.contextmanager
def upload_output_file(url):
"""Takes a remote URL as input, returns a temp filename, then uploads it when done."""
protocol, _ = split_protocol(url)
if protocol is not None:
fs = fsspec.filesystem(protocol)
with tempfile.TemporaryDirectory() as tmpdir:
local_fname = os.path.join(tmpdir, "tmpfile")
yield local_fname
fs.put(local_fname, url, recursive=True)
else:
yield url
@DeveloperAPI
class file_lock(contextlib.AbstractContextManager):
"""File lock based on filelock package."""
def __init__(self, path: str, ignore_remote_protocol: bool = True, lock_file: str = ".lock") -> None:
if not isinstance(path, (str, os.PathLike, pathlib.Path)):
self.lock = None
else:
path = os.path.join(path, lock_file) if os.path.isdir(path) else f"{path}./{lock_file}"
if ignore_remote_protocol and has_remote_protocol(path):
self.lock = None
else:
self.lock = FileLock(path, timeout=-1)
def __enter__(self, *args, **kwargs):
if self.lock:
return self.lock.__enter__(*args, **kwargs)
def __exit__(self, *args, **kwargs):
if self.lock:
return self.lock.__exit__(*args, **kwargs)
@DeveloperAPI
def list_file_names_in_directory(directory_name: str) -> list[str]:
file_path: pathlib.Path # noqa [F842] # incorrect flagging of "local variable is annotated but never used"
file_names: list[str] = [
file_path.name for file_path in pathlib.Path(directory_name).iterdir() if file_path.is_file()
]
return file_names
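# Illustrative sketch (not part of Ludwig): the copy-then-replace fallback that safe_move_file
# uses when os.replace() fails across filesystems, isolated so it can be tried in one temporary
# directory. `copy_then_replace` is a hypothetical name, and the sketch always takes the
# fallback path instead of first attempting os.replace(src, dst).

```python
import os
import shutil
import tempfile
import uuid


def copy_then_replace(src, dst):
    """Copy src next to dst under a unique temporary name, then swap it into place."""
    tmp_dst = f"{dst}.{uuid.uuid4()}.tmp"
    shutil.copyfile(src, tmp_dst)  # may be slow and non-atomic across filesystems
    os.replace(tmp_dst, dst)       # same-directory rename, atomic on POSIX filesystems
    os.unlink(src)                 # clean up the original only after dst is in place


with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "model.tmp")
    dst_path = os.path.join(workdir, "model.bin")
    with open(src_path, "w") as f:
        f.write("payload")
    copy_then_replace(src_path, dst_path)
    with open(dst_path) as f:
        moved_content = f.read()
    remaining_files = sorted(os.listdir(workdir))
```

# Readers of `dst` see either the old file or the complete new one, never a partial copy,
# because the only step that touches `dst` is the final rename.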
================================================
FILE: ludwig/utils/h3_util.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from typing import NamedTuple
class H3Data(NamedTuple):
mode: int
edge: int
resolution: int
base_cell: int
cells: list[int]
def set_bit(v, index, x):
"""Set the index:th bit of v to 1 if x is truthy, else to 0, and return the new value."""
mask = 1 << index # Compute mask, an integer with just bit 'index' set.
v &= ~mask # Clear the bit indicated by the mask (if x is False)
if x:
v |= mask # If x was True, set the bit indicated by the mask.
return v # Return the result, we're done.
def set_bits(v, start_bit, slice_length, x):
bin_x = bin(x)
for i, index in enumerate(range(start_bit, start_bit + slice_length)):
val = int(bin_x[-(i + 1)]) if 2 + i < len(bin_x) else 0
v = set_bit(v, index, val)
return v
def components_to_h3(components):
h3 = 18446744073709551615
h3 = set_bits(h3, 64 - 5, 4, components["mode"])
h3 = set_bits(h3, 64 - 8, 3, components["edge"])
h3 = set_bits(h3, 64 - 12, 4, components["resolution"])
h3 = set_bits(h3, 64 - 19, 7, components["base_cell"])
for i, cell in enumerate(components["cells"]):
h3 = set_bits(h3, 64 - 19 - (i + 1) * 3, 3, cell)
h3 = set_bits(h3, 64 - 1, 4, 0)
return h3
def bitslice(x: int, start_bit: int, slice_length: int) -> int:
ones_mask: int = int(2**slice_length - 1)
return (x & (ones_mask << start_bit)) >> start_bit
def h3_index_mode(h3_long: int) -> int:
return bitslice(h3_long, 64 - 5, 4)
def h3_edge(h3_long: int) -> int:
return bitslice(h3_long, 64 - 8, 3)
def h3_resolution(h3_long: int) -> int:
return bitslice(h3_long, 64 - 12, 4)
def h3_base_cell(h3_long: int) -> int:
return bitslice(h3_long, 64 - 19, 7)
def h3_octal_components(h3_long):
res = h3_resolution(h3_long)
return "{0:0{w}o}".format(bitslice(h3_long + 2**63, 64 - 19 - 3 * res, 3 * res), w=res)
def h3_component(h3_long: int, i: int) -> int:
return bitslice(h3_long, 64 - 19 - 3 * i, 3)
def h3_components(h3_long: int) -> list[int]:
return [h3_component(h3_long, i) for i in range(1, h3_resolution(h3_long) + 1)]
def h3_to_components(h3_value: int) -> H3Data:
"""Extract the values from an H3 hexadecimal value. Refer to this for the bit layout:
https://uber.github.io/h3/#/documentation/core-library/h3-index-representations
"""
# lat_long = (0, 0) # h3ToGeo(h3_value)
return H3Data(
mode=h3_index_mode(h3_value),
edge=h3_edge(h3_value),
resolution=h3_resolution(h3_value),
base_cell=h3_base_cell(h3_value),
cells=h3_components(h3_value),
)
if __name__ == "__main__":
value = 622236723497533439
components = h3_to_components(value)
h3 = components_to_h3(components._asdict())  # components_to_h3 indexes by key, so pass a dict, not the NamedTuple
components2 = h3_to_components(h3)
print(value)
print(components)
print(h3)
print(components2)
================================================
FILE: ludwig/utils/heuristics.py
================================================
from ludwig.schema.model_config import ModelConfig
from ludwig.utils.config_utils import has_pretrained_encoder, has_trainable_encoder, has_unstructured_input_feature
def get_auto_learning_rate(config: ModelConfig) -> float:
"""Uses config heuristics to determine an appropriate learning rate.
The main idea behind the following heuristics is that smaller learning rates are more
suitable for features with larger encoders, which are typically used with unstructured features.
Note that these are meant to be rough heuristics that are solely based on feature types and the
type of the corresponding encoder. More factors could be taken into consideration such as model
size, dataset size, batch size, number of features, etc.
Args:
config: Ludwig config used to train the model.
"""
if not has_unstructured_input_feature(config):
return 0.001
if not has_pretrained_encoder(config):
return 0.0001
if has_trainable_encoder(config):
return 0.00001
return 0.00002
================================================
FILE: ludwig/utils/hf_utils.py
================================================
import logging
import os
import tempfile
from os import PathLike
from transformers import AutoTokenizer, PreTrainedModel
from transformers.tokenization_utils import PreTrainedTokenizer
from ludwig.api_annotations import DeveloperAPI
from ludwig.utils.error_handling_utils import default_retry
from ludwig.utils.fs_utils import download, path_exists
from ludwig.utils.upload_utils import hf_hub_login
logger = logging.getLogger(__name__)
@default_retry()
def load_pretrained_hf_model_from_hub(
model_class: type,
pretrained_model_name_or_path: str | PathLike | None,
**pretrained_kwargs,
) -> PreTrainedModel:
"""Download a HuggingFace model.
Downloads a model from the HuggingFace zoo with retry on failure.
Args:
model_class: Class of the model to download.
pretrained_model_name_or_path: Name of the model to download.
pretrained_kwargs: Additional arguments to pass to the model constructor.
Returns:
The pretrained model object.
"""
return model_class.from_pretrained(pretrained_model_name_or_path, **pretrained_kwargs)
@default_retry()
def load_pretrained_hf_tokenizer(
pretrained_model_name_or_path: str | PathLike | None, **pretrained_kwargs
) -> PreTrainedTokenizer:
"""Download a HuggingFace tokenizer.
Args:
pretrained_model_name_or_path: Name of the tokenizer to download.
pretrained_kwargs: Additional arguments to pass to the tokenizer constructor.
Returns:
The pretrained tokenizer object.
"""
return AutoTokenizer.from_pretrained(pretrained_model_name_or_path, **pretrained_kwargs)
def _load_pretrained_hf_model_from_dir(
model_class: type,
pretrained_model_name_or_path: str | PathLike | None,
**pretrained_kwargs,
) -> PreTrainedModel:
"""Downloads a model to a local temporary directory and loads a pretrained HF model from that directory."""
with tempfile.TemporaryDirectory() as tmpdir:
download(pretrained_model_name_or_path, tmpdir)
return model_class.from_pretrained(tmpdir, **pretrained_kwargs)
@DeveloperAPI
def load_pretrained_hf_model_with_hub_fallback(
model_class: type,
pretrained_model_name_or_path: str | PathLike | None,
**pretrained_kwargs,
) -> tuple[PreTrainedModel, bool]:
"""Returns the model and a boolean indicating whether the model was downloaded from the HuggingFace hub.
If the `LUDWIG_PRETRAINED_MODELS_DIR` environment variable is set, we attempt to load the HF model from this
directory, falling back to downloading from the HF hub if the model is not found, downloading fails, or if model
initialization fails.
`LUDWIG_PRETRAINED_MODELS_DIR` can be an s3 path. Weights are copied to a local temporary directory, and the model
is loaded from there.
The expected structure of the `LUDWIG_PRETRAINED_MODELS_DIR` directory is:
{LUDWIG_PRETRAINED_MODELS_DIR}/{pretrained_model_name_or_path}/pytorch_model.bin
{LUDWIG_PRETRAINED_MODELS_DIR}/{pretrained_model_name_or_path}/config.json
For example, if `LUDWIG_PRETRAINED_MODELS_DIR` is set to `s3://my-bucket/pretrained-models`, and
`pretrained_model_name_or_path` is set to `bert-base-uncased`, we expect to find the following files:
s3://my-bucket/pretrained-models/bert-base-uncased/
- pytorch_model.bin
- config.json
If the `LUDWIG_PRETRAINED_MODELS_DIR` environment variable is not set, we download the model from the HF hub.
"""
pretrained_models_dir = os.environ.get("LUDWIG_PRETRAINED_MODELS_DIR")
if pretrained_models_dir:
pretrained_model_path = os.path.join(pretrained_models_dir, pretrained_model_name_or_path)
if path_exists(pretrained_model_path):
try:
logger.info(
f"Found existing pretrained model artifact {pretrained_model_name_or_path} in directory "
f"{pretrained_models_dir}. Downloading."
)
return (
_load_pretrained_hf_model_from_dir(model_class, pretrained_model_path, **pretrained_kwargs),
False,
)
except Exception as e:
logger.warning(
f"Failed to download pretrained model from {pretrained_models_dir} with error {e}. "
"Falling back to HuggingFace model hub."
)
# Fallback to HF hub.
return load_pretrained_hf_model_from_hub(model_class, pretrained_model_name_or_path, **pretrained_kwargs), True
def upload_folder_to_hfhub(
repo_id: str,
folder_path: str,
repo_type: str | None = "model",
private: bool | None = False,
path_in_repo: str | None = None, # defaults to root of repo
commit_message: str | None = None,
commit_description: str | None = None,
) -> None:
"""Uploads a local folder to the Hugging Face Model Hub.
Args:
repo_id (str): The ID of the target repository on the Hugging Face Model Hub.
folder_path (str): The local path to the folder to be uploaded.
repo_type (str, optional): The type of the repository ('model', 'dataset', or 'space').
Defaults to 'model'.
private (bool, optional): If True, the repository will be private; otherwise, it will be public.
Defaults to False.
path_in_repo (str, optional): The relative path within the repository where the folder should be uploaded.
Defaults to None, which means the root of the repository.
commit_message (str, optional): A message for the commit associated with the upload.
commit_description (str, optional): A description for the commit associated with the upload.
Raises:
FileNotFoundError: If the specified folder does not exist.
ValueError: If the specified folder is empty, a file, or if an invalid 'repo_type' is provided.
ValueError: If the upload process fails for any reason.
Returns:
None
"""
# Make sure the folder exists
if not os.path.exists(folder_path):
raise FileNotFoundError(f"Folder {folder_path} does not exist.")
# Make sure the folder is not a file
if os.path.isfile(folder_path):
raise ValueError(f"Folder {folder_path} is a file. Please provide a folder.")
# Make sure the folder is not empty
if not os.listdir(folder_path):
raise ValueError(f"Folder {folder_path} is empty.")
if repo_type not in {"model", "dataset", "space"}:
raise ValueError(f"Invalid repo_type {repo_type}. Valid values are 'model', 'dataset', and 'space'.")
# Login to the hub
api = hf_hub_login()
# Create the repo if it doesn't exist. This is a no-op if the repo already exists
# This is required because the API doesn't allow uploading to a non-existent repo
if not api.repo_exists(repo_id, repo_type=repo_type):
logger.info(f"{repo_id} does not exist. Creating.")
api.create_repo(repo_id, private=private, exist_ok=True, repo_type=repo_type)
# Upload the folder
try:
logger.info(f"Uploading folder {folder_path} to repo {repo_id}.")
api.upload_folder(
repo_id=repo_id,
folder_path=folder_path,
repo_type=repo_type,
path_in_repo=path_in_repo,
commit_message=commit_message,
commit_description=commit_description,
)
except Exception as e:
raise ValueError(f"Failed to upload folder {folder_path} to repo {repo_id}") from e
================================================
FILE: ludwig/utils/html_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import re
from html.parser import HTMLParser
from ludwig.utils import strings_utils
logger = logging.getLogger(__name__)
class HTMLStripper(HTMLParser):
def __init__(self):
super().__init__()
self.reset()
self.strict = False
self.convert_charrefs = True
self.fed = []
def handle_data(self, data):
self.fed.append(data)
def get_data(self):
return "".join(self.fed)
def error(self, message):
logger.error(message)
def strip_tags(html):
stripper = HTMLStripper()
stripper.feed(html)
return stripper.get_data()
# regular expressions for cleaning text
res_pre = [(re.compile(r"([^.:;\?\!>])(<br/?>)"), r"\1.\2"), (re.compile(r"<br/?>"), r" ")]
res_post = [
(re.compile(r"[ \t\0]"), r" "),
(re.compile(r"[–_]"), r"-"),
(re.compile(r"[\’\‘]"), r"'"),
(re.compile(r"[”“]"), r'"'),
(re.compile(r"℅"), r"%"),
(re.compile(r"([^.>])(<br/?>)"), r"\1.\2"),
(re.compile(r"\\\\[NnRr]"), r" "),
(re.compile(r"\\[NnRr]"), r" "),
(re.compile(r"[\n\r]"), r" "),
(re.compile(r"\\\\"), r" / "),
(re.compile(r"<br/?>"), r" "),
(re.compile(r"\\\\" ""), r"\'"),
(re.compile(r"^\'([^\']+)$"), r"\1"),
(re.compile(r"([\<\>\{\}\[\]\(\)\-\+\=:;,\./\?\!\$%&£#@\'₹ ])\1+"), r"\1"),
(
re.compile(
r"[^qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890\<\>\{\}\[\]\(\)\-\+\=:;,\./\?\!\$%&£#@\'₹ ]" # noqa
),
r" ",
),
(re.compile(r"\s{2,}"), r" "),
]
def clean_html(html_text):
# Apply the pre-cleaning regexes, strip tags and accents, then apply the post-cleaning regexes.
html_text, _ = strings_utils.match_replace(html_text, res_pre)
html_text = strip_tags(html_text)
html_text = strings_utils.strip_accents(html_text)
html_text, _ = strings_utils.match_replace(html_text, res_post)
return html_text
================================================
FILE: ludwig/utils/image_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import warnings
from collections.abc import Callable, Iterable
from dataclasses import dataclass
from io import BytesIO
import numpy as np
import torch
import torchvision.transforms.functional as F
from torchvision.io import decode_image, ImageReadMode
from torchvision.models._api import WeightsEnum
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import CROP_OR_PAD, IMAGE_MAX_CLASSES, INTERPOLATE
from ludwig.utils.data_utils import get_abs_path
from ludwig.utils.fs_utils import get_bytes_obj_from_path
from ludwig.utils.registry import Registry
@dataclass
class TVModelVariant:
# Model variant identifier
variant_id: str | int
# TorchVision function to create model class
create_model_function: Callable
# Torchvision class for model weights
model_weights: WeightsEnum
logger = logging.getLogger(__name__)
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp", ".gif")
@DeveloperAPI
class ResizeChannels(torch.nn.Module):
def __init__(self, num_channels: int):
super().__init__()
self.num_channels = num_channels
def forward(self, imgs: torch.Tensor):
original_imgs_shape = imgs.shape
if len(original_imgs_shape) == 3: # if shape is (C, H, W), add batch dimension
imgs = imgs.unsqueeze(0)
channels = imgs.shape[1]
if channels > self.num_channels:
# take the first `self.num_channels` channels
imgs = imgs[:, : self.num_channels, :, :]
elif channels < self.num_channels:
# repeat and use the first `self.num_channels` channels
imgs = imgs.repeat(1, (self.num_channels // channels) + 1, 1, 1)[:, : self.num_channels, :, :]
if len(original_imgs_shape) == 3: # if shape was (C, H, W), remove batch dimension
return imgs[0]
return imgs
@DeveloperAPI
def get_gray_default_image(num_channels: int, height: int, width: int) -> np.ndarray:
return np.full((num_channels, height, width), 128, dtype=np.float32)
@DeveloperAPI
def get_average_image(image_lst: list[np.ndarray]) -> np.ndarray:
return np.mean([x for x in image_lst if x is not None], axis=(0), dtype=np.float32)
@DeveloperAPI
def is_bytes_image(bytes_obj) -> bool:
"""Check if a bytes object is an image using PIL."""
try:
from io import BytesIO
from PIL import Image
if isinstance(bytes_obj, bytes):
bytes_obj = BytesIO(bytes_obj)
Image.open(bytes_obj).verify()
return True
except Exception:
return False
def is_image(src_path: str, img_entry: bytes | str, column: str) -> bool:
if not isinstance(img_entry, str):
return False
try:
from io import BytesIO
from PIL import Image
path = get_abs_path(src_path, img_entry)
bytes_obj = get_bytes_obj_from_path(path)
if isinstance(bytes_obj, bytes):
bytes_obj = BytesIO(bytes_obj)
Image.open(bytes_obj).verify()
return True
except Exception as e:
logger.warning(f"While assessing potential image in is_image() for column {column}, encountered exception: {e}")
return False
@DeveloperAPI
def is_image_score(path):
return int(isinstance(path, str) and path.lower().endswith(IMAGE_EXTENSIONS))
@DeveloperAPI
def get_image_read_mode_from_num_channels(num_channels: int) -> ImageReadMode:
"""Returns the torchvision.io.ImageReadMode corresponding to the number of channels.
If num_channels is not recognized, returns ImageReadMode.UNCHANGED.
"""
mode = ImageReadMode.UNCHANGED
if num_channels == 1:
mode = ImageReadMode.GRAY
elif num_channels == 2:
mode = ImageReadMode.GRAY_ALPHA
elif num_channels == 3:
mode = ImageReadMode.RGB
elif num_channels == 4:
mode = ImageReadMode.RGB_ALPHA
return mode
@DeveloperAPI
def read_image_from_path(
path: str, num_channels: int | None = None, return_num_bytes=False
) -> torch.Tensor | None | tuple[torch.Tensor | None, int]:
"""Reads image from path.
Useful for reading from a small number of paths. For more intensive reads, use backend.read_binary_files instead. If
`return_num_bytes` is True, returns a tuple of (image, num_bytes).
"""
bytes_obj = get_bytes_obj_from_path(path)
image = read_image_from_bytes_obj(bytes_obj, num_channels)
if return_num_bytes:
if bytes_obj is not None:
num_bytes = len(bytes_obj)
else:
num_bytes = None
return image, num_bytes
else:
return image
@DeveloperAPI
def read_image_from_bytes_obj(bytes_obj: bytes | None = None, num_channels: int | None = None) -> torch.Tensor | None:
"""Tries to read an image as a tensor from a bytes object.
The bytes are first decoded with `read_image_as_png`, then as a numpy file, then as a TIFF file. If none of these
work, emits a warning and returns None.
"""
if bytes_obj is None:
return None
mode = get_image_read_mode_from_num_channels(num_channels)
image = read_image_as_png(bytes_obj, mode)
if image is None:
image = read_image_as_numpy(bytes_obj)
if image is None:
image = read_image_as_tif(bytes_obj)
if image is None:
warnings.warn("Unable to read image from bytes object.")
return image
@DeveloperAPI
def read_image_as_png(bytes_obj: bytes, mode: ImageReadMode = ImageReadMode.UNCHANGED) -> torch.Tensor | None:
"""Reads image from bytes object from a PNG file."""
try:
with BytesIO(bytes_obj) as buffer:
buffer_view = buffer.getbuffer()
if len(buffer_view) == 0:
del buffer_view
raise Exception("Bytes object is empty. This could be due to a failed load from storage.")
image = decode_image(torch.frombuffer(buffer_view, dtype=torch.uint8), mode=mode)
del buffer_view
return image
except Exception as e:
warnings.warn(f"Failed to read image from PNG file. Original exception: {e}")
return None
@DeveloperAPI
def read_image_as_numpy(bytes_obj: bytes) -> torch.Tensor | None:
"""Reads image from bytes object from a numpy file."""
try:
with BytesIO(bytes_obj) as buffer:
image = np.load(buffer)
return torch.from_numpy(image)
except Exception as e:
warnings.warn(f"Failed to read image from numpy file. Original exception: {e}")
return None
@DeveloperAPI
def read_image_as_tif(bytes_obj: bytes) -> torch.Tensor | None:
"""Reads image from bytes object from a tif file."""
try:
import tifffile
with BytesIO(bytes_obj) as buffer:
image = tifffile.imread(buffer)
if image.dtype == np.uint16:
image = image.astype(np.int32)
image = torch.from_numpy(image)
if len(image.shape) == 2:
image = torch.unsqueeze(image, dim=0)
return image
except Exception as e:
warnings.warn(f"Failed to read image from tif file. Original exception: {e}")
return None
@DeveloperAPI
def pad(
img: torch.Tensor,
new_size: int | tuple[int, int],
) -> torch.Tensor:
"""Torchscript-compatible implementation of pad.
Args:
img (torch.Tensor): image with shape [..., height, width] to pad
new_size (Union[int, Tuple[int, int]]): size to pad to. If int, pads to a square image of that size.
Returns:
torch.Tensor: padded image of size [..., size[0], size[1]] or [..., size, size] if size is int.
"""
new_size = to_tuple(new_size)
old_size = img.shape[-2:]
pad_size = (torch.tensor(new_size) - torch.tensor(old_size)) / 2
padding = torch.cat((torch.floor(pad_size), torch.ceil(pad_size)))
padding[padding < 0] = 0
padding = [int(x) for x in padding]
return F.pad(img, padding=padding, padding_mode="edge")
@DeveloperAPI
def crop(
img: torch.Tensor,
new_size: int | tuple[int, int],
) -> torch.Tensor:
"""Torchscript-compatible implementation of crop.
Args:
img (torch.Tensor): image with shape [..., height, width] to crop
new_size (Union[int, Tuple[int, int]]): size to crop to. If int, crops to square image of that size.
Returns:
torch.Tensor: cropped image of size [..., size[0], size[1]] or [..., size, size] if size is int.
"""
new_size = to_tuple(new_size)
return F.center_crop(img, output_size=new_size)
@DeveloperAPI
def crop_or_pad(img: torch.Tensor, new_size: int | tuple[int, int]):
"""Torchscript-compatible implementation of resize using constants.CROP_OR_PAD.
Args:
img (torch.Tensor): image with shape [..., height, width] to resize
new_size (Union[int, Tuple[int, int]]): size to resize to. If int, resizes to square image of that size.
Returns:
torch.Tensor: resized image of size [..., size[0], size[1]] or [..., size, size] if size is int.
"""
new_size = to_tuple(new_size)
if list(new_size) == list(img.shape[-2:]):
return img
img = pad(img, new_size)
img = crop(img, new_size)
return img
@DeveloperAPI
def resize_image(
img: torch.Tensor,
new_size: int | tuple[int, int],
resize_method: str,
crop_or_pad_constant: str = CROP_OR_PAD,
interpolate_constant: str = INTERPOLATE,
) -> torch.Tensor:
"""Torchscript-compatible implementation of resize.
Args:
img (torch.Tensor): image with shape [..., height, width] to resize
new_size (Union[int, Tuple[int, int]]): size to resize to. If int, resizes to square image of that size.
resize_method (str): method to use for resizing. Either constants.CROP_OR_PAD or constants.INTERPOLATE.
Returns:
torch.Tensor: resized image of size [..., size[0], size[1]] or [..., size, size] if size is int.
"""
new_size = to_tuple(new_size)
if list(img.shape[-2:]) != list(new_size):
if resize_method == crop_or_pad_constant:
return crop_or_pad(img, new_size)
elif resize_method == interpolate_constant:
return F.resize(img, new_size)
raise ValueError(f"Invalid image resize method: {resize_method}")
return img
@DeveloperAPI
def grayscale(img: torch.Tensor) -> torch.Tensor:
"""Grayscales RGB image."""
return F.rgb_to_grayscale(img)
@DeveloperAPI
def num_channels_in_image(img: torch.Tensor):
"""Returns number of channels in image."""
if img is None or img.ndim < 2:
raise ValueError("Invalid image data")
if img.ndim == 2:
return 1
else:
return img.shape[0]
@DeveloperAPI
def get_unique_channels(
image_sample: list[torch.Tensor],
num_channels: int,
num_classes: int = None,
) -> torch.Tensor:
"""Returns a tensor of unique channel values from a list of images.
Args:
image_sample: A list of images of dimensions [C x H x W] or [H x W], where C is the channel dimension
num_channels: The expected number of channels
num_classes: The expected number of classes or None
Return:
channel_class_map: A tensor mapping channel values to classes, where dim=0 is the class.
"""
n_images = 0
no_new_class = 0
channel_class_map = None
for img in image_sample:
if img.ndim < 2:
raise ValueError(f"Invalid image dimensions {img.ndim}")
if img.ndim == 2:
img = img.unsqueeze(0)
if num_channels == 1 and num_channels_in_image(img) != 1:
img = grayscale(img)
if num_classes == 2 and num_channels_in_image(img) == 1:
img = img.type(torch.float32) / 255
img = img.round() * 255
img = img.type(torch.uint8)
img = img.flatten(1, 2)
img = img.permute(1, 0)
uniq_chans = img.unique(dim=0)
if channel_class_map is None:
channel_class_map = uniq_chans
else:
channel_class_map = torch.concat((channel_class_map, uniq_chans)).unique(dim=0)
if channel_class_map.shape[0] > IMAGE_MAX_CLASSES:
raise ValueError(
f"Images inferred num classes {channel_class_map.shape[0]} exceeds " f"max classes {IMAGE_MAX_CLASSES}."
)
n_images += 1
if n_images % 25 == 0:
logger.info(f"Processed the first {n_images} images inferring {channel_class_map.shape[0]} classes...")
if channel_class_map.shape[0] == uniq_chans.shape[0]:
no_new_class += 1
if no_new_class >= 4 and channel_class_map.shape[0] == num_classes:
break # early loop exit
else:
no_new_class = 0
logger.info(f"Inferred {channel_class_map.shape[0]} classes from the first {n_images} images.")
return channel_class_map.type(torch.uint8)
@DeveloperAPI
def get_class_mask_from_image(
channel_class_map: torch.Tensor,
img: torch.Tensor,
) -> torch.Tensor:
"""Returns a masked image where each mask value is the channel class of the input.
Args:
channel_class_map: A tensor mapping channel values to classes, where dim=0 is the class.
img: An input image of dimensions [C x H x W] or [H x W], where C is the channel dimension
Return:
[mask] A masked image of dimensions [H x W] where each value is the channel class of the input
"""
num_classes = channel_class_map.shape[0]
mask = torch.full((img.shape[-2], img.shape[-1]), num_classes, dtype=torch.uint8)
if img.ndim == 2:
img = img.unsqueeze(0)
if num_classes == 2 and num_channels_in_image(img) == 1:
img = img.type(torch.float32) / 255
img = img.round() * 255
img = img.type(torch.uint8)
img = img.permute(1, 2, 0)
for nclass, value in enumerate(channel_class_map):
mask[(img == value).all(-1)] = nclass
if torch.any(mask.ge(num_classes)):
raise ValueError(
f"Image channel could not be mapped to a class because an unknown channel value was detected. "
f"{num_classes} classes were inferred from the first set of images. This image has a channel "
f"value that was not previously seen in the first set of images. Check preprocessing parameters "
f"for image resizing, num channels, num classes and num samples. Image resizing may affect "
f"channel values. "
)
return mask
@DeveloperAPI
def get_image_from_class_mask(
channel_class_map: torch.Tensor,
mask: np.ndarray,
) -> np.ndarray:
"""Returns an image with channel values determined from a corresponding mask.
Args:
channel_class_map: A tensor mapping channel values to classes, where dim=0 is the class.
mask: A masked image of dimensions [H x W] where each value is the channel class of the final image
Return:
[img] An image of dimensions [C x H x W], where C is the channel dimension
"""
mask = torch.from_numpy(mask)
img = torch.zeros(channel_class_map.shape[1], mask.shape[-2], mask.shape[-1], dtype=torch.uint8)
img = img.permute(1, 2, 0)
mask = mask.unsqueeze(0)
mask = mask.permute(1, 2, 0)
for nclass, value in enumerate(channel_class_map):
img[(mask == nclass).all(-1)] = value
img = img.permute(2, 0, 1)
return img.numpy()
@DeveloperAPI
def to_tuple(v: int | tuple[int, int]) -> tuple[int, int]:
"""Converts int or tuple to tuple of ints."""
if torch.jit.isinstance(v, int):
return v, v
else:
return v
@DeveloperAPI
def to_np_tuple(prop: int | Iterable) -> np.ndarray:
"""Creates a np array of length 2 from a Conv2D property.
E.g., stride=(2, 3) gets converted into np.array([2, 3]), where the height_stride = 2 and width_stride = 3. stride=2
gets converted into np.array([2, 2]).
"""
if isinstance(prop, int):
return np.ones(2).astype(int) * prop
elif isinstance(prop, np.ndarray) and prop.size == 2:
return prop.astype(int)
elif isinstance(prop, Iterable) and len(prop) == 2:
return np.array(list(prop)).astype(int)
else:
raise TypeError(f"prop must be int or iterable of length 2, but is {prop}.")
@DeveloperAPI
def get_img_output_shape(
img_height: int,
img_width: int,
kernel_size: int | tuple[int],
stride: int | tuple[int],
padding: int | tuple[int] | str,
dilation: int | tuple[int],
) -> tuple[int]:
"""Returns the height and width of an image after a 2D img op.
Currently supported for Conv2D, MaxPool2D and AvgPool2d ops.
"""
if padding == "same":
return (img_height, img_width)
elif padding == "valid":
padding = np.zeros(2)
else:
padding = to_np_tuple(padding)
kernel_size = to_np_tuple(kernel_size)
stride = to_np_tuple(stride)
dilation = to_np_tuple(dilation)
shape = np.array([img_height, img_width])
out_shape = np.floor(((shape + 2 * padding - dilation * (kernel_size - 1) - 1) / stride) + 1)
return tuple(out_shape.astype(int))
torchvision_model_registry = Registry()
def register_torchvision_model_variants(variants: list[TVModelVariant]):
def wrap(cls):
# prime with empty placeholder
torchvision_model_registry[cls.torchvision_model_type] = {}
# register each variant
for variant in variants:
torchvision_model_registry[cls.torchvision_model_type][variant.variant_id] = variant
return cls
return wrap
================================================
FILE: ludwig/utils/inference_utils.py
================================================
from datetime import datetime
import pandas as pd
import torch
from ludwig.constants import (
AUDIO,
BAG,
BINARY,
CATEGORY,
COLUMN,
DATE,
IMAGE,
NAME,
POSTPROCESSOR,
PREDICTOR,
PREPROCESSOR,
SEQUENCE,
SET,
TEXT,
TIMESERIES,
TYPE,
VECTOR,
)
from ludwig.types import FeatureConfigDict, ModelConfigDict
from ludwig.utils.audio_utils import read_audio_from_path
from ludwig.utils.date_utils import create_vector_from_datetime_obj
from ludwig.utils.image_utils import read_image_from_path
from ludwig.utils.torch_utils import place_on_device
from ludwig.utils.types import TorchDevice, TorchscriptPreprocessingInput
FEATURES_TO_CAST_AS_STRINGS = {BINARY, CATEGORY, BAG, SET, TEXT, SEQUENCE, TIMESERIES, VECTOR}
def get_filename_from_stage(stage: str, device: TorchDevice) -> str:
"""Returns the filename for a stage of inference."""
if stage not in [PREPROCESSOR, PREDICTOR, POSTPROCESSOR]:
raise ValueError(f"Invalid stage: {stage}.")
# device is only tracked for predictor stage
if stage == PREDICTOR:
return f"inference_{stage}-{device}.pt"
else:
return f"inference_{stage}.pt"
def to_inference_module_input_from_dataframe(
dataset: pd.DataFrame, config: ModelConfigDict, load_paths: bool = False, device: torch.device | None = None
) -> dict[str, TorchscriptPreprocessingInput]:
"""Converts a pandas DataFrame to be compatible with a torchscripted InferenceModule forward pass."""
inputs = {}
for if_config in config["input_features"]:
feature_inputs = to_inference_model_input_from_series(
dataset[if_config[COLUMN]],
if_config[TYPE],
load_paths=load_paths,
feature_config=if_config,
)
feature_inputs = place_on_device(feature_inputs, device)
inputs[if_config[NAME]] = feature_inputs
return inputs
def to_inference_model_input_from_series(
s: pd.Series, feature_type: str, load_paths: bool = False, feature_config: FeatureConfigDict | None = None
) -> TorchscriptPreprocessingInput:
"""Converts a pandas Series to be compatible with a torchscripted InferenceModule forward pass."""
if feature_type == IMAGE:
if load_paths:
return [read_image_from_path(v) if isinstance(v, str) else v for v in s]
elif feature_type == AUDIO:
if load_paths:
return [read_audio_from_path(v) if isinstance(v, str) else v for v in s]
elif feature_type == DATE:
if feature_config is None:
raise ValueError('"date" feature type requires the associated feature config to be provided.')
datetime_format = feature_config["preprocessing"]["datetime_format"]
return [torch.tensor(create_vector_from_datetime_obj(datetime.strptime(v, datetime_format))) for v in s]
elif feature_type in FEATURES_TO_CAST_AS_STRINGS:
return s.astype(str).to_list()
return torch.from_numpy(s.to_numpy())
================================================
FILE: ludwig/utils/llm_quantization_utils.py
================================================
import torch
from torch import nn
try:
from bitsandbytes.functional import dequantize_4bit
from bitsandbytes.nn.modules import Linear4bit
except Exception:
dequantize_4bit = None
Linear4bit = None
from ludwig.api_annotations import DeveloperAPI
@DeveloperAPI
def linear4bit_to_linear(linear4bit_layer):
"""Converts a Linear4Bit layer to a standard Linear layer by dequantizing the weight values and copying the
dequantized weights to a new Linear layer.
Args:
linear4bit_layer (Linear4bit): The input Linear4Bit layer.
Returns:
nn.Linear: A new Linear layer with dequantized weights and biases.
"""
# Create a new Linear layer with the same shape
new_linear_layer = nn.Linear(
linear4bit_layer.in_features,
linear4bit_layer.out_features,
bias=linear4bit_layer.bias is not None,
dtype=torch.float16,
)
# Dequantize the weight from the Linear4bit layer and copy it, together with the bias (which is copied
# as-is), into the preallocated parameters of the new Linear layer. Copying in place with `copy_` avoids
# allocating a second set of parameter tensors.
new_linear_layer.weight.data.copy_(
dequantize_4bit(linear4bit_layer.weight.data, linear4bit_layer.weight.quant_state)
)
if linear4bit_layer.bias is not None:
new_linear_layer.bias.data.copy_(linear4bit_layer.bias.data)
return new_linear_layer
@DeveloperAPI
def convert_quantized_linear_to_linear(module):
"""Recursively converts Linear4Bit layers to standard Linear layers in a given module.
Args:
module (nn.Module): The input module containing potentially nested Linear4Bit layers.
Returns:
None
"""
for name, child in module.named_children():
if isinstance(child, Linear4bit):
# Replace Linear4Bit layer with a new Linear layer
setattr(module, name, linear4bit_to_linear(child))
else:
# Recursively apply the conversion for nested modules
convert_quantized_linear_to_linear(child)
================================================
FILE: ludwig/utils/llm_utils.py
================================================
import copy
import logging
import tempfile
from typing import TYPE_CHECKING, Union
import torch
import torch.nn.functional as F
import transformers
from packaging import version
try:
from bitsandbytes.nn.modules import Embedding as BnbEmbedding
except Exception:
BnbEmbedding = None
from transformers import AutoConfig, AutoModelForCausalLM, PreTrainedModel, PreTrainedTokenizer, TextStreamer
from ludwig.constants import IGNORE_INDEX_TOKEN_ID, LOGITS, PREDICTIONS, PROBABILITIES
from ludwig.schema.trainer import LLMTrainerConfig
from ludwig.utils.error_handling_utils import default_retry
from ludwig.utils.logging_utils import log_once
from ludwig.utils.model_utils import find_embedding_layer_with_path
if TYPE_CHECKING:
from ludwig.schema.encoders.text_encoders import LLMEncoderConfig
from ludwig.schema.model_types.llm import LLMModelConfig
logger = logging.getLogger(__name__)
transformers_436 = version.parse(transformers.__version__) >= version.parse("4.36.0")
FALLBACK_CONTEXT_LEN = 2048
_MODELS_WITH_DEVICE_MAP_AUTO_EXCLUSION = set()
@default_retry(tries=8)
def load_pretrained_from_config(
config_obj: Union["LLMModelConfig", "LLMEncoderConfig"],
model_config: AutoConfig | None = None,
weights_save_path: str | None = None,
) -> PreTrainedModel:
load_kwargs = {}
if config_obj.quantization:
# Apply quantization configuration at model load time
load_kwargs["dtype"] = getattr(torch, config_obj.quantization.bnb_4bit_compute_dtype)
load_kwargs["quantization_config"] = config_obj.quantization.to_bitsandbytes()
load_kwargs["device_map"] = "auto"
if transformers_436:
load_kwargs["attn_implementation"] = "eager"
else:
# Load in float32 by default to avoid CUBLAS errors with small hidden sizes
# and to ensure numerical stability during training without mixed-precision.
load_kwargs["dtype"] = torch.float32
config_modified = False
if config_obj.model_parameters:
# Add any model specific parameters to the load kwargs
for param_name, param_value in config_obj.model_parameters.to_dict().items():
# Not all parameters are supported by all models, so we only add the parameter to the load kwargs
# if it is supported by the model.
if param_value is None:
continue
if hasattr(model_config, param_name):
if isinstance(param_value, dict):
# For nested dict params (e.g. rope_scaling), merge with existing
# config values to preserve defaults like rope_theta.
existing = getattr(model_config, param_name, {}) or {}
existing.update(param_value)
setattr(model_config, param_name, existing)
config_modified = True
else:
load_kwargs[param_name] = param_value
else:
logger.warning(f"Parameter {param_name} is not supported by {config_obj.base_model}. Skipping.")
# Only pass config= when we've directly modified it (e.g. rope_scaling merge).
if config_modified:
load_kwargs["config"] = model_config
logger.info("Loading large language model...")
pretrained_model_name_or_path = weights_save_path or config_obj.base_model
trust_remote_code = getattr(config_obj, "trust_remote_code", False)
model: PreTrainedModel = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **load_kwargs
)
return model
def to_device(
model: PreTrainedModel,
device: str | torch.DeviceObjType,
config_obj: "LLMModelConfig", # noqa F821
curr_device: torch.DeviceObjType,
) -> tuple[PreTrainedModel, torch.DeviceObjType]:
"""Move an LLM to the requested device, accounting for sharding and adapters.
Args:
model: Pretrained model to put on device
device: The device to move the model to
config_obj: LLM config
curr_device: The current device that the model is on
Returns:
Tuple of (`model` moved to `device`, `device` as a `torch.device`)
"""
device = torch.device(device)
if device.type == curr_device.type:
if log_once(f"Model already on device '{device}'."):
logger.info(f"Model already on device '{device}'.")
return model, device
else:
if log_once(f"Moving LLM from '{curr_device}' to '{device}'."):
logger.info(f"Moving LLM from '{curr_device}' to '{device}'.")
model_kwargs = {}
num_gpus = torch.cuda.device_count()
if device == torch.device("cuda") and num_gpus > 1:
# TODO: make this configurable in the future. These parameters are from FastChat:
# https://github.com/lm-sys/FastChat/blob/0e958b852a14f4bef5f0e9d7a5e7373477329cf2/fastchat/serve/inference.py#L90 # noqa
# TODO: Wrap device_map="auto" in a try-except block since it may not be supported for all models (E.g. BertLMHead) # noqa
# We don't add quantization here (float16 or bfloat16) since we may not always want to quantize. We should
# make quantization configurable in the future via the trainer config.
model_kwargs.update(
dict(
low_cpu_mem_usage=True,
max_memory={i: "13GiB" for i in range(num_gpus)},
)
)
if config_obj.base_model not in _MODELS_WITH_DEVICE_MAP_AUTO_EXCLUSION:
model_kwargs["device_map"] = "auto"
if config_obj.quantization:
model_kwargs["quantization_config"] = config_obj.quantization.to_bitsandbytes()
# we save and reload the weights to ensure that they can be sharded across the GPUs using `from_pretrained`
with tempfile.TemporaryDirectory() as tmpdir:
model.save_pretrained(tmpdir)
if config_obj.adapter:
model = AutoModelForCausalLM.from_pretrained(
config_obj.base_model,
trust_remote_code=getattr(config_obj, "trust_remote_code", False),
**model_kwargs,
)
# Leave this import inline to support a minimal install of Ludwig
from peft import PeftModel # noqa
model = PeftModel.from_pretrained(
model,
tmpdir,
torch_dtype=torch.float16,
)
else:
model = AutoModelForCausalLM.from_pretrained(
tmpdir,
trust_remote_code=getattr(config_obj, "trust_remote_code", False),
**model_kwargs,
)
else:
model = model.to(device)
return model, device
def _load_peft_config(pretrained_adapter_weights: str):
"""Load a PeftConfig, fixing known compatibility issues with newer PEFT versions."""
import json
from huggingface_hub import hf_hub_download
from peft import PeftConfig
config_file = hf_hub_download(pretrained_adapter_weights, "adapter_config.json")
with open(config_file) as f:
config_dict = json.load(f)
# AdaLoRA requires total_step > 0 in newer PEFT versions, but pretrained
# configs may have total_step=None.
if config_dict.get("peft_type") == "ADALORA" and not config_dict.get("total_step"):
config_dict["total_step"] = 10000
return PeftConfig.from_peft_type(**config_dict)
def initialize_adapter(
model: PreTrainedModel, config_obj: "LLMModelConfig" # noqa F821
) -> Union["PeftModel", PreTrainedModel]: # noqa F821
"""Wrap a pretrained model with a PEFT model for fine-tuning.
Args:
model: Pretrained model to fine-tune with an adapter.
config_obj: LLM config
Returns:
`model` wrapped in a PEFT model if an adapter config was provided, otherwise `model`.
"""
# Only load a PEFT model if the config specifies an adapter, otherwise return the model unaltered.
if config_obj.adapter:
if config_obj.adapter.pretrained_adapter_weights:
# Load pretrained adapter weights if specified.
logger.info(f"Using pretrained adapter weights: {config_obj.adapter.pretrained_adapter_weights}")
# Leave this import inline to support a minimal install of Ludwig
from peft import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PeftConfig # noqa
peft_config = _load_peft_config(config_obj.adapter.pretrained_adapter_weights)
model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type].from_pretrained(
model, config_obj.adapter.pretrained_adapter_weights, config=peft_config
)
else:
# Leave this import inline to support a minimal install of Ludwig
from peft import get_peft_model, TaskType # noqa
# If no pretrained adapter is provided, we want to load untrained weights into the model
peft_config = config_obj.adapter.to_config(
task_type=TaskType.CAUSAL_LM, tokenizer_name_or_path=config_obj.base_model
)
model = get_peft_model(model, peft_config)
return model
def get_context_len(model_config: AutoConfig):
"""Determines the maximum length of the context (input + output tokens) based on the provided model
configuration.
Args:
model_config (AutoConfig): The model configuration object containing information about the model's properties.
Returns:
int: The maximum context length, which can be derived from the model configuration. If no relevant attribute
is found, the default value of 2048 is returned.
This function examines the provided model configuration object to identify the attribute that specifies the maximum
context length. It checks for attributes in the following order of preference:
1. 'max_sequence_length': If this attribute is present in the model configuration, its value is returned.
2. 'max_position_embeddings': If 'max_sequence_length' is not found but 'max_position_embeddings' is present, its
value is returned.
3. 'n_positions': If neither 'max_sequence_length' nor 'max_position_embeddings' is found, and 'n_positions' is
present, its value is returned.
4. Default: If none of the relevant attributes are present, the function returns a default value of 2048.
Note:
- The maximum context length is important for defining the size of input and output sequences in a model.
Example Usage:
>>> config = AutoConfig.from_pretrained("bert-base-uncased")
>>> context_len = get_context_len(config)
>>> print(context_len)
512
"""
if hasattr(model_config, "max_sequence_length"):
return model_config.max_sequence_length
elif hasattr(model_config, "max_position_embeddings"):
return model_config.max_position_embeddings
elif hasattr(model_config, "n_positions"):
return model_config.n_positions
else:
return FALLBACK_CONTEXT_LEN
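# Illustrative sketch (not part of the Ludwig API): the attribute precedence described in the docstring above,
# reproduced with plain `SimpleNamespace` objects standing in for a model config. The helper name
# `context_len` and the sample values are assumptions made for this example only.

```python
from types import SimpleNamespace

FALLBACK = 2048  # mirrors FALLBACK_CONTEXT_LEN


def context_len(cfg) -> int:
    """Return the first context-length attribute found on `cfg`, in order of preference."""
    for attr in ("max_sequence_length", "max_position_embeddings", "n_positions"):
        if hasattr(cfg, attr):
            return getattr(cfg, attr)
    return FALLBACK


# max_position_embeddings wins over n_positions when both are present.
both = context_len(SimpleNamespace(max_position_embeddings=4096, n_positions=1024))
# Only n_positions is present.
n_positions_only = context_len(SimpleNamespace(n_positions=1024))
# No relevant attribute: the fallback is used.
empty = context_len(SimpleNamespace())
```

# Here `both` is 4096, `n_positions_only` is 1024 and `empty` is 2048.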
def has_padding_token(input_tensor: torch.Tensor, tokenizer: PreTrainedTokenizer):
"""Checks if the input tensor contains any padding tokens.
Args:
input_tensor (torch.Tensor): The input tensor.
tokenizer (PreTrainedTokenizer): The tokenizer used to encode the input.
Returns:
bool: True if the input tensor contains any padding tokens, False otherwise.
Example:
>>> import torch
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> input_sentence = "This is an example sentence."
>>> input_ids = torch.tensor(tokenizer.encode(input_sentence, add_special_tokens=True))
>>> padded_input_ids = torch.nn.functional.pad(input_ids, (0, 10 - len(input_ids)), value=tokenizer.pad_token_id)
>>> has_padding = has_padding_token(padded_input_ids, tokenizer)
>>> has_padding
True
"""
if input_tensor.dim() not in (1, 2):
raise ValueError("Input tensor must be 1D or 2D")
# Reduce over every element so that a 2D batch with more than one row also yields a single bool.
return torch.any(input_tensor == tokenizer.pad_token_id).item()
def remove_left_padding(input_ids_sample: torch.Tensor, tokenizer: PreTrainedTokenizer):
"""Removes left padding and other tokens until the first BOS token from the input_ids tensor.
Args:
input_ids_sample (torch.Tensor): The input tensor with padding and other tokens.
tokenizer (PreTrainedTokenizer): The tokenizer used to encode the input.
Returns:
torch.Tensor: The input tensor without left padding and other tokens until the first BOS token.
Example:
>>> import torch
>>> from types import SimpleNamespace
>>> tokenizer = SimpleNamespace(pad_token_id=0, bos_token_id=1)  # stand-in exposing only the ids used here
>>> padded_input_ids = torch.tensor([0, 0, 0, 1, 5, 6, 7])
>>> input_ids_no_padding = remove_left_padding(padded_input_ids, tokenizer)
>>> input_ids_no_padding
tensor([[1, 5, 6, 7]])
"""
# Drop everything up to and including the last PAD token
pad_idxs = torch.where(input_ids_sample == tokenizer.pad_token_id)[0] # all PAD token locations
input_ids_no_padding = input_ids_sample
if len(pad_idxs) != 0:
pad_idx = pad_idxs[-1] # get last PAD token location
input_ids_no_padding = input_ids_sample[pad_idx + 1 :]
# Start from the first BOS token
bos_idxs = torch.where(input_ids_no_padding == tokenizer.bos_token_id)[0] # all BOS token locations
if len(bos_idxs) != 0:
bos_idx = bos_idxs[0] # get first BOS token location
else:
bos_idx = 0
input_ids_no_bos = input_ids_no_padding[bos_idx:].unsqueeze(0)
return input_ids_no_bos
def add_left_padding(input_ids, max_length, pad_value=0):
"""Adds left padding to the input_ids tensor.
Args:
input_ids (torch.Tensor): The input tensor.
max_length (int): The maximum length of the tensor after padding.
pad_value (int, optional): The value used for padding. Defaults to 0.
Returns:
torch.Tensor: The input_ids tensor with left padding.
Example:
>>> input_ids = torch.tensor([1, 2, 3])
>>> max_length = 5
>>> padded_tensor = add_left_padding(input_ids, max_length)
>>> padded_tensor
tensor([0, 0, 1, 2, 3])
"""
padding = torch.tensor([pad_value] * (max_length - input_ids.shape[0]), dtype=torch.int64, device=input_ids.device)
return torch.cat((padding, input_ids), dim=-1)
def create_attention_mask(input_ids: torch.Tensor, tokenizer: PreTrainedTokenizer):
"""Creates an attention mask for the input_ids tensor. The last position is always marked as attended to (1),
even when it holds a padding token.
Args:
input_ids (torch.Tensor): The input tensor.
tokenizer (PreTrainedTokenizer): The tokenizer used to encode the input.
Returns:
torch.Tensor: The attention mask tensor.
Example:
>>> import torch  # noqa
>>> from types import SimpleNamespace
>>> tokenizer = SimpleNamespace(pad_token_id=0)  # stand-in exposing only the id used here
>>> input_ids = torch.tensor([0, 0, 1, 5, 6])
>>> attention_mask = create_attention_mask(input_ids, tokenizer)
>>> attention_mask
tensor([0, 0, 1, 1, 1])
"""
attention_mask = input_ids != tokenizer.pad_token_id
# The last position is always attended to, even when it holds a padding token.
if not attention_mask[-1]:
attention_mask[-1] = 1
attention_mask = attention_mask.to(torch.int64)
return attention_mask
def find_last_matching_index(tensor_a: torch.Tensor, tensor_b: torch.Tensor):
"""Returns the index in `tensor_a` at which a trailing match with `tensor_b` begins. Specifically, this checks
whether `tensor_b`, or a leading part of it, matches the end of `tensor_a`, looking only at the last
tensor_b.shape[0] elements of `tensor_a`.
Args:
tensor_a (torch.Tensor): The first tensor.
tensor_b (torch.Tensor): The second tensor.
Returns:
int: The index in `tensor_a` where the match starts. Returns -1 if there is no match.
Example:
>>> import torch
>>> tensor_a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8])
>>> tensor_b = torch.tensor([6, 7, 8])
>>> last_matching_index = find_last_matching_index(tensor_a, tensor_b)
>>> last_matching_index
5
"""
last_index = -1
tensor_a_length = tensor_a.shape[0]
tensor_b_length = tensor_b.shape[0]
# Get the last tensor_b_length elements of tensor_a.
tensor_a_truncated = tensor_a[-tensor_b_length:]
# Find the last matching index.
for i in range(tensor_b_length):
if torch.equal(tensor_a_truncated[i:], tensor_b[: tensor_b_length - i]):
last_index = tensor_a_length - tensor_b_length + i
break
return last_index
def pad_target_tensor_for_fine_tuning(
targets: dict[str, torch.Tensor],
predictions: dict[str, torch.Tensor],
model_inputs: torch.Tensor,
of_name: str,
) -> dict[str, torch.Tensor]:
"""Pad and adjust target tensors for fine-tuning LLMs.
This function is used to pad and adjust the target tensors with IGNORE_INDEX_TOKEN_ID based on the model inputs and
predictions during the fine-tuning process of Language Models. Here's what this function does:
1. If none of the tokens from the target were in the model inputs, we create a tensor of the length of model
inputs with value IGNORE_INDEX_TOKEN_IDs. This ignores this row from affecting loss.
2. If the target tokens were entirely inside the model inputs, we want to pad all the tokens in model_inputs
coming from the input with IGNORE_INDEX_TOKEN_IDs and leave the target tokens as is. This ensures that all
of the target tokens are used during loss computation.
3. In the scenario that only some part of the target tokens were in the model inputs, we want to pad the model
inputs until that point and only leave the partial tokens of the target as is. This ensures that we will
only compute loss on the target tokens that were in the model inputs.
Args:
targets (Dict[str, torch.Tensor]): A dictionary containing the target tensors.
predictions (Dict[str, torch.Tensor]): A dictionary containing the predicted tensors.
model_inputs (torch.Tensor): The input tensor passed into the model's forward pass.
of_name (str): The name of the target tensor to be padded and adjusted.
Returns:
Dict[str, torch.Tensor]: The `targets` dictionary with the tensor for `of_name` padded and aligned
to the prediction length.
"""
target_length = targets.get(of_name).size()[1]
prediction_length = predictions[of_name].get(PREDICTIONS).size()[1]
if target_length == prediction_length:
return targets
updated_targets = []
for idx, target in enumerate(targets[of_name]):
# Remove any leading IGNORE_INDEX_TOKEN_IDs in the target that were temporarily added for alignment
end_index = (target != IGNORE_INDEX_TOKEN_ID).nonzero()[0]
target = target[end_index:]
target_device = target.device
# See if any part of the target was in the tensor passed into the model's forward pass
last_matching_index = find_last_matching_index(model_inputs[idx], target)
# If the last matching index is -1, it means that the input tensor passed into the model was truncated
# and did not contain the target tensor. In this case, we need to truncate the target tensors as well
# and just set it to a tensor of IGNORE_INDEX_TOKEN_ID so that we don't compute loss on this target tensor.
if last_matching_index == -1:
updated_targets.append(torch.full((prediction_length,), IGNORE_INDEX_TOKEN_ID).to(device=target_device))
# If the last matching index is not -1, it means that the input tensor passed into the model was not
# truncated and contained either a part of the target tensor or the entire target tensor. In this case,
# we need to set the target tensor to the part of the target tensor that was passed into the model while
# also padding it to the correct length with IGNORE_INDEX_TOKEN_ID.
else:
padding = torch.full((last_matching_index,), IGNORE_INDEX_TOKEN_ID).to(device=target_device)
updated_targets.append(torch.cat((padding, target), dim=-1)[:prediction_length])
targets[of_name] = torch.stack(updated_targets).to(device=targets.get(of_name).device, dtype=torch.int64)
return targets
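# Illustrative sketch (not part of the Ludwig API): the three cases from the docstring of
# pad_target_tensor_for_fine_tuning, replayed on plain Python lists so they can be checked without torch.
# The helper names `overlap_start` and `pad_target`, the token ids, and the value -100 for
# IGNORE_INDEX_TOKEN_ID are assumptions made for this example.

```python
IGNORE = -100  # assumed value of IGNORE_INDEX_TOKEN_ID


def overlap_start(a: list[int], b: list[int]) -> int:
    """Index in `a` where `b`, or a leading part of `b`, matches the end of `a`; -1 if none."""
    tail = a[-len(b):]  # only the last len(b) elements of `a` can overlap
    for i in range(len(b)):
        # Compare the tail from offset i with a prefix of `b` of the same length.
        if tail[i:] == b[: len(b) - i]:
            return len(a) - len(b) + i
    return -1


def pad_target(model_input: list[int], target: list[int], length: int) -> list[int]:
    """Left-pad `target` with IGNORE so it lines up with its position in `model_input`."""
    start = overlap_start(model_input, target)
    if start == -1:
        # Case 1: no target token reached the model, so the whole row is ignored.
        return [IGNORE] * length
    # Cases 2 and 3: ignore the prompt positions, keep whatever part of the target fits.
    return ([IGNORE] * start + target)[:length]


full = pad_target([10, 11, 12, 20, 21], [20, 21], 5)     # whole target present
partial = pad_target([10, 11, 12, 13, 20], [20, 21], 5)  # target cut off by truncation
absent = pad_target([10, 11, 12, 13, 14], [20, 21], 5)   # target truncated away entirely
```

# With these inputs, `full` keeps both target tokens at positions 3 and 4, `partial` keeps only token 20 at
# position 4, and `absent` consists of IGNORE values only.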
def generate_merged_ids(
input_ids: torch.Tensor, target_ids: torch.Tensor, tokenizer: PreTrainedTokenizer, max_sequence_length: int | None = None
):
"""Generate merged input and target IDs tensor.
This function merges the input_ids and target_ids together to create a unified tensor
to pass into the model. It also returns attention masks for the merged tensors.
Args:
input_ids (torch.Tensor): The input IDs tensor.
target_ids (torch.Tensor or None): The target IDs tensor or None.
max_sequence_length (int or None): The maximum sequence length to pad or truncate to.
tokenizer (PreTrainedTokenizer): The tokenizer used to encode the input_ids and target_ids.
Returns:
torch.Tensor: The merged input and target IDs tensor.
torch.Tensor: The attention masks for the merged tensor.
"""
merged_input_and_targets = []
lengths = []
eos_tensor = torch.tensor([tokenizer.eos_token_id]).to(target_ids[0].device)
# Merge input_ids and target_ids by concatenating them together.
# We remove the left padding from both input_ids and target_ids before concatenating them.
for input_id_sample, target_id_sample in zip(input_ids, target_ids):
input_id_sample_no_padding = remove_left_padding(input_id_sample, tokenizer)[0]
target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, eos_tensor), dim=-1)
merged_sample_ids = torch.cat((input_id_sample_no_padding, target_id_sample_no_padding), dim=-1)
# If the merged tensor is longer than the maximum sequence length, we truncate it.
if max_sequence_length and merged_sample_ids.shape[0] > max_sequence_length:
merged_sample_ids = merged_sample_ids[:max_sequence_length]
merged_input_and_targets.append(merged_sample_ids)
lengths.append(merged_sample_ids.shape[0])
# Since we remove the left padding from the target_ids, the merged input_ids and target_ids
# may not have the same lengths. We need to align them to the same length by adding left padding
# and generate an attention mask for just the part of the input that is not padding.
max_length = max(lengths)
attention_masks = []
for i, merged_sample_ids in enumerate(merged_input_and_targets):
merged_input_and_targets[i] = add_left_padding(merged_sample_ids, max_length)
attention_masks.append(create_attention_mask(merged_input_and_targets[i], tokenizer))
return torch.stack(merged_input_and_targets), torch.stack(attention_masks)
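# Illustrative sketch (not part of the Ludwig API): the merge, truncate, left-pad and mask steps of
# generate_merged_ids, replayed on plain Python lists. The token ids PAD=0, BOS=1, EOS=2 and the helper
# names `strip_left` and `merge` are assumptions made for this example.

```python
PAD, BOS, EOS = 0, 1, 2  # hypothetical token ids


def strip_left(ids: list[int]) -> list[int]:
    """Drop everything up to the last PAD, then start at the first BOS if there is one."""
    if PAD in ids:
        last_pad = len(ids) - 1 - ids[::-1].index(PAD)
        ids = ids[last_pad + 1:]
    if BOS in ids:
        ids = ids[ids.index(BOS):]
    return ids


def merge(inputs, targets, max_len=None):
    merged = []
    for inp, tgt in zip(inputs, targets):
        # Prompt tokens, then target tokens, then a closing EOS.
        row = strip_left(inp) + strip_left(tgt) + [EOS]
        if max_len and len(row) > max_len:
            row = row[:max_len]
        merged.append(row)
    # Left-pad every row to the longest row and mask out the padding.
    width = max(len(row) for row in merged)
    padded = [[PAD] * (width - len(row)) + row for row in merged]
    masks = [[int(tok != PAD) for tok in row] for row in padded]
    for mask in masks:
        mask[-1] = 1  # the last position is always attended to
    return padded, masks


padded, masks = merge(
    inputs=[[0, 0, 0, 1, 5], [0, 1, 5, 6, 7]],
    targets=[[0, 8, 9], [0, 0, 8]],
)
```

# The first row merges to 5 tokens and the second to 6, so the first row receives one PAD on the left and a
# leading 0 in its attention mask.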
def _get_decoded_targets_and_predictions(
targets: dict[str, torch.Tensor],
predictions: dict[str, dict[str, torch.Tensor]],
tokenizer: PreTrainedTokenizer,
of_name: str,
):
"""Returns the decoded targets and predictions, accounting for IGNORE_INDEX_TOKEN_ID."""
target_tensor = targets[of_name]
pred_tensor = predictions[of_name][PREDICTIONS]
# Ensure targets and predictions are on the same device
if target_tensor.device != pred_tensor.device:
target_tensor = target_tensor.to(pred_tensor.device)
sanitized_targets = torch.where(target_tensor != IGNORE_INDEX_TOKEN_ID, target_tensor, tokenizer.pad_token_id)
sanitized_predictions = torch.where(
pred_tensor != IGNORE_INDEX_TOKEN_ID,
pred_tensor,
tokenizer.pad_token_id,
)
decoded_targets = tokenizer.batch_decode(sanitized_targets, skip_special_tokens=True)
decoded_predictions = tokenizer.batch_decode(sanitized_predictions, skip_special_tokens=True)
return decoded_targets, decoded_predictions
def get_realigned_target_and_prediction_tensors_for_inference(
targets: dict[str, torch.Tensor],
predictions: dict[str, dict[str, torch.Tensor]],
of_name: str,
tokenizer: PreTrainedTokenizer,
pad_value: int | None = None,
) -> tuple[dict[str, torch.Tensor], dict[str, dict[str, torch.Tensor]]]:
"""Realigns the target tensor with the predictions.
This is necessary for text metrics that require the target and prediction to be of the same length.
Args:
targets: The target tensor.
predictions: The prediction tensor.
of_name: The output feature's name.
tokenizer: The HF tokenizer.
pad_value: The token ID used to pad the shorter tensor on the right. Defaults to the tokenizer's pad
token ID, or its EOS token ID if no pad token is set.
Returns:
Tuple of realigned (targets, predictions).
- targets is a map of feature name -> tensor of token ids.
- predictions is a map from output feature name -> map of tensors with the following items:
- "predictions": tensor of token ids.
- "probabilities": tensor of probabilities.
- "logits": tensor of logits.
"""
target_length = targets.get(of_name).size()[1]
prediction_length = predictions[of_name].get(PREDICTIONS).size()[1]
if target_length == prediction_length:
return targets, predictions
if pad_value is None:
pad_value = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
zeros_to_add = abs(target_length - prediction_length)
# We don't want to modify the original targets and predictions tensors, so we create a copy of them.
_targets = copy.deepcopy(targets)
_predictions = copy.deepcopy(predictions)
# Align target and prediction tensors for text to text metric computation
if target_length > prediction_length:
# Pad the predictions.
_predictions[of_name][PREDICTIONS] = F.pad(
_predictions[of_name][PREDICTIONS], (0, zeros_to_add), value=pad_value
).to(torch.int64)
_predictions[of_name][PROBABILITIES] = F.pad(_predictions[of_name][PROBABILITIES], (0, 0, 0, zeros_to_add)).to(
torch.float32
)
_predictions[of_name][LOGITS] = F.pad(_predictions[of_name][LOGITS], (0, 0, 0, zeros_to_add)).to(torch.float32)
else:
_targets[of_name] = F.pad(_targets[of_name], (0, zeros_to_add), value=pad_value).to(torch.int64)
return _targets, _predictions
def update_embedding_layer(model: AutoModelForCausalLM, config_obj: LLMTrainerConfig) -> AutoModelForCausalLM:
"""Updates the embedding layer of the model to use the 8-bit embedding layer from bitsandbytes.nn.modules.
This is necessary when using 8-bit optimizers from bitsandbytes.
See: https://github.com/TimDettmers/bitsandbytes#tldr
"""
# If we're using an 8-bit optimizer, we need to replace the embedding layer with a custom embedding layer from
# bnb.nn.modules.Embedding.
if hasattr(config_obj, "optimizer") and config_obj.optimizer.is_8bit:
embedding_layer, module_path = find_embedding_layer_with_path(model)
if embedding_layer is None:
raise ValueError(
"Could not find an embedding layer in the model. This is required when using 8-bit optimizers"
" since a custom 8-bit embedding layer is used in place of the original embedding layer."
)
# Initialize the BNB embedding layer with the same parameters and weights as the original embedding layer.
bnb_embedding = BnbEmbedding(
num_embeddings=embedding_layer.num_embeddings,
embedding_dim=embedding_layer.embedding_dim,
padding_idx=embedding_layer.padding_idx,
max_norm=embedding_layer.max_norm,
norm_type=embedding_layer.norm_type,
scale_grad_by_freq=embedding_layer.scale_grad_by_freq,
sparse=embedding_layer.sparse,
_weight=embedding_layer.weight,
device=model.device,
)
# Update the model's original embedding layer to use the BNB embedding layer using the module_path
# returned by find_embedding_layer_with_path.
module_path = module_path.split(".")
module = model
for module_name in module_path[:-1]:
module = getattr(module, module_name)
setattr(module, module_path[-1], bnb_embedding)
# Set the get input embeddings lambda function to return the BNB embedding layer
model.get_input_embeddings = lambda: bnb_embedding
logger.info("Updated the pretrained embedding layer to use the embedding layer from bitsandbytes.")
return model
def create_text_streamer(tokenizer: PreTrainedTokenizer) -> TextStreamer:
"""Creates a TextStreamer object for streaming text to stdout during generation."""
return TextStreamer(tokenizer=tokenizer, skip_prompt=True)
================================================
FILE: ludwig/utils/logging_utils.py
================================================
_logged = set()
def log_once(key: str) -> bool:
"""Returns True if this is the "first" call for a given key.
Example:
if log_once("some_key"):
logger.info("Some verbose logging statement") # noqa
"""
if key not in _logged:
_logged.add(key)
return True
return False
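# Illustrative sketch: `log_once` only reports whether a key is new and emits nothing itself, so the caller
# is responsible for the actual logging call. The function is reproduced here so the snippet is
# self-contained; the logger name and message are placeholders.

```python
import logging

logger = logging.getLogger("log_once_demo")
_seen: set[str] = set()


def log_once(key: str) -> bool:
    """Return True only the first time `key` is seen."""
    if key not in _seen:
        _seen.add(key)
        return True
    return False


emitted_at = []
for step in range(3):
    # The guard is true on the first iteration only.
    if log_once("slow_path_warning"):
        logger.warning("Falling back to the slow path")
        emitted_at.append(step)
```

# After the loop, `emitted_at` is `[0]`: the warning is emitted on the first iteration and suppressed afterwards.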
================================================
FILE: ludwig/utils/loss_utils.py
================================================
import torch
def rmspe_loss(targets: torch.Tensor, predictions: torch.Tensor) -> torch.Tensor:
"""Root mean square percentage error.
Bad predictions can lead to arbitrarily large RMSPE values, especially if some values of targets are very close to
zero. We return a large value instead of inf when (some) targets are zero.
"""
epsilon = 1e-4
# add epsilon if targets are zero to avoid division by zero
denominator = targets + epsilon * (targets == 0).float()
loss = torch.sqrt(torch.mean(((targets - predictions).float() / denominator) ** 2))
return loss
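# Illustrative sketch: the RMSPE arithmetic above on plain Python floats, including the epsilon substitution
# for zero targets. The helper name `rmspe` and the sample numbers are assumptions made for this example.

```python
import math


def rmspe(targets: list[float], predictions: list[float], epsilon: float = 1e-4) -> float:
    """Root mean square percentage error, with zero targets replaced by epsilon."""
    total = 0.0
    for t, p in zip(targets, predictions):
        denominator = t if t != 0 else epsilon
        total += ((t - p) / denominator) ** 2
    return math.sqrt(total / len(targets))


# Both predictions are off by 10% of their target, so the RMSPE is about 0.1.
typical = rmspe([100.0, 200.0], [110.0, 180.0])
# A zero target with an absolute error of 1 contributes (1 / 1e-4) ** 2 = 1e8 to the sum.
with_zero = rmspe([0.0, 200.0], [1.0, 200.0])
```

# The second call shows why the docstring speaks of "a large value instead of inf": a single zero target
# pushes the result to roughly sqrt(1e8 / 2), a little over 7000.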
def mean_confidence_penalty(probabilities: torch.Tensor, num_classes: int) -> torch.Tensor:
max_entropy = torch.log(torch.tensor(num_classes))
# clipping needed for avoiding log(0) = -inf
entropy_per_class, _ = torch.max(-probabilities * torch.log(torch.clamp(probabilities, 1e-10, 1)), dim=0)
entropy = torch.sum(entropy_per_class, -1)
penalty = (max_entropy - entropy) / max_entropy
return torch.mean(penalty)
================================================
FILE: ludwig/utils/math_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import math
import numpy as np
def softmax(x, temperature=1.0):
e_x = np.exp((x - np.max(x)) / temperature)
return e_x / e_x.sum()
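# Illustrative sketch: the effect of `temperature` in the softmax above, written with the standard library
# instead of numpy. The helper name `softmax_list` is hypothetical.

```python
import math


def softmax_list(xs: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; subtracting the max keeps exp() from overflowing."""
    largest = max(xs)
    exps = [math.exp((x - largest) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]
sharp = softmax_list(scores, temperature=0.5)   # low temperature: mass concentrates on the max
base = softmax_list(scores)                     # temperature 1.0: plain softmax
flat = softmax_list(scores, temperature=10.0)   # high temperature: closer to uniform
```

# In all three cases the outputs sum to 1; the probability of the largest score shrinks as the temperature rises.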
def int_type(number):
if number <= np.iinfo(np.int8).max:
return np.int8
elif number <= np.iinfo(np.int16).max:
return np.int16
elif number <= np.iinfo(np.int32).max:
return np.int32
else: # if number <= np.iinfo(np.int64).max:
return np.int64
def convert_size(size_bytes):
if size_bytes == 0:
return "0B"
size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
i = int(math.floor(math.log(size_bytes, 1024)))
p = math.pow(1024, i)
s = round(size_bytes / p, 2)
return f"{s} {size_name[i]}"
def round2precision(val, precision: int = 0, which: str = ""):
assert precision >= 0
val *= 10**precision
round_callback = round
if which.lower() == "up":
round_callback = math.ceil
if which.lower() == "down":
round_callback = math.floor
return "{1:.{0}f}".format(precision, round_callback(val) / 10**precision)
def cumsum(x: list[int]) -> list[int]:
results = []
j = 0
for i in range(0, len(x)):
j += x[i]
results.append(j)
return results
================================================
FILE: ludwig/utils/metric_utils.py
================================================
from collections import defaultdict, namedtuple
import torch
from torch import Tensor
from torchmetrics.metric import Metric
from ludwig.constants import COMBINED, LOSS, NAME, TYPE
from ludwig.modules.metric_registry import get_metric_names_for_type
from ludwig.types import FeatureConfigDict
def sequence_mask(lengths: Tensor, maxlen: int | None = None, dtype=torch.bool) -> Tensor:
"""Implements tf.sequence_mask in torch.
From https://discuss.pytorch.org/t/pytorch-equivalent-for-tf-sequence-mask/39036/2.
"""
if maxlen is None:
maxlen = lengths.max()
row_vector = torch.arange(0, maxlen, 1).to(lengths.device)
matrix = torch.unsqueeze(lengths, dim=-1)
mask = row_vector < matrix
return mask.type(dtype)
def dynamic_partition(data: Tensor, partitions: Tensor, num_partitions: int) -> list[Tensor]:
"""Implements tf.dynamic_partition in torch.
From https://discuss.pytorch.org/t/equivalent-of-tf-dynamic-partition/53735.
"""
assert data.size() == partitions.size()
# Flatten data into 1D vectors to do partitioning correctly.
data = data.view(-1)
partitions = partitions.view(-1)
result = []
for i in range(num_partitions):
result += [data[(partitions == i).nonzero().squeeze(1)]]
return result
def masked_correct_predictions(targets: Tensor, preds: Tensor, targets_sequence_lengths: Tensor) -> Tensor:
"""Masks out special symbols, and returns tensor of correct predictions.
Args:
targets: 2D tensor [batch_size, sequence_length]
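preds: 2D tensor [batch_size, sequence_length]
targets_sequence_lengths: 1D tensor [batch_size] holding the unpadded length of each target sequence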
Returns:
1D tensor of all correct predictions.
"""
correct_preds = preds == targets
mask = sequence_mask(lengths=targets_sequence_lengths, maxlen=correct_preds.shape[1], dtype=torch.int32)
_, masked_correct_preds = dynamic_partition(data=correct_preds, partitions=mask, num_partitions=2)
return masked_correct_preds.type(torch.float32)
def get_scalar_from_ludwig_metric(metric: Metric) -> float:
"""Returns the scalar value of a Ludwig metric.
Params:
metric: Metric object
Returns:
float: scalar value of the metric
"""
return metric.compute().detach().cpu().numpy().item()
# Data for training and evaluation metrics.
TrainerMetric = namedtuple("TrainerMetric", ("epoch", "step", "value"))
def reduce_trainer_metrics_dict(
dict_dict_trainer_metrics: dict[str, dict[str, list[TrainerMetric]]],
) -> dict[str, dict[str, list[float]]]:
"""Reduces Dict[feature_name, Dict[metric_name, List[TrainerMetric]]] to Dict[feature_name, Dict[metric_name,
List[float]]].
Used for flattening the results returned by trainer.py::train(), which come from ProgressTracker.
"""
flattened_dict = defaultdict(lambda: defaultdict(list))
for feature_name, trainer_metric_dict in dict_dict_trainer_metrics.items():
for metric_name, trainer_metrics in trainer_metric_dict.items():
for trainer_metric in trainer_metrics:
flattened_dict[feature_name][metric_name].append(trainer_metric[-1])
# Convert defaultdict to dict so JSON serialization works with dataclasses.asdict().
return {k: dict(v) for k, v in flattened_dict.items()}
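# A minimal standalone sketch of the flattening described in reduce_trainer_metrics_dict, assuming
# the same (epoch, step, value) namedtuple. The function name flatten_trainer_metrics and the sample
# history are made up for illustration. Unlike the defaultdict-based version above, this sketch also
# keeps features whose metric dict is empty.

```python
from collections import namedtuple

TrainerMetric = namedtuple("TrainerMetric", ("epoch", "step", "value"))


def flatten_trainer_metrics(nested):
    """Keep only the .value field of every TrainerMetric, preserving the nesting."""
    return {
        feature_name: {
            metric_name: [metric.value for metric in metrics]
            for metric_name, metrics in metric_dict.items()
        }
        for feature_name, metric_dict in nested.items()
    }


history = {
    "label": {
        "accuracy": [TrainerMetric(0, 10, 0.5), TrainerMetric(1, 20, 0.75)],
        "loss": [TrainerMetric(0, 10, 0.9), TrainerMetric(1, 20, 0.6)],
    }
}
flat = flatten_trainer_metrics(history)
print(flat)  # {'label': {'accuracy': [0.5, 0.75], 'loss': [0.9, 0.6]}}
```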
def get_metric_names(output_features: dict[str, "OutputFeature"]) -> dict[str, list[str]]: # noqa
"""Returns a dict of output_feature_name -> list of metric names."""
metrics_names = {}
for output_feature_name, output_feature in output_features.items():
metrics_names[output_feature_name] = sorted(list(get_metric_names_for_type(output_feature.type())))
# Add combined loss.
metrics_names[COMBINED] = [LOSS]
return metrics_names
def get_feature_to_metric_names_map(output_features: list[FeatureConfigDict]) -> dict[str, list[str]]:
"""Returns a dict of output_feature_name -> list of metric names."""
metrics_names = {}
for output_feature in output_features:
output_feature_name = output_feature[NAME]
output_feature_type = output_feature[TYPE]
metrics_names[output_feature_name] = get_metric_names_for_type(output_feature_type)
metrics_names[COMBINED] = [LOSS]
return metrics_names
def get_feature_to_metric_names_map_from_feature_collection(
output_features: "FeatureCollection", # noqa
) -> dict[str, list[str]]:
"""Returns a dict of output_feature_name -> list of metric names."""
metrics_names = {
output_feature.name: get_metric_names_for_type(output_feature.type) for output_feature in output_features
}
metrics_names[COMBINED] = [LOSS]
return metrics_names
================================================
FILE: ludwig/utils/metrics_printed_table.py
================================================
import logging
from tabulate import tabulate
from ludwig.constants import COMBINED, LOSS
from ludwig.utils.metric_utils import TrainerMetric
logger = logging.getLogger(__name__)
def get_metric_value_or_empty(metrics_log: dict[str, list[TrainerMetric]], metric_name: str):
"""Returns the metric value if it exists or empty."""
if metric_name not in metrics_log:
return ""
return metrics_log[metric_name][-1][-1]
def print_table_for_single_output_feature(
train_metrics_log: dict[str, list[TrainerMetric]],
validation_metrics_log: dict[str, list[TrainerMetric]],
test_metrics_log: dict[str, list[TrainerMetric]],
combined_loss_for_each_split: list[float],
) -> None:
"""Prints the metrics table for a single output feature.
Args:
train_metrics_log: Dict from metric name to list of TrainerMetric.
validation_metrics_log: Dict from metric name to list of TrainerMetric.
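test_metrics_log: Dict from metric name to list of TrainerMetric.
combined_loss_for_each_split: Combined loss values for the train, validation, and test splits, in that order.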
"""
# Get the superset of metric names across all splits.
all_metric_names = set()
all_metric_names.update(train_metrics_log.keys())
all_metric_names.update(validation_metrics_log.keys())
all_metric_names.update(test_metrics_log.keys())
all_metric_names = sorted(list(all_metric_names))
# Assemble the printed table.
# Each item in the printed_table corresponds to a row in the printed table.
printed_table = [["train", "validation", "test"]]
for metric_name in all_metric_names:
metrics_for_each_split = [
get_metric_value_or_empty(train_metrics_log, metric_name),
get_metric_value_or_empty(validation_metrics_log, metric_name),
get_metric_value_or_empty(test_metrics_log, metric_name),
]
printed_table.append([metric_name] + metrics_for_each_split)
# Add combined loss.
printed_table.append(["combined_loss"] + combined_loss_for_each_split)
logger.info(tabulate(printed_table, headers="firstrow", tablefmt="fancy_grid", floatfmt=".4f"))
def print_metrics_table(
output_features: dict[str, "OutputFeature"], # noqa
train_metrics_log: dict[str, dict[str, list[TrainerMetric]]],
validation_metrics_log: dict[str, dict[str, list[TrainerMetric]]],
test_metrics_log: dict[str, dict[str, list[TrainerMetric]]],
):
"""Prints a metrics table for each output feature, with one column per split.
Example:
╒═══════════════╤═════════╤══════════════╤════════╕
│ │ train │ validation │ test │
╞═══════════════╪═════════╪══════════════╪════════╡
│ accuracy │ 0.8157 │ 0.6966 │ 0.8090 │
├───────────────┼─────────┼──────────────┼────────┤
│ loss │ 0.4619 │ 0.5039 │ 0.4488 │
├───────────────┼─────────┼──────────────┼────────┤
│ precision │ 0.8274 │ 0.6250 │ 0.7818 │
├───────────────┼─────────┼──────────────┼────────┤
│ recall │ 0.6680 │ 0.4545 │ 0.6615 │
├───────────────┼─────────┼──────────────┼────────┤
│ roc_auc │ 0.8471 │ 0.7706 │ 0.8592 │
├───────────────┼─────────┼──────────────┼────────┤
│ specificity │ 0.9105 │ 0.8393 │ 0.8938 │
├───────────────┼─────────┼──────────────┼────────┤
│ combined_loss │ 0.4619 │ 0.5039 │ 0.4488 │
╘═══════════════╧═════════╧══════════════╧════════╛
"""
# Obtain the combined loss, which is the same across all output features.
combined_loss_for_each_split = [
get_metric_value_or_empty(train_metrics_log[COMBINED], LOSS),
get_metric_value_or_empty(validation_metrics_log[COMBINED], LOSS),
get_metric_value_or_empty(test_metrics_log[COMBINED], LOSS),
]
for output_feature_name in sorted(output_features.keys()):
if output_feature_name == COMBINED:
# Skip the combined output feature. The combined loss will be added to each output feature's table.
continue
print_table_for_single_output_feature(
train_metrics_log[output_feature_name],
validation_metrics_log[output_feature_name],
test_metrics_log[output_feature_name],
combined_loss_for_each_split,
)
================================================
FILE: ludwig/utils/misc_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import copy
import functools
import os
import random
import subprocess
import weakref
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any, TYPE_CHECKING
import numpy
import torch
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import PROC_COLUMN
from ludwig.globals import DESCRIPTION_FILE_NAME, MODEL_FILE_NAME
from ludwig.utils import fs_utils
from ludwig.utils.fs_utils import find_non_existing_dir_by_adding_suffix
if TYPE_CHECKING:
from ludwig.schema.model_types.base import ModelConfig
@DeveloperAPI
def set_random_seed(random_seed):
os.environ["PYTHONHASHSEED"] = str(random_seed)
random.seed(random_seed)
numpy.random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available() and torch.cuda.device_count() > 0:
torch.cuda.manual_seed(random_seed)
@DeveloperAPI
def merge_dict(dct, merge_dct):
"""Recursive dict merge. Inspired by :meth:``dict.update()``, but instead of updating only top-level keys,
merge_dict recurses down into dicts nested to an arbitrary depth, updating keys. The ``merge_dct`` is merged
into a deep copy of ``dct``; neither input is modified.
:param dct: dict onto which the merge is executed
:param merge_dct: dict merged into dct
:return: a new dict containing the merged result
"""
dct = copy.deepcopy(dct)
for k, v in merge_dct.items():
if k in dct and isinstance(dct[k], dict) and isinstance(merge_dct[k], Mapping):
dct[k] = merge_dict(dct[k], merge_dct[k])
else:
dct[k] = merge_dct[k]
return dct
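# A minimal standalone sketch of the recursive merge described above, showing that nested keys are
# merged rather than replaced wholesale and that the first argument is left untouched. The name
# merge_dict_sketch and the sample config values are hypothetical.

```python
import copy
from collections.abc import Mapping


def merge_dict_sketch(dct, merge_dct):
    """Return a new dict with merge_dct merged recursively into a deep copy of dct."""
    merged = copy.deepcopy(dct)
    for key, value in merge_dct.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, Mapping):
            # Both sides are dict-like: merge their contents instead of overwriting.
            merged[key] = merge_dict_sketch(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {"trainer": {"epochs": 10, "batch_size": 32}, "name": "run"}
override = {"trainer": {"epochs": 5}}
merged = merge_dict_sketch(base, override)
print(merged)  # {'trainer': {'epochs': 5, 'batch_size': 32}, 'name': 'run'}
print(base["trainer"]["epochs"])  # 10, the original is unchanged
```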
@DeveloperAPI
def sum_dicts(dicts, dict_type=dict):
summed_dict = dict_type()
for d in dicts:
for key, value in d.items():
if key in summed_dict:
prev_value = summed_dict[key]
if isinstance(value, (dict, OrderedDict)):
summed_dict[key] = sum_dicts([prev_value, value], dict_type=type(value))
elif isinstance(value, numpy.ndarray):
summed_dict[key] = numpy.concatenate((prev_value, value))
else:
summed_dict[key] = prev_value + value
else:
summed_dict[key] = value
return summed_dict
@DeveloperAPI
def get_from_registry(key, registry):
if hasattr(key, "lower"):
key = key.lower()
if key in registry:
return registry[key]
else:
raise ValueError(f"Key '{key}' not in registry, available options: {registry.keys()}")
@DeveloperAPI
def set_default_value(dictionary, key, value):
if key not in dictionary:
dictionary[key] = value
@DeveloperAPI
def set_default_values(dictionary: dict, default_value_dictionary: dict):
"""Sets multiple default values recursively for various areas of the config. Using the helper function
set_default_value, it descends into nested dictionaries, only setting values for parameters that have not
already been defined by the user.
Args:
dictionary (dict): The dictionary to set default values for, generally a section of the config.
default_value_dictionary (dict): The dictionary containing the default values for the config.
"""
for key, value in default_value_dictionary.items():
if key not in dictionary: # Event where the key is not in the dictionary yet
dictionary[key] = value
elif value == {}: # Event where dict is empty
set_default_value(dictionary, key, value)
elif isinstance(value, dict) and value: # Event where dictionary is nested - recursive call
set_default_values(dictionary[key], value)
else:
set_default_value(dictionary, key, value)
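# A minimal standalone sketch of the recursive default-filling behavior described above: keys the
# user already set are kept, missing keys are filled in, and nested dicts are descended into. The
# name fill_defaults and the sample values are hypothetical; like the function above, it mutates
# its first argument in place.

```python
def fill_defaults(dictionary: dict, defaults: dict) -> None:
    """Fill in keys missing from dictionary using defaults, recursing into nested dicts."""
    for key, value in defaults.items():
        if key not in dictionary:
            dictionary[key] = value
        elif isinstance(value, dict) and value:
            fill_defaults(dictionary[key], value)
        # Otherwise the user already provided a value, so it is left as-is.


user_config = {"trainer": {"epochs": 5}}
default_config = {"trainer": {"epochs": 100, "learning_rate": 0.001}, "backend": {"type": "local"}}
fill_defaults(user_config, default_config)
print(user_config)
# {'trainer': {'epochs': 5, 'learning_rate': 0.001}, 'backend': {'type': 'local'}}
```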
@DeveloperAPI
def get_class_attributes(c):
return {i for i in dir(c) if not callable(getattr(c, i)) and not i.startswith("_")}
@DeveloperAPI
def get_output_directory(output_directory, experiment_name, model_name="run"):
base_dir_name = os.path.join(output_directory, experiment_name + ("_" if model_name else "") + (model_name or ""))
return fs_utils.abspath(find_non_existing_dir_by_adding_suffix(base_dir_name))
@DeveloperAPI
def get_file_names(output_directory):
description_fn = os.path.join(output_directory, DESCRIPTION_FILE_NAME)
training_stats_fn = os.path.join(output_directory, "training_statistics.json")
model_dir = os.path.join(output_directory, MODEL_FILE_NAME)
return description_fn, training_stats_fn, model_dir
@DeveloperAPI
def get_combined_features(config):
return config["input_features"] + config["output_features"]
@DeveloperAPI
def get_proc_features(config):
return get_proc_features_from_lists(config["input_features"], config["output_features"])
@DeveloperAPI
def get_proc_features_from_lists(*args):
return {feature[PROC_COLUMN]: feature for features in args for feature in features}
@DeveloperAPI
def set_saved_weights_in_checkpoint_flag(config_obj: "ModelConfig"):
"""Adds a flag to all input feature encoder configs indicating that the weights are saved in the checkpoint.
Next time the model is loaded we will restore pre-trained encoder weights from ludwig model (and not load from cache
or model hub).
"""
for input_feature in config_obj.input_features:
encoder_obj = input_feature.encoder
encoder_obj.saved_weights_in_checkpoint = True
@DeveloperAPI
def remove_empty_lines(str):
return "\n".join([line.rstrip() for line in str.split("\n") if line.rstrip()])
@DeveloperAPI
def memoized_method(*lru_args, **lru_kwargs):
def decorator(func):
@functools.wraps(func)
def wrapped_func(self, *args, **kwargs):
# We're storing the wrapped method inside the instance. If we had
# a strong reference to self the instance would never die.
self_weak = weakref.ref(self)
@functools.wraps(func)
@functools.lru_cache(*lru_args, **lru_kwargs)
def cached_method(*args, **kwargs):
return func(self_weak(), *args, **kwargs)
setattr(self, func.__name__, cached_method)
return cached_method(*args, **kwargs)
return wrapped_func
return decorator
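# A standalone usage sketch for memoized_method. The decorator is reproduced so the example runs on
# its own; the Squarer class and its call counter are hypothetical. The point it illustrates: the
# cache lives on the instance, so repeated calls with the same argument do not re-run the method.

```python
import functools
import weakref


def memoized_method(*lru_args, **lru_kwargs):
    def decorator(func):
        @functools.wraps(func)
        def wrapped_func(self, *args, **kwargs):
            # Hold only a weak reference so the per-instance cache does not keep self alive.
            self_weak = weakref.ref(self)

            @functools.wraps(func)
            @functools.lru_cache(*lru_args, **lru_kwargs)
            def cached_method(*args, **kwargs):
                return func(self_weak(), *args, **kwargs)

            # Replace the method on this instance with its cached version.
            setattr(self, func.__name__, cached_method)
            return cached_method(*args, **kwargs)

        return wrapped_func

    return decorator


class Squarer:
    def __init__(self):
        self.calls = 0

    @memoized_method(maxsize=None)
    def square(self, n):
        self.calls += 1
        return n * n


squarer = Squarer()
results = [squarer.square(3), squarer.square(3), squarer.square(4)]
print(results, squarer.calls)  # [9, 9, 16] 2
```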
@DeveloperAPI
def get_commit_hash():
"""If Ludwig is run from a git repository, get the commit hash of the current HEAD.
Returns None if git is not executable in the current environment or Ludwig is not run in a git repo.
"""
try:
with open(os.devnull, "w") as devnull:
is_a_git_repo = subprocess.call(["git", "branch"], stderr=subprocess.STDOUT, stdout=devnull) == 0
if is_a_git_repo:
commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("utf-8")
return commit_hash
except: # noqa: E722
pass
return None
@DeveloperAPI
def scrub_creds(config_dict: dict[str, Any]) -> dict[str, Any]:
"""Scrubs all sensitive fields from a config dict. The dict is modified in place and returned (no copy is made)."""
if config_dict.get("backend", {}) and "credentials" in config_dict.get("backend", {}):
config_dict["backend"]["credentials"] = {}
return config_dict
================================================
FILE: ludwig/utils/model_utils.py
================================================
import logging
from collections import OrderedDict
import numpy as np
import torch
logger = logging.getLogger(__name__)
NUMPY_TO_TORCH_DTYPE = {
bool: torch.bool,
np.bool_: torch.bool,
np.uint8: torch.uint8,
np.int8: torch.int8,
np.int16: torch.int16,
np.int32: torch.int32,
np.int64: torch.int64,
np.float16: torch.float16,
np.float32: torch.float32,
np.float64: torch.float64,
np.complex64: torch.complex64,
np.complex128: torch.complex128,
}
def extract_tensors(model: torch.nn.Module) -> tuple[torch.nn.Module, list[dict]]:
"""Remove the tensors from a PyTorch model, convert them to NumPy arrays, and return the stripped model and
tensors.
Reference implementation: https://medium.com/ibm-data-ai/how-to-load-pytorch-models-340-times-faster-with-ray-8be751a6944c  # noqa
"""
tensors = []
for _, module in model.named_modules():
# Store the tensors as numpy arrays in Python dictionaries
# Delete the same tensors since we no longer need them and we want to reduce memory pressure.
# This ensures that throughout this process, we keep memory nearly linear w.r.t model parameters.
params = OrderedDict()
buffers = OrderedDict()
for name, param in module.named_parameters(recurse=False):
params[name] = torch.clone(param).detach().numpy()
del param
for name, buf in module.named_buffers(recurse=False):
buffers[name] = torch.clone(buf).detach().numpy()
del buf
tensors.append({"params": params, "buffers": buffers})
# Strip all tensors and buffers out of the original model.
for _, module in model.named_modules():
for name in [name for name, _ in module.named_parameters(recurse=False)] + [
name for name, _ in module.named_buffers(recurse=False)
]:
setattr(module, name, None)
return model, tensors
def replace_tensors(m: torch.nn.Module, tensors: list[dict], device: torch.device):
"""Restore the tensors that extract_tensors() stripped out of a PyTorch model. This operation is performed in
place.
Reference implementation: https://medium.com/ibm-data-ai/how-to-load-pytorch-models-340-times-faster-with-ray-8be751a6944c  # noqa
"""
modules = [module for _, module in m.named_modules()]
for module, tensor_dict in zip(modules, tensors):
# There are separate APIs to set parameters and buffers.
for name, array in tensor_dict["params"].items():
module.register_parameter(
name,
torch.nn.Parameter(torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype))),
)
for name, array in tensor_dict["buffers"].items():
module.register_buffer(
name,
torch.as_tensor(array, device=device, dtype=NUMPY_TO_TORCH_DTYPE.get(array.dtype)),
)
def find_embedding_layer_with_path(module, module_names=[]):
"""Recursively search through a module to find an embedding layer and its module path.
Returns a tuple containing the embedding layer and its module path.
"""
for name, child_module in module.named_children():
if isinstance(child_module, torch.nn.Embedding):
# If an embedding layer is found, return it along with the module path
return child_module, ".".join(module_names + [name])
else:
# Recursively search in the child module and update the module_names list
found, path = find_embedding_layer_with_path(child_module, module_names + [name])
if found is not None:
return found, path
return None, None
def contains_nan_or_inf_tensors(module: torch.nn.Module) -> bool:
"""Check for NaN or infinity (inf) values in the tensors (parameters and buffers) of a PyTorch module. This
function recursively inspects the module's parameters and buffers to identify NaN or inf values. It is designed
to ensure the numerical stability of the model by detecting any irregularities in the tensor values.
Parameters:
module (torch.nn.Module): The PyTorch module to check for NaN or inf values.
Returns:
bool: Returns True if any NaN or inf values are found in the module's tensors. Otherwise, returns False.
"""
for name, param in module.named_parameters():
if param.requires_grad and (torch.isnan(param).any() or torch.isinf(param).any()):
logger.info(f"Found NaN or inf values in parameter '{name}' of module '{module.__class__.__name__}'")
return True
for name, buffer in module.named_buffers():
if torch.isnan(buffer).any() or torch.isinf(buffer).any():
logger.info(f"Found NaN or inf values in buffer '{name}' of module '{module.__class__.__name__}'")
return True
for name, submodule in module.named_children():
if contains_nan_or_inf_tensors(submodule):
logger.info(f"Found NaN or inf values in submodule '{name}' of module '{module.__class__.__name__}'")
return True
return False
================================================
FILE: ludwig/utils/nlp_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import sys
logger = logging.getLogger(__name__)
nlp_pipelines = {
"en": None,
"it": None,
"es": None,
"de": None,
"fr": None,
"pt": None,
"nl": None,
"el": None,
"nb": None,
"lt": None,
"da": None,
"pl": None,
"ro": None,
"ja": None,
"zh": None,
"xx": None,
}
language_module_registry = {
"en": "en_core_web_sm",
"it": "it_core_news_sm",
"es": "es_core_news_sm",
"de": "de_core_news_sm",
"fr": "fr_core_news_sm",
"pt": "pt_core_news_sm",
"nl": "nl_core_news_sm",
"el": "el_core_news_sm",
"nb": "nb_core_news_sm",
"lt": "lt_core_news_sm",
"da": "da_core_news_sm",
"pl": "pl_core_news_sm",
"ro": "ro_core_news_sm",
"ja": "ja_core_news_sm",
"zh": "zh_core_web_sm",
"xx": "xx_ent_wiki_sm",
}
default_characters = [
" ",
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h",
"i",
"j",
"k",
"l",
"m",
"n",
"o",
"p",
"q",
"r",
"s",
"t",
"u",
"v",
"w",
"x",
"y",
"z",
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"-",
",",
";",
".",
"!",
"?",
":",
"'",
"'",
"/",
"\\",
"|",
"_",
"@",
"#",
"$",
"%",
"^",
"&",
"*",
"~",
"`",
"+",
"-",
"=",
"<",
">",
"(",
")",
"[",
"]",
"{",
"}",
]
punctuation = {".", ",", "@", "$", "%", "/", ":", ";", "+", "="}
def load_nlp_pipeline(language="xx"):
if language not in language_module_registry:
logger.error(
"Language {} is not supported. "
"Supported languages are: {}".format(language, language_module_registry.keys())
)
raise ValueError
else:
spacy_module_name = language_module_registry[language]
if nlp_pipelines[language] is None:
logger.info("Loading NLP pipeline")
try:
import spacy
except ImportError:
logger.error(
" spacy is not installed. "
"In order to install all text feature dependencies run "
"pip install ludwig[text]"
)
sys.exit(-1)
try:
nlp_pipelines[language] = spacy.load(spacy_module_name, disable=["parser", "tagger", "ner"])
except OSError:
logger.info(f" spaCy {spacy_module_name} model is missing, downloading it (this will only happen once)")
from spacy.cli import download
download(spacy_module_name)
nlp_pipelines[language] = spacy.load(spacy_module_name, disable=["parser", "tagger", "ner"])
return nlp_pipelines[language]
def pass_filters(
token, filter_numbers=False, filter_punctuation=False, filter_short_tokens=False, filter_stopwords=False
):
passes_filters = True
if filter_numbers:
passes_filters = not token.like_num
if passes_filters and filter_punctuation:
passes_filters = not bool(set(token.orth_) & punctuation)
if passes_filters and filter_short_tokens:
passes_filters = len(token) > 2
if passes_filters and filter_stopwords:
passes_filters = not token.is_stop
return passes_filters
def process_text(
text,
nlp_pipeline,
return_lemma=False,
filter_numbers=False,
filter_punctuation=False,
filter_short_tokens=False,
filter_stopwords=False,
):
doc = nlp_pipeline(text)
return [
token.lemma_ if return_lemma else token.text
for token in doc
if pass_filters(token, filter_numbers, filter_punctuation, filter_short_tokens, filter_stopwords)
]
if __name__ == "__main__":
text = (
"Hello John, how are you doing my good old friend? Are you still number 732 in the list? Did you pay $32.43 or "
"54.21 for the book?"
)
print(process_text(text, load_nlp_pipeline()))
print(
process_text(text, load_nlp_pipeline(), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True)
)
print(process_text(text, load_nlp_pipeline(), filter_stopwords=True))
print(process_text(text, load_nlp_pipeline(), return_lemma=True))
print(
process_text(
text,
load_nlp_pipeline(),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
)
print(process_text(text, load_nlp_pipeline(), return_lemma=True, filter_stopwords=True))
================================================
FILE: ludwig/utils/numerical_test_utils.py
================================================
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from typing import Any
import numpy as np
def _dict_like(x):
"""Returns true if an object is a dict or convertible to one, false if not."""
try:
_ = dict(x)
except (TypeError, ValueError):
return False
return True
def _enumerable(x):
"""Returns true if an object is enumerable, false if not."""
try:
_ = enumerate(x)
except (TypeError, ValueError):
return False
return True
def assert_all_finite(x: Any, keypath=""):
"""Ensures that all scalars at all levels of the dictionary, list, array, or scalar are finite.
keypath is only used for logging error messages, to indicate where the non-finite value was detected.
"""
path_description = f" at {keypath} " if keypath else " "
if np.isscalar(x):
assert np.isfinite(x), f"Value{path_description}should be finite, but is {str(x)}."
elif isinstance(x, np.ndarray):
non_finite_indices = np.nonzero(~np.isfinite(x))
non_finite_values = x[non_finite_indices]
assert np.all(np.isfinite(x)), (
f"All values{path_description}should be finite, but found {str(non_finite_values)} "
f"at positions {str(np.array(non_finite_indices).flatten())}."
)
elif _dict_like(x):
# x is either a dict or convertible to one
for k, v in dict(x).items():
assert_all_finite(v, keypath=keypath + "." + str(k) if keypath else str(k))
elif _enumerable(x):
# x is a list, set or other enumerable type, but not a string, dict, or numpy array.
for i, v in enumerate(x):
assert_all_finite(v, keypath=keypath + f"[{i}]")
else:
assert False, f"Unhandled type {str(type(x))} for value{path_description}"
================================================
FILE: ludwig/utils/output_feature_utils.py
================================================
"""Utilities used for managing output feature dicts."""
import numpy as np
import torch
from ludwig.utils.torch_utils import sequence_length_3D, sequence_mask
def get_feature_concat_name(feature_name: str, tensor_name: str) -> str:
return feature_name + "::" + tensor_name
def get_tensor_name_from_concat_name(concat_name: str) -> str:
return concat_name.split("::")[-1]
def get_feature_name_from_concat_name(concat_name: str) -> str:
return "::".join(concat_name.split("::")[:-1])
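# A standalone sketch of the "::" naming convention used by the three helpers above, reproduced so
# the example runs on its own. The feature and tensor names are made up. Because the tensor name is
# taken from the last "::" segment, a feature name that itself contains "::" still round-trips.

```python
def get_feature_concat_name(feature_name: str, tensor_name: str) -> str:
    return feature_name + "::" + tensor_name


def get_tensor_name_from_concat_name(concat_name: str) -> str:
    return concat_name.split("::")[-1]


def get_feature_name_from_concat_name(concat_name: str) -> str:
    return "::".join(concat_name.split("::")[:-1])


key = get_feature_concat_name("category::a", "logits")
print(key)  # category::a::logits
print(get_feature_name_from_concat_name(key))  # category::a
print(get_tensor_name_from_concat_name(key))  # logits
```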
def get_single_output_feature_tensors(
output_feature_dict: dict[str, torch.Tensor], feature_name: str
) -> dict[str, torch.Tensor]:
"""Returns a map of tensors related to the given feature_name."""
single_output_feature_tensors = {}
for concat_name, tensor in output_feature_dict.items():
if get_feature_name_from_concat_name(concat_name) == feature_name:
single_output_feature_tensors[get_tensor_name_from_concat_name(concat_name)] = tensor
return single_output_feature_tensors
def get_output_feature_tensor(
output_dict: dict[str, torch.Tensor], feature_name: str, tensor_name: str
) -> torch.Tensor:
"""Returns the tensor for the given feature_name and tensor_name."""
concat_name = get_feature_concat_name(feature_name, tensor_name)
if concat_name not in output_dict:
raise ValueError(
f"Could not find {tensor_name} for {feature_name} in the output_dict with keys: {output_dict.keys()}"
)
return output_dict[get_feature_concat_name(feature_name, tensor_name)]
def set_output_feature_tensor(
output_dict: dict[str, torch.Tensor], feature_name: str, tensor_name: str, tensor: torch.Tensor
):
"""Adds tensor for the given feature_name and tensor_name to the tensor dict."""
output_dict[get_feature_concat_name(feature_name, tensor_name)] = tensor
def concat_dependencies(
feature_name: str,
dependencies: list[str],
dependency_reducers: torch.nn.ModuleDict,
combiner_hidden_state: torch.Tensor,
other_output_feature_states: dict[str, torch.Tensor],
) -> torch.Tensor:
"""Concatenates combiner_hidden_state with other output feature hidden states based on listed dependencies."""
# No dependencies.
if not dependencies:
return combiner_hidden_state
dependency_hidden_states = []
# Use a distinct loop variable so the feature_name argument is preserved for the error message below.
for dependency_name in dependencies:
# The dependent feature should be present since ECD does a topological sort over output features.
feature_hidden_state = other_output_feature_states[dependency_name]
# This feature is sequential.
if len(combiner_hidden_state.shape) > 2:
if len(feature_hidden_state.shape) > 2:
# The dependent feature is also sequential.
# matrix matrix -> concat
assert combiner_hidden_state.shape[1] == feature_hidden_state.shape[1]
dependency_hidden_states.append(feature_hidden_state)
else:
# The dependent feature is not sequential.
# matrix vector -> tile concat
sequence_max_length = combiner_hidden_state.shape[1]
multipliers = (1, sequence_max_length, 1)
tiled_representation = torch.tile(torch.unsqueeze(feature_hidden_state, 1), multipliers)
sequence_length = sequence_length_3D(combiner_hidden_state)
mask = sequence_mask(sequence_length, sequence_max_length)
tiled_representation = torch.mul(
tiled_representation,
mask[:, :, np.newaxis].type(torch.float32),
)
dependency_hidden_states.append(tiled_representation)
else:
# This feature is not sequential.
if len(feature_hidden_state.shape) > 2:
# The dependent feature is sequential.
# vector matrix -> reduce concat
reducer = dependency_reducers[dependency_name]
dependency_hidden_states.append(reducer(feature_hidden_state))
else:
# The dependent feature is not sequential.
# vector vector -> concat
dependency_hidden_states.append(feature_hidden_state)
try:
hidden = torch.cat([combiner_hidden_state] + dependency_hidden_states, dim=-1)
except Exception as e:
raise ValueError(
f"Shape mismatch {e} while concatenating dependent features of {feature_name}: "
f"{dependencies}. Concatenating the feature activations tensor {combiner_hidden_state} "
f"with activation tensors of dependencies: {dependency_hidden_states}. The error is "
"likely due to a mismatch of the second dimension (sequence length) or a "
"difference in ranks. Likely solutions are setting the maximum_sequence_length "
"of all sequential features to be the same, or reduce the output of some "
"features, or disabling the bucketing setting bucketing_field to None / null, "
"as activating it will reduce the length of the field the bucketing is "
"performed on."
)
return hidden
================================================
FILE: ludwig/utils/package_utils.py
================================================
import importlib
import types
class LazyLoader(types.ModuleType):
"""Lazily import a module, mainly to avoid pulling in large dependencies.
`contrib` and `ffmpeg` are examples of modules that are large and not always
needed, and this allows them to only be loaded when they are used.
Copied from: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/util/lazy_loader.py
"""
# The lint error here is incorrect.
def __init__(self, local_name, parent_module_globals, name): # pylint: disable=super-on-old-class
self._local_name = local_name
self._parent_module_globals = parent_module_globals
super().__init__(name)
def _load(self):
# Import the target module and insert it into the parent's namespace
module = importlib.import_module(self.__name__)
self._parent_module_globals[self._local_name] = module
# Update this object's dict so that if someone keeps a reference to the
# LazyLoader, lookups are efficient (__getattr__ is only called on lookups
# that fail).
self.__dict__.update(module.__dict__)
return module
def __getattr__(self, item):
module = self._load()
return getattr(module, item)
def __dir__(self):
module = self._load()
return dir(module)
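# A standalone usage sketch for LazyLoader, reproduced so the example runs on its own. It defers
# importing the standard-library json module until an attribute is first accessed. The namespace
# dict stands in for a module's globals() and is an assumption made for illustration.

```python
import importlib
import types


class LazyLoader(types.ModuleType):
    def __init__(self, local_name, parent_module_globals, name):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals
        super().__init__(name)

    def _load(self):
        # Import the real module and swap it into the parent namespace.
        module = importlib.import_module(self.__name__)
        self._parent_module_globals[self._local_name] = module
        self.__dict__.update(module.__dict__)
        return module

    def __getattr__(self, item):
        return getattr(self._load(), item)

    def __dir__(self):
        return dir(self._load())


namespace = {}
lazy_json = LazyLoader("json", namespace, "json")
namespace["json"] = lazy_json
was_lazy = isinstance(namespace["json"], LazyLoader)

# The first attribute access triggers the real import.
encoded = lazy_json.dumps({"a": 1})
print(encoded)  # {"a": 1}
print(was_lazy, isinstance(namespace["json"], LazyLoader))  # True False
```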
================================================
FILE: ludwig/utils/print_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
from collections import OrderedDict
from pprint import pformat
from ludwig.api_annotations import DeveloperAPI
logger = logging.getLogger(__name__)
@DeveloperAPI
def get_logging_level_registry() -> dict[str, int]:
return {
"critical": logging.CRITICAL,
"error": logging.ERROR,
"warning": logging.WARNING,
"info": logging.INFO,
"debug": logging.DEBUG,
"notset": logging.NOTSET,
}
@DeveloperAPI
def get_logo(message, ludwig_version):
return "\n".join(
[
"███████████████████████",
"█ █ █ █ ▜█ █ █ █ █ █",
"█ █ █ █ █ █ █ █ █ █ ███",
"█ █ █ █ █ █ █ █ █ ▌ █",
"█ █████ █ █ █ █ █ █ █ █",
"█ █ ▟█ █ █ █",
"███████████████████████",
f"ludwig v{ludwig_version} - {message}",
"",
]
)
@DeveloperAPI
def print_ludwig(message, ludwig_version):
logger.info(get_logo(message, ludwig_version))
@DeveloperAPI
def print_boxed(text, print_fun=logger.info):
box_width = len(text) + 2
print_fun("")
print_fun("╒{}╕".format("═" * box_width))
print_fun(f"│ {text.upper()} │")
print_fun("╘{}╛".format("═" * box_width))
print_fun("")
@DeveloperAPI
def repr_ordered_dict(d: OrderedDict):
return "{" + ",\n ".join(f"{x}: {pformat(y, indent=4)}" for x, y in d.items()) + "}"
@DeveloperAPI
def query_yes_no(question: str, default: str | None = "yes"):
"""Ask a yes/no question via input() and return the answer.
Args:
question: String presented to the user
default: The presumed answer from the user. Must be "yes", "no", or None (Answer is required)
Returns: Boolean based on prompt response
"""
valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
if default is None:
prompt = " [y/n] "
elif default == "yes":
prompt = " [Y/n] "
elif default == "no":
prompt = " [y/N] "
else:
raise ValueError("invalid default answer: '%s'" % default)
while True:
logger.info(question + prompt)
choice = input().lower()
if default is not None and choice == "":
return valid[default]
elif choice in valid:
return valid[choice]
else:
logger.info("Please respond with 'yes' or 'no' (or 'y' or 'n').\n")
================================================
FILE: ludwig/utils/registry.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from collections import UserDict
from typing import Generic, TypeVar
DEFAULT_KEYS = ["None", "none", "null", None]
T = TypeVar("T")
class Registry(UserDict, Generic[T]):
"""Registry is like a normal dict, but with an optional parent dict.
Items are considered to exist in the registry if they are added to either the registry itself, or its parent.
"""
def __init__(self, source=None):
init_data = None
parent = {}
if isinstance(source, Registry):
parent = source
else:
init_data = source
self.parent = parent
super().__init__(init_data)
def __getitem__(self, key: str) -> T:
if self.parent and key not in self.data:
return self.parent.__getitem__(key)
return self.data.__getitem__(key)
def __contains__(self, key: str):
return key in self.data or key in self.parent
def __len__(self) -> int:
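# Count keys present in both the registry and its parent only once, so that
# len() agrees with iteration over the merged view.
return len(self._merged())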
def __iter__(self):
return self._merged().__iter__()
def keys(self):
return self._merged().keys()
def values(self):
return self._merged().values()
def items(self):
return self._merged().items()
def _merged(self):
return {**self.parent, **self.data}
def register(self, name: str, default: bool = False):
def wrap(cls):
self[name] = cls
if default:
for key in DEFAULT_KEYS:
self[key] = cls
return cls
return wrap
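# Usage sketch (illustrative only, not part of the original module): how a
# registry with a parent resolves lookups. `ParentedRegistry`, `MeanReducer` and
# `MaxReducer` are hypothetical names; the class is a reduced stand-in for
# Registry so that the snippet runs without importing ludwig.

```python
from collections import UserDict


class ParentedRegistry(UserDict):
    """Hypothetical, reduced stand-in for the Registry class above."""

    def __init__(self, parent=None):
        self.parent = parent if parent is not None else {}
        super().__init__()

    def __getitem__(self, key):
        # Fall back to the parent only when the key is not registered locally.
        if key not in self.data and key in self.parent:
            return self.parent[key]
        return self.data[key]

    def __contains__(self, key):
        return key in self.data or key in self.parent

    def register(self, name):
        def wrap(cls):
            self[name] = cls
            return cls

        return wrap


base = ParentedRegistry()
child = ParentedRegistry(parent=base)


@base.register("mean")
class MeanReducer:
    pass


@child.register("max")
class MaxReducer:
    pass


# The child sees its own entries plus everything registered on the parent,
# while the parent is unaffected by registrations made on the child.
found_in_child = child["mean"] is MeanReducer
parent_untouched = "max" not in base
```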
================================================
FILE: ludwig/utils/server_utils.py
================================================
import json
import os
import tempfile
from typing import Any
import numpy as np
import pandas as pd
from starlette.datastructures import UploadFile
from starlette.responses import JSONResponse
from ludwig.utils.data_utils import NumpyEncoder
def serialize_payload(data_source: pd.DataFrame | pd.Series) -> tuple:
"""
Generates two dictionaries to be sent via REST API for Ludwig prediction
service.
First dictionary created is payload_dict. Keys found in payload_dict:
raw_data: this is json string created by pandas to_json() method
source_type: indicates if the data_source is either a pandas dataframe or
pandas series. This is needed to know how to rebuild the structure.
ndarray_dtype: this is a dictionary where each entry is for any ndarray
data found in the data_source. This could be an empty dictionary if no
ndarray objects are present in data_source. Key for this dictionary is
column name if data_source is dataframe or index name if data_source is
series. The value portion of the dictionary is the dtype of the
ndarray. This value is used to set the correct dtype when rebuilding
the entry.
Second dictionary created is called payload_files, this contains information
and content for files to be sent to the server. NOTE: if no files are to be
sent, this will be an empty dictionary.
Entries in this dictionary:
Key: file path string for file to be sent to server
Value: tuple(file path string, byte encoded file content,
'application/octet-stream')
Args:
data_source: input features to be sent to Ludwig server
Returns: tuple(payload_dict, payload_files)
"""
payload_dict = {}
payload_dict["ndarray_dtype"] = {}
payload_files = {}
if isinstance(data_source, pd.DataFrame):
payload_dict["raw_data"] = data_source.to_json(orient="columns")
payload_dict["source_type"] = "dataframe"
for col in data_source.columns:
if isinstance(data_source[col].iloc[0], np.ndarray):
# if we have any ndarray columns, record dtype
payload_dict["ndarray_dtype"][col] = str(data_source[col].iloc[0].dtype)
elif isinstance(data_source[col].iloc[0], str) and os.path.exists(data_source[col].iloc[0]):
# if we have file path feature, prepare file for transport
for v in data_source[col]:
payload_files[v] = (v, open(v, "rb"), "application/octet-stream")
elif isinstance(data_source, pd.Series):
payload_dict["raw_data"] = data_source.to_json(orient="index")
payload_dict["source_type"] = "series"
for col in data_source.index:
if isinstance(data_source[col], np.ndarray):
# for ndarrays record dtype for reconstruction
payload_dict["ndarray_dtype"][col] = str(data_source[col].dtype)
elif isinstance(data_source[col], str) and os.path.exists(data_source[col]):
# if we have file path feature, prepare file for transport
v = data_source[col]
payload_files[v] = (v, open(v, "rb"), "application/octet-stream")
else:
raise ValueError(
'"data_source" must be either a pandas DataFrame or Series, '
"format found to be {}".format(type(data_source))
)
return payload_dict, payload_files
def _write_file(v, files):
# Convert UploadFile to a NamedTemporaryFile to ensure it's on the disk
suffix = os.path.splitext(v.filename)[1]
named_file = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
files.append(named_file)
named_file.write(v.file.read())
named_file.close()
return named_file.name
def deserialize_payload(json_string: str) -> pd.DataFrame:
"""This function performs the inverse of the serialize_payload function and rebuilds the object represented in
json_string to a pandas DataFrame.
Args:
json_string: representing object to be rebuilt.
Returns: pandas.DataFrame
"""
payload_dict = json.loads(json_string)
# extract raw data from json string
raw_data_dict = json.loads(payload_dict["raw_data"])
# rebuild based on original data source
if payload_dict["source_type"] == "dataframe":
# reconstitute the pandas dataframe
df = pd.DataFrame.from_dict(raw_data_dict, orient="columns")
elif payload_dict["source_type"] == "series":
# reconstitute series into single row dataframe
df = pd.DataFrame(pd.Series(raw_data_dict)).T
else:
raise ValueError(
'Unknown "source_type" found. Valid values are "dataframe" or '
'"series". Instead found {}'.format(payload_dict["source_type"])
)
# if source has ndarrays, rebuild those from list and set
# original dtype.
if payload_dict["ndarray_dtype"]:
# yes, now convert list representation to ndarray representation
for col in payload_dict["ndarray_dtype"]:
dtype = payload_dict["ndarray_dtype"][col]
df[col] = df[col].apply(lambda x: np.array(x).astype(dtype))
return df
def deserialize_request(form) -> tuple:
"""This function will deserialize the REST API request packet to create a pandas dataframe that is input to the
Ludwig predict method and a list of files that will be cleaned up at the end of processing.
Args:
form: form data provided by the REST API request
Returns: tuple(pandas.DataFrame, list of temporary files to clean up)
"""
files = []
file_index = {}
for k, v in form.multi_items():
if type(v) is UploadFile:
file_index[v.filename] = _write_file(v, files)
# reconstruct the dataframe
df = deserialize_payload(form["payload"])
# insert files paths of the temporary files in place of the original
# file paths specified by the user.
# pd.DataFrame.replace() method is used to replace each file path string
# specified by the user with the path of the temporary file that holds
# the same content.
# parameters for replace() method:
# to_replace: list of file path strings that the user provided
# value: list of temporary files created for each input file
#
# IMPORTANT: There is a one-to-one correspondence of the to_replace list
# and the value list. Each list must be the same size.
df.replace(to_replace=list(file_index.keys()), value=list(file_index.values()), inplace=True)
return df, files
class NumpyJSONResponse(JSONResponse):
def render(self, content: dict[str, Any]) -> bytes:
"""Override the default JSONResponse behavior to encode numpy arrays.
Args:
content: JSON object to be serialized.
Returns: UTF-8 encoded bytes of the serialized JSON
"""
return json.dumps(
content, ensure_ascii=False, allow_nan=False, indent=None, separators=(",", ":"), cls=NumpyEncoder
).encode("utf-8")
================================================
FILE: ludwig/utils/state_dict_backward_compatibility.py
================================================
# Copyright (c) 2023 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
def _update_transformers_to_freeze_module(state_dict):
"""Updates pre-trained encoders which were saved prior to the addition of FreezeModule."""
return {
(
k.replace("encoder_obj.transformer.", "encoder_obj.transformer.module.")
if "encoder_obj.transformer.module" not in k
else k
): v
for k, v in state_dict.items()
}
def _update_combiner_no_input_features(state_dict):
"""Removed combiner.input_features from state_dict following DeepSpeed integration."""
return {k: v for k, v in state_dict.items() if not k.startswith("combiner.input_features.")}
def _update_combiner_no_device_tensor(state_dict):
"""Removed device_tensor from state_dict following DeepSpeed integration."""
return {k: v for k, v in state_dict.items() if not k.endswith("device_tensor")}
def update_state_dict(state_dict):
"""Checks state_dict on load, updates state dict if needed."""
state_dict = _update_transformers_to_freeze_module(state_dict)
state_dict = _update_combiner_no_input_features(state_dict)
state_dict = _update_combiner_no_device_tensor(state_dict)
return state_dict
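# Worked sketch (illustrative only, not part of the original module): the three
# migration steps above applied to a toy state dict. The key names and integer
# values in `legacy` are made up, and `migrate` is a hypothetical single-pass
# equivalent of chaining the three helpers.

```python
legacy = {
    "input_features.text.encoder_obj.transformer.layer.0.weight": 1,
    "input_features.text.encoder_obj.transformer.module.layer.1.weight": 2,
    "combiner.input_features.text.bias": 3,
    "output_features.label.device_tensor": 4,
    "combiner.fc.weight": 5,
}


def migrate(state_dict):
    """Hypothetical one-pass version of the three update steps."""
    prefix = "encoder_obj.transformer."
    migrated = {}
    for key, value in state_dict.items():
        # Drop entries that are no longer stored in the state dict.
        if key.startswith("combiner.input_features.") or key.endswith("device_tensor"):
            continue
        # Insert the FreezeModule wrapper level unless it is already there.
        if "encoder_obj.transformer.module" not in key:
            key = key.replace(prefix, prefix + "module.")
        migrated[key] = value
    return migrated


migrated = migrate(legacy)
```

# Keys that already contain `.module.` are left alone, so running the migration
# a second time returns the same dictionary.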
================================================
FILE: ludwig/utils/strings_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import re
import unicodedata
from collections import Counter
from dataclasses import dataclass
from enum import Enum
import numpy as np
from dateutil.parser import parse as parse_datetime
from ludwig.constants import PADDING_SYMBOL, START_SYMBOL, STOP_SYMBOL, UNKNOWN_SYMBOL
from ludwig.data.dataframe.base import DataFrameEngine
from ludwig.data.dataframe.pandas import PANDAS
from ludwig.utils.fs_utils import open_file
from ludwig.utils.math_utils import int_type
from ludwig.utils.tokenizers import get_tokenizer_from_registry
from ludwig.utils.types import Series
PANDAS_TRUE_STRS = {"true"}
PANDAS_FALSE_STRS = {"false"}
BOOL_TRUE_STRS = {"yes", "y", "true", "t", "1", "1.0"}
BOOL_FALSE_STRS = {"no", "n", "false", "f", "0", "0.0", "-1", "-1.0"}
logger = logging.getLogger(__name__)
class SpecialSymbol(Enum):
"""Special symbols used for text features."""
STOP = 0
START = 1
PADDING = 2
UNKNOWN = 3
def all_bool_strs():
"""Returns all valid boolean strings, with varied capitalization."""
fns = [lambda x: x, lambda x: x.upper(), lambda x: x.capitalize()]
return sorted({fn(x) for fn in fns for x in BOOL_TRUE_STRS | BOOL_FALSE_STRS})
def make_safe_filename(s):
def safe_char(c):
if c.isalnum():
return c
else:
return "_"
return "".join(safe_char(c) for c in s).rstrip("_")
def strip_accents(s):
return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn")
def str2bool(v: str, fallback_true_label=None) -> bool:
"""Returns bool representation of the given value v.
Check the value against global bool string lists.
Fallback to using fallback_true_label as True if the value isn't in the global bool lists.
args:
v: Value to get the bool representation for.
fallback_true_label: (str) label to use as 'True'.
"""
v_str = str(v).lower()
if v_str in BOOL_TRUE_STRS:
return True
if v_str in BOOL_FALSE_STRS:
return False
if fallback_true_label is None:
raise ValueError(
f"Cannot automatically map value '{v}' to a boolean and no `preprocessing.fallback_true_label` specified"
)
return v == fallback_true_label
def values_are_pandas_numbers(values: list[str]):
"""Returns True if values would be read by pandas as dtype float or int."""
for v in values:
try:
float(v)
except ValueError:
return False
return True
def values_are_pandas_bools(values: list[str]):
"""Returns True if values would be read by pandas as dtype bool."""
lowercase_values_set = {str(v).lower() for v in values}
return lowercase_values_set.issubset(PANDAS_FALSE_STRS | PANDAS_TRUE_STRS)
def are_conventional_bools(values: list[str | bool]) -> bool:
"""Returns whether all values are conventional booleans."""
for value in values:
lower_value = str(value).lower()
if lower_value not in BOOL_TRUE_STRS and lower_value not in BOOL_FALSE_STRS:
return False
return True
def is_number(s: str | int | float):
"""Returns whether specified value is number."""
if isinstance(s, str) and s.lower() == "nan":
return True
try:
float(s)
return True
except ValueError:
return False
def is_datetime(s: str | int | float):
"""Returns whether specified value is datetime."""
if is_number(s):
return False
try:
parse_datetime(s)
return True
except Exception:
return False
def are_all_datetimes(values: list[str | int | float]):
"""Returns whether all values are datetimes."""
for value in values:
if not is_datetime(value):
return False
return True
def are_all_numbers(values: list[str | int | float]):
"""Returns whether all values are numbers."""
for value in values:
if not is_number(value):
return False
return True
def is_integer(s: str | int | float):
"""Returns whether specified value is an integer."""
try:
float(s)
except ValueError:
return False
else:
return float(s).is_integer() and not np.isnan(float(s))
def are_sequential_integers(values: list[str | int | float]):
"""Returns whether distinct values form sequential integer list."""
int_list = []
for value in values:
if not is_integer(value):
return False
int_list.append(int(float(value)))
return (max(int_list) - min(int_list) + 1) == len(int_list)
def match_replace(string_to_match, list_regex):
"""Matches strings against regular expressions.
arguments:
string_to_match -- the string to match
list_regex -- list of (compiled pattern, replacement) pairs, applied in order
returns:
string_to_match -- the cleaned string
matched -- the list of regular expressions that matched
"""
matched = []
for regex in list_regex:
match = re.search(regex[0], string_to_match)
if match:
string_to_match = re.sub(regex[0], regex[1], string_to_match)
matched.append(regex[0].pattern)
return string_to_match, matched
def load_vocabulary(vocab_file):
with open_file(vocab_file, "r", encoding="utf-8") as f:
vocabulary = []
for line in f:
line = line.strip()
if " " in line:
line = line.split(" ")[0]
vocabulary.append(line)
return vocabulary
def add_or_move_symbol(vocab_list: list[str], vocab_set: set[str], symbol: str, index: int):
"""Inserts or moves the symbol to the specified index."""
if symbol in vocab_set:
vocab_list.remove(symbol)
vocab_list.insert(index, symbol)
@dataclass
class Vocabulary:
vocab: list[str]
"""List of strings representing the computed vocabulary."""
str2idx: dict[str, int]
"""Map of symbol to index."""
str2freq: dict[str, int]
"""Map of symbol to frequency."""
str2idf: dict[str, float] | None
"""Map of symbol to inverse document frequency."""
max_sequence_length: int
"""Maximum sequence length."""
sequence_length_99ptile: int
"""99th percentile of sequence length."""
pad_idx: int
"""Index to padding symbol."""
padding_symbol: str
"""Actual padding symbol."""
unknown_symbol: str
"""Actual unknown symbol."""
prompt_template_num_tokens: int = 0
"""The number of tokens in the prompt template.
If -1, then there is no prompt template.
"""
def _get_vocab_from_dict(vocab: dict[str, int]) -> list[str]:
"""Returns a vocab in list format from a vocab token=>idx dictionary."""
vocab_values = list(vocab.values())
if len(set(vocab_values)) != len(vocab_values):
raise ValueError("Vocabulary maps multiple tokens to the same index. This should never happen.")
# construct a vocab that is a list that reflects the token=>index mapping in HF's vocab
# pre-allocate a list to make sure each index is inited to prevent OBO errors caused by missing indices
max_idx = max(vocab_values)
vocab_list = [None for _ in range(max_idx + 1)]
for token, idx in vocab.items():
vocab_list[idx] = token
return vocab_list
def _get_vocabulary(
tokenizer_type: str,
tokenizer,
vocab_file: str,
unknown_symbol: str,
add_special_symbols: bool,
padding_symbol: str,
unit_counts: Counter,
num_most_frequent: int,
) -> list[str] | None:
"""Returns the vocabulary from the tokenizer_type, tokenizer, or vocab_file.
If the `tokenizer_type` is 'hf_tokenizer', then the set vocabulary from the tokenizer is used.
If there's no vocab_file or if the tokenizer has no set vocabulary (e.g. space_punct), then the vocabulary is
determined from the tokenized data (unit_counts).
The UNKNOWN special symbol is always included in the final vocabulary. Additional special symbols (PADDING, START,
STOP) are added if add_special_symbols=True. If the tokenizer is a pre-trained huggingface tokenizer, then the
special symbols are taken from the tokenizer's vocabulary.
"""
# Pre-trained huggingface tokenizer. Use the pre-existing vocabulary and special symbols.
if tokenizer_type == "hf_tokenizer":
try:
return _get_vocab_from_dict(tokenizer.get_vocab())
except NotImplementedError:
logger.warning(
"HuggingFace tokenizer does not have a get_vocab() method. "
+ "Using tokenizer.tokenizer.vocab_size and tokenizer.tokenizer._convert_id_to_token "
+ "to build the vocabulary."
)
vocab = []
for idx in range(tokenizer.tokenizer.vocab_size):
vocab.append(tokenizer.tokenizer._convert_id_to_token(idx))
vocab += tokenizer.tokenizer.added_tokens_encoder.keys()
return vocab
# The tokenizer has a preset vocabulary.
if hasattr(tokenizer, "get_vocab"):
return _get_vocab_from_dict(tokenizer.get_vocab())
# Load the vocabulary from the vocab file.
if vocab_file is not None:
return load_vocabulary(vocab_file)
# The tokenizer had no preset vocabulary, for example space_punct.
# Compute the vocabulary from tokenized data.
return [unit for unit, _ in unit_counts.most_common(num_most_frequent)]
def remove_bracketed_elements(prompt_template: str) -> str:
"""Removes every curly-bracketed placeholder, e.g. "Hello {name}!" -> "Hello !"."""
pattern = r"\{.*?\}"
return re.sub(pattern, "", prompt_template)
def create_vocabulary(
data: Series,
tokenizer_type: str = "space",
lowercase: bool = True,
num_most_frequent: int = None,
vocab_file: str = None,
add_special_symbols: bool = True,
unknown_symbol: str = UNKNOWN_SYMBOL,
padding_symbol: str = PADDING_SYMBOL,
start_symbol: str = START_SYMBOL,
stop_symbol: str = STOP_SYMBOL,
pretrained_model_name_or_path: str = None,
ngram_size: int | None = None,
compute_idf: bool = False,
processor: DataFrameEngine = PANDAS,
prompt_template: str = "",
) -> Vocabulary:
"""Computes a vocabulary over the provided data frame.
This function is used when the data consists of multiple tokens within one example. E.g., words in a text feature,
items in a set feature, etc. If the feature only contains a single token like for category features,
`create_vocabulary_single_token` should be used instead, as it is more efficient.
A tokenizer is specified using the `tokenizer_type`. The tokenizer will be used to process all of the data
provided, producing an indexed vocabulary with frequency counts. If the `tokenizer_type` is 'hf_tokenizer',
then a pre-trained huggingface tokenizer is loaded from `pretrained_model_name_or_path` and that vocabulary is
used directly.
The UNKNOWN special symbol is always included in the final vocabulary. Additional special symbols (PADDING, START,
STOP) are added if add_special_symbols=True. If the tokenizer is a pre-trained huggingface tokenizer, then the
special symbols are taken from the tokenizer's vocabulary.
Args:
prompt_template: The prompt template for the model. Applicable only to LLMs.
data: Series of string data.
tokenizer_type: Tokenizer type. Can be a tokenizer registry value or 'hf_tokenizer' for huggingface.
lowercase: Whether to lowercase all strings.
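num_most_frequent: Upper limit on vocabulary size.
vocab_file: Optional path to a vocabulary file. Passed to the tokenizer and loaded as the vocabulary when
the tokenizer has no preset vocabulary.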
add_special_symbols: If True, START, STOP, PADDING special symbols are added to the vocabulary. UNKNOWN is
always added.
unknown_symbol: String representation for the UNKNOWN symbol.
padding_symbol: String representation for the PADDING symbol.
start_symbol: String representation for the START symbol.
stop_symbol: String representation for the STOP symbol.
pretrained_model_name_or_path: Name/path to huggingface model.
ngram_size: Size of the n-gram when using `ngram` tokenizer.
compute_idf: If True, computes the inverse document frequency for each token.
processor: Which processor to use to process data.
Returns:
Vocabulary object containing metadata about the vocab.
TODO(Justin): Clean up pad_idx, padding_symbol, unknown_symbol return, as no one seems to be using it.
"""
tokenizer = get_tokenizer_from_registry(tokenizer_type)(
vocab_file=vocab_file,
pretrained_model_name_or_path=pretrained_model_name_or_path,
ngram_size=ngram_size,
)
# Number of tokens in template.
prompt_template_num_tokens = -1
if prompt_template:
prompt_without_bracketed_elements = remove_bracketed_elements(prompt_template)
prompt_template_num_tokens = len(tokenizer(prompt_without_bracketed_elements))
# Tokenize the data.
def process_line(line):
return tokenizer(line.lower() if lowercase else line)
processed_lines = processor.map_objects(data, process_line)
processed_counts = processed_lines.explode().value_counts(sort=False)
processed_counts = processor.compute(processed_counts)
unit_counts = Counter(dict(processed_counts))
max_sequence_length = processor.compute(processed_lines.map(len).max())
sequence_length_99ptile = processor.compute(processed_lines.map(len).quantile(0.99))
if tokenizer_type != "hf_tokenizer":
# For non-HF tokenizers, add 2 for start and stop symbols.
max_sequence_length += 2
sequence_length_99ptile += 2
pad_idx = None
if tokenizer_type == "hf_tokenizer":
# Replace the special symbols with the ones from the tokenizer.
unknown_symbol = tokenizer.get_unk_token()
padding_symbol = tokenizer.get_pad_token()
pad_idx = tokenizer.convert_token_to_id(padding_symbol)
vocab: list[str] = _get_vocabulary(
tokenizer_type,
tokenizer,
vocab_file,
unknown_symbol,
add_special_symbols,
padding_symbol,
unit_counts,
num_most_frequent,
)
vocab_set = set(vocab)
doc_unit_counts = None
if compute_idf:
# The document frequency used for TF-IDF. Similar to unit_counts, but de-duped by document.
document_counts = processed_lines.map(lambda x: set(x)).explode().value_counts(sort=False)
document_counts = processor.compute(document_counts)
doc_unit_counts = Counter(dict(document_counts))
if tokenizer_type != "hf_tokenizer":
if add_special_symbols:
add_or_move_symbol(vocab, vocab_set, stop_symbol, SpecialSymbol.STOP.value)
add_or_move_symbol(vocab, vocab_set, start_symbol, SpecialSymbol.START.value)
add_or_move_symbol(vocab, vocab_set, padding_symbol, SpecialSymbol.PADDING.value)
# Always add the UNKNOWN symbol if we're using our own tokenizer.
add_or_move_symbol(vocab, vocab_set, unknown_symbol, SpecialSymbol.UNKNOWN.value)
str2idx = {unit: i for i, unit in enumerate(vocab)}
str2freq = {unit: unit_counts.get(unit) if unit in unit_counts else 0 for unit in vocab}
str2idf = (
{unit: np.log(len(vocab) / (1 + doc_unit_counts.get(unit))) if unit in doc_unit_counts else 0 for unit in vocab}
if compute_idf
else None
)
if pad_idx is None and padding_symbol in str2idx.keys():
pad_idx = str2idx[padding_symbol]
return Vocabulary(
vocab=vocab,
str2idx=str2idx,
str2freq=str2freq,
str2idf=str2idf,
max_sequence_length=max_sequence_length,
sequence_length_99ptile=sequence_length_99ptile,
pad_idx=pad_idx,
padding_symbol=padding_symbol,
unknown_symbol=unknown_symbol,
prompt_template_num_tokens=prompt_template_num_tokens,
)
def create_vocabulary_single_token(
data: Series,
num_most_frequent: int | None = None,
processor: DataFrameEngine = PANDAS,
unknown_symbol: str = UNKNOWN_SYMBOL,
):
"""Computes a vocabulary over the provided data frame.
This function should be used iff the values in each row of data should be considered as a single token, e.g.,
category features ("interested", "not interested", "somewhat interested").
This assumption allows us to be more efficient than `create_vocabulary()` as we can skip tokenization and
computing the maximum sequence length, which are unnecessary for category features.
Args:
data: Series of string data.
num_most_frequent: Upper limit on vocabulary size.
unknown_symbol: String representation for the UNKNOWN symbol.
processor: Which processor to use to process data.
Returns:
Tuple of:
vocab: List of strings representing the computed vocabulary.
str2idx: Map of symbol to index.
str2freq: Map of symbol to frequency.
"""
processed_counts = data.str.strip().value_counts(sort=True)
processed_counts = processor.compute(processed_counts)
full_vocab = processed_counts.index.tolist()
# Only add unknown symbol if num most frequent tokens is less than total number of unique tokens
if num_most_frequent is not None and num_most_frequent < len(full_vocab):
vocab = [unknown_symbol] + full_vocab[:num_most_frequent]
else:
vocab = full_vocab
str2idx = {unit: i for i, unit in enumerate(vocab)}
str2freq = processed_counts.to_dict()
str2freq = {k: str2freq.get(k, 0) for k in vocab}
return vocab, str2idx, str2freq
def _get_sequence_vector(
sequence, tokenizer, tokenizer_type, format_dtype, unit_to_id, lowercase=True, unknown_symbol=UNKNOWN_SYMBOL
) -> np.ndarray:
unit_sequence = tokenizer(sequence.lower() if lowercase else sequence)
unit_indices_vector = np.empty(len(unit_sequence), dtype=format_dtype)
for i in range(len(unit_sequence)):
curr_unit = unit_sequence[i]
if tokenizer_type == "hf_tokenizer":
unit_indices_vector[i] = curr_unit
else:
if curr_unit in unit_to_id:
unit_indices_vector[i] = unit_to_id[curr_unit]
else:
unit_indices_vector[i] = unit_to_id[unknown_symbol]
# Add start and stop symbols.
# Huggingface's pretrained tokenizers take care of this implicitly:
# https://huggingface.co/docs/transformers/preprocessing
if tokenizer_type != "hf_tokenizer":
unit_indices_vector = np.append(unit_indices_vector, unit_to_id[STOP_SYMBOL])
unit_indices_vector = np.insert(unit_indices_vector, 0, unit_to_id[START_SYMBOL])
return unit_indices_vector
def build_sequence_matrix(
sequences, # pd.core.series.Series
inverse_vocabulary,
tokenizer_type,
length_limit,
padding_symbol=PADDING_SYMBOL,
padding="right",
unknown_symbol=UNKNOWN_SYMBOL,
lowercase=True,
tokenizer_vocab_file=None,
pretrained_model_name_or_path=None,
processor=PANDAS,
) -> np.ndarray:
tokenizer = get_tokenizer_from_registry(tokenizer_type)(
vocab_file=tokenizer_vocab_file,
pretrained_model_name_or_path=pretrained_model_name_or_path,
)
format_dtype = int_type(len(inverse_vocabulary) - 1)
unit_vectors = sequences.map(
lambda sequence: _get_sequence_vector(
sequence,
tokenizer,
tokenizer_type,
format_dtype,
inverse_vocabulary,
lowercase=lowercase,
unknown_symbol=unknown_symbol,
)
)
max_length = processor.compute(unit_vectors.map(len).max())
if max_length < length_limit:
logger.debug(f"max length of sequences: {max_length} < limit: {length_limit}")
max_length = length_limit
if tokenizer_type == "hf_tokenizer":
padding_symbol = tokenizer.get_pad_token()
pad_token_id = tokenizer.convert_token_to_id(padding_symbol)
else:
pad_token_id = inverse_vocabulary[padding_symbol]
def pad(vector):
sequence = np.full((int(max_length),), pad_token_id, dtype=format_dtype)
limit = min(vector.shape[0], max_length)
if padding == "right":
sequence[:limit] = vector[:limit]
else: # if padding == 'left'
sequence[max_length - limit :] = vector[:limit]
return sequence
padded = processor.map_objects(unit_vectors, pad)
return padded
def get_tokenizer(tokenizer_type: str, tokenizer_vocab_file: str, pretrained_model_name_or_path: str):
"""Returns a tokenizer object based on the tokenizer type."""
return get_tokenizer_from_registry(tokenizer_type)(
vocab_file=tokenizer_vocab_file,
pretrained_model_name_or_path=pretrained_model_name_or_path,
)
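# Worked sketch (illustrative only, not part of the original module): building a
# vocabulary with the special symbols pinned to indices 0-3, then encoding and
# right-padding one sequence, using whitespace tokenization. `build_vocab` and
# `encode` are hypothetical helpers, and the symbol strings below are assumed
# placeholder values, not necessarily those defined in ludwig.constants.

```python
from collections import Counter

# Assumed placeholder symbols, ordered as in SpecialSymbol (STOP, START, PADDING, UNKNOWN).
STOP, START, PAD, UNK = "<EOS>", "<SOS>", "<PAD>", "<UNK>"


def build_vocab(lines, num_most_frequent=None):
    """Count whitespace tokens and pin the special symbols to indices 0-3."""
    counts = Counter(token for line in lines for token in line.lower().split())
    vocab = [token for token, _ in counts.most_common(num_most_frequent)]
    for index, symbol in enumerate([STOP, START, PAD, UNK]):
        if symbol in vocab:
            vocab.remove(symbol)
        vocab.insert(index, symbol)
    return vocab, {token: i for i, token in enumerate(vocab)}


def encode(line, str2idx, length):
    """Map tokens to ids, wrap in START/STOP, then truncate or right-pad."""
    tokens = line.lower().split()
    ids = [str2idx[START]] + [str2idx.get(t, str2idx[UNK]) for t in tokens] + [str2idx[STOP]]
    ids = ids[:length]
    return ids + [str2idx[PAD]] * (length - len(ids))


vocab, str2idx = build_vocab(["the cat sat", "the dog sat down"])
encoded = encode("the bird sat", str2idx, length=8)
# "bird" is not in the vocabulary, so it maps to the UNKNOWN index (3).
```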
================================================
FILE: ludwig/utils/structural_warning.py
================================================
import warnings
from ludwig.utils.logging_utils import log_once
def warn_structure_refactor(old_module: str, new_module: str, direct: bool = True) -> None:
"""Creates a structure-refactor warning indicating a module's new location.
Only creates a warning once per module.
"""
old_module = old_module.replace(".py", "")
if log_once(old_module):
warning = (
f"The module `{old_module}` has been moved to `{new_module}` and the old "
f"location will be deprecated soon. Please adjust your imports to point "
f"to the new location."
)
if direct:
warning += f" Example: Do a global search and " f"replace `{old_module}` with `{new_module}`."
else:
warning += (
f"\nATTENTION: This module may have been split or refactored. Please "
f"check the contents of `{new_module}` before making changes."
)
with warnings.catch_warnings():
warnings.simplefilter("always")
warnings.warn(warning, DeprecationWarning, stacklevel=3)
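# Illustrative sketch, not part of Ludwig: a warn-once deprecation helper in the
# spirit of warn_structure_refactor(). The first_time() function is a hypothetical
# stand-in for log_once(), whose body is not shown here; the sketch assumes it
# returns True only the first time a given key is seen.

```python
import warnings

_seen_keys = set()


def first_time(key):
    """Hypothetical stand-in for log_once(): True only on the first call per key."""
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


def warn_moved(old_module, new_module):
    """Warn once per old_module that it now lives at new_module."""
    if first_time(old_module):
        with warnings.catch_warnings():
            # Force the warning through even if DeprecationWarning is filtered.
            warnings.simplefilter("always")
            warnings.warn(
                f"The module `{old_module}` has been moved to `{new_module}`.",
                DeprecationWarning,
                stacklevel=2,
            )


# Record warnings to show that the second call stays silent.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_moved("pkg.old", "pkg.new")
    warn_moved("pkg.old", "pkg.new")
```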
================================================
FILE: ludwig/utils/system_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2022 Predibase, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from dataclasses import dataclass
from ludwig.api_annotations import DeveloperAPI
@DeveloperAPI
@dataclass
class Resources:
cpus: int
gpus: int
================================================
FILE: ludwig/utils/time_utils.py
================================================
#! /usr/bin/env python
# Copyright (c) 2023 Predibase, Inc., 2019 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import logging
import time
from datetime import datetime, timedelta
logger = logging.getLogger(__name__)
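class WithTimer:
"""Context manager that logs elapsed wall-clock and process time on exit, unless `quiet` is set."""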
def __init__(self, title="", quiet=False):
self.title = title
self.quiet = quiet
def elapsed(self):
return time.time() - self.wall, time.process_time() - self.proc
def enter(self):
"""Manually trigger enter."""
self.__enter__()
def __enter__(self):
self.proc = time.process_time()
self.wall = time.time()
return self
def __exit__(self, *args):
if not self.quiet:
elapsed_wp = self.elapsed()
logger.info(f"Elapsed {self.title}: wall {elapsed_wp[0]:.06f}, sys {elapsed_wp[1]:.06f}")
class Timer:
def __init__(self):
self.reset()
def reset(self):
self._proc = time.process_time()
self._wall = time.time()
def elapsed(self):
return self.wall(), self.proc()
def elapsed_str(self):
return strdelta(self.wall() * 1000.0), strdelta(self.proc() * 1000.0)
def wall(self):
return time.time() - self._wall
def proc(self):
return time.process_time() - self._proc
def tic(self):
"""Like Matlab tic: resets both the wall-time and processor-time baselines."""
self.reset()
def toc(self):
"""Like Matlab tic/toc for wall time."""
return self.wall()
def tocproc(self):
"""Like Matlab tic/toc, but for processor time."""
return self.proc()
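# Illustrative sketch, not part of Ludwig: the two clocks used by WithTimer and
# Timer measure different things. time.time() tracks wall-clock time, while
# time.process_time() counts only CPU time spent by this process, so a sleep
# advances the first but barely moves the second. SketchTimer is a hypothetical
# name used only for this example.

```python
import time


class SketchTimer:
    """Minimal context manager that records wall-clock and CPU time."""

    def __enter__(self):
        self.proc_start = time.process_time()
        self.wall_start = time.time()
        return self

    def __exit__(self, *args):
        self.wall_elapsed = time.time() - self.wall_start
        self.proc_elapsed = time.process_time() - self.proc_start


with SketchTimer() as t:
    time.sleep(0.05)  # sleeping consumes wall time but almost no CPU time
```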
def timestamp():
return f"{datetime.now():%Y_%m_%d_%H_%M_%S}"
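def strdelta(tdelta):
"""Format a timedelta, or a number of milliseconds, as a compact string such as `1h 2m 5.0050s`."""
if isinstance(tdelta, (int, float)):
tdelta = timedelta(milliseconds=tdelta)
d = {"D": tdelta.days}
d["H"], rem = divmod(tdelta.seconds, 3600)
d["M"], d["S"] = divmod(rem, 60)
# Zero-pad microseconds to six digits so that 5000 us renders as ".0050" rather than ".5000".
d["f"] = f"{tdelta.microseconds:06d}"[0:4]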
if d["D"] > 0:
t = "{D}d {H}h {M}m {S}.{f}s"
elif d["H"] > 0:
t = "{H}h {M}m {S}.{f}s"
elif d["M"] > 0:
t = "{M}m {S}.{f}s"
else:
t = "{S}.{f}s"
return t.format(**d)
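# Illustrative sketch, not part of Ludwig: how a millisecond count decomposes into
# the fields that strdelta() formats. timedelta normalizes its input into days,
# seconds (0..86399) and microseconds (0..999999); two divmod calls then split the
# seconds into hours, minutes and seconds. Because microseconds is an integer, the
# fractional part is zero-padded to six digits before it is truncated.
# split_duration() is a hypothetical helper name.

```python
from datetime import timedelta


def split_duration(milliseconds):
    """Return (days, hours, minutes, seconds, 4-digit fraction string)."""
    td = timedelta(milliseconds=milliseconds)
    hours, rem = divmod(td.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    fraction = f"{td.microseconds:06d}"[:4]
    return td.days, hours, minutes, seconds, fraction


short_run = split_duration(3_725_005)   # 1 h 2 min 5.005 s
long_run = split_duration(90_061_500)   # 1 day 1 h 1 min 1.5 s
```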
================================================
FILE: ludwig/utils/tokenizers.py
================================================
"""Ludwig string tokenizers including string-based, spacy-based, and huggingface-based implementations.
To add a new tokenizer, 1) implement a subclass of BaseTokenizer and 2) add it to the tokenizer_registry.
Once it's in the registry, the tokenizer can be used in a Ludwig config, e.g.:
```
input_features:
- name: title
type: text
preprocessing:
tokenizer:
```
"""
import logging
import re
from abc import abstractmethod
from typing import Any
import torch
from ludwig.utils.nlp_utils import load_nlp_pipeline, process_text
logger = logging.getLogger(__name__)
SPACE_PUNCTUATION_REGEX = re.compile(r"\w+|[^\w\s]")
COMMA_REGEX = re.compile(r"\s*,\s*")
UNDERSCORE_REGEX = re.compile(r"\s*_\s*")
TORCHSCRIPT_COMPATIBLE_TOKENIZERS = {"space", "space_punct"}
class BaseTokenizer:
@abstractmethod
def __init__(self, **kwargs):
pass
@abstractmethod
def __call__(self, text: str):
pass
class CharactersToListTokenizer(BaseTokenizer):
def __call__(self, text):
return list(text)
class SpaceStringToListTokenizer(torch.nn.Module):
"""Implements torchscript-compatible whitespace tokenization."""
def __init__(self, **kwargs):
super().__init__()
def forward(self, v: str | list[str] | torch.Tensor) -> Any:
if isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")
inputs: list[str] = []
# Ludwig calls map on List[str] objects, so we need to handle individual strings as well.
if isinstance(v, str):
inputs.append(v)
else:
inputs.extend(v)
tokens: list[list[str]] = []
for sequence in inputs:
split_sequence = sequence.strip().split(" ")
token_sequence: list[str] = []
for token in split_sequence:
if len(token) > 0:
token_sequence.append(token)
tokens.append(token_sequence)
return tokens[0] if isinstance(v, str) else tokens
class SpacePunctuationStringToListTokenizer(torch.nn.Module):
"""Implements torchscript-compatible space_punct tokenization."""
def __init__(self, **kwargs):
super().__init__()
def is_regex_w(self, c: str) -> bool:
return c.isalnum() or c == "_"
def forward(self, v: str | list[str] | torch.Tensor) -> Any:
if isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")
inputs: list[str] = []
# Ludwig calls map on List[str] objects, so we need to handle individual strings as well.
if isinstance(v, str):
inputs.append(v)
else:
inputs.extend(v)
tokens: list[list[str]] = []
for sequence in inputs:
token_sequence: list[str] = []
word: list[str] = []
for c in sequence:
if self.is_regex_w(c):
word.append(c)
elif len(word) > 0: # if non-empty word and non-alphanumeric char, append word to token sequence
token_sequence.append("".join(word))
word.clear()
if not self.is_regex_w(c) and not c.isspace(): # non-alphanumeric, non-space char is punctuation
token_sequence.append(c)
if len(word) > 0: # add last word
token_sequence.append("".join(word))
tokens.append(token_sequence)
return tokens[0] if isinstance(v, str) else tokens
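# Illustrative sketch, not part of Ludwig: the character loop in
# SpacePunctuationStringToListTokenizer.forward() appears intended to reproduce
# SPACE_PUNCTUATION_REGEX (r"\w+|[^\w\s]") without the re module, presumably for
# torchscript compatibility. The standalone function below copies the loop and
# compares it with the regex on an ASCII sample; agreement on arbitrary Unicode
# input is not checked here.

```python
import re

SPACE_PUNCT = re.compile(r"\w+|[^\w\s]")


def space_punct_loop(text):
    """Split text into runs of word characters and single punctuation marks."""
    tokens, word = [], []
    for c in text:
        is_word_char = c.isalnum() or c == "_"
        if is_word_char:
            word.append(c)
        elif word:  # a non-word character ends the current word
            tokens.append("".join(word))
            word.clear()
        if not is_word_char and not c.isspace():  # punctuation is its own token
            tokens.append(c)
    if word:  # flush the trailing word
        tokens.append("".join(word))
    return tokens


sample = "Hello, world! snake_case x2"
loop_tokens = space_punct_loop(sample)
regex_tokens = SPACE_PUNCT.findall(sample)
```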
class StringSplitTokenizer(BaseTokenizer):
"""Splits a string by a given separator."""
def __init__(self, separator: str = " ", **kwargs):
self.separator = separator
def __call__(self, text):
return text.split(self.separator)
class NgramTokenizer(BaseTokenizer):
"""Tokenizes text into unigrams + ngrams up to n."""
def __init__(self, n: int = 2, **kwargs):
self.n = n
def __call__(self, text):
tokens = text.strip().split()
result = list(tokens)
for i in range(2, self.n + 1):
for j in range(len(tokens) - i + 1):
result.append(" ".join(tokens[j : j + i]))
return result
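# Illustrative sketch, not part of Ludwig: the output order produced by the
# NgramTokenizer logic above. All unigrams come first, then every bigram in
# sentence order, then every trigram, and so on up to n. ngrams_up_to() is a
# hypothetical standalone copy of the __call__ body.

```python
def ngrams_up_to(text, n=2):
    """Return unigrams followed by all 2..n word n-grams, joined by spaces."""
    tokens = text.strip().split()
    result = list(tokens)
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start : start + size]))
    return result


grams = ngrams_up_to("the quick brown fox", n=3)
short = ngrams_up_to("solo", n=3)  # too short for any bigram or trigram
```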
class UnderscoreStringToListTokenizer(BaseTokenizer):
def __call__(self, text):
return UNDERSCORE_REGEX.split(text.strip())
class CommaStringToListTokenizer(BaseTokenizer):
def __call__(self, text):
return COMMA_REGEX.split(text.strip())
class UntokenizedStringToListTokenizer(BaseTokenizer):
def __call__(self, text):
return [text]
class StrippedStringToListTokenizer(BaseTokenizer):
def __call__(self, text):
return [text.strip()]
class EnglishTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("en"))
class EnglishFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("en"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class EnglishRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("en"), filter_stopwords=True)
class EnglishLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("en"), return_lemma=True)
class EnglishLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("en"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class EnglishLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("en"), return_lemma=True, filter_stopwords=True)
class ItalianTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("it"))
class ItalianFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("it"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class ItalianRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("it"), filter_stopwords=True)
class ItalianLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("it"), return_lemma=True)
class ItalianLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("it"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class ItalianLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("it"), return_lemma=True, filter_stopwords=True)
class SpanishTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("es"))
class SpanishFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("es"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class SpanishRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("es"), filter_stopwords=True)
class SpanishLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("es"), return_lemma=True)
class SpanishLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("es"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class SpanishLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("es"), return_lemma=True, filter_stopwords=True)
class GermanTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("de"))
class GermanFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("de"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class GermanRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("de"), filter_stopwords=True)
class GermanLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("de"), return_lemma=True)
class GermanLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("de"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class GermanLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("de"), return_lemma=True, filter_stopwords=True)
class FrenchTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("fr"))
class FrenchFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("fr"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class FrenchRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("fr"), filter_stopwords=True)
class FrenchLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("fr"), return_lemma=True)
class FrenchLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("fr"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class FrenchLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("fr"), return_lemma=True, filter_stopwords=True)
class PortugueseTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pt"))
class PortugueseFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("pt"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class PortugueseRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pt"), filter_stopwords=True)
class PortugueseLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pt"), return_lemma=True)
class PortugueseLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("pt"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class PortugueseLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pt"), return_lemma=True, filter_stopwords=True)
class DutchTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nl"))
class DutchFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("nl"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class DutchRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nl"), filter_stopwords=True)
class DutchLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nl"), return_lemma=True)
class DutchLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("nl"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class DutchLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nl"), return_lemma=True, filter_stopwords=True)
class GreekTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("el"))
class GreekFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("el"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class GreekRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("el"), filter_stopwords=True)
class GreekLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("el"), return_lemma=True)
class GreekLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("el"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class GreekLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("el"), return_lemma=True, filter_stopwords=True)
class NorwegianTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nb"))
class NorwegianFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("nb"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class NorwegianRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nb"), filter_stopwords=True)
class NorwegianLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nb"), return_lemma=True)
class NorwegianLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("nb"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class NorwegianLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("nb"), return_lemma=True, filter_stopwords=True)
class LithuanianTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("lt"))
class LithuanianFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("lt"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class LithuanianRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("lt"), filter_stopwords=True)
class LithuanianLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("lt"), return_lemma=True)
class LithuanianLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("lt"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class LithuanianLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("lt"), return_lemma=True, filter_stopwords=True)
class DanishTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("da"))
class DanishFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("da"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class DanishRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("da"), filter_stopwords=True)
class DanishLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("da"), return_lemma=True)
class DanishLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("da"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class DanishLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("da"), return_lemma=True, filter_stopwords=True)
class PolishTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pl"))
class PolishFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("pl"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class PolishRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pl"), filter_stopwords=True)
class PolishLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pl"), return_lemma=True)
class PolishLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("pl"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class PolishLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("pl"), return_lemma=True, filter_stopwords=True)
class RomanianTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("ro"))
class RomanianFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("ro"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class RomanianRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("ro"), filter_stopwords=True)
class RomanianLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("ro"), return_lemma=True)
class RomanianLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("ro"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class RomanianLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("ro"), return_lemma=True, filter_stopwords=True)
class JapaneseTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("jp"))
class JapaneseFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("jp"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class JapaneseRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("jp"), filter_stopwords=True)
class JapaneseLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("jp"), return_lemma=True)
class JapaneseLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("jp"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class JapaneseLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("jp"), return_lemma=True, filter_stopwords=True)
class ChineseTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("zh"))
class ChineseFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("zh"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class ChineseRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("zh"), filter_stopwords=True)
class ChineseLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("zh"), return_lemma=True)
class ChineseLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("zh"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class ChineseLemmatizeRemoveStopwordsFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("zh"), return_lemma=True, filter_stopwords=True)
class MultiTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("xx"))
class MultiFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text, load_nlp_pipeline("xx"), filter_numbers=True, filter_punctuation=True, filter_short_tokens=True
)
class MultiRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("xx"), filter_stopwords=True)
class MultiLemmatizeTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("xx"), return_lemma=True)
class MultiLemmatizeFilterTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(
text,
load_nlp_pipeline("xx"),
return_lemma=True,
filter_numbers=True,
filter_punctuation=True,
filter_short_tokens=True,
)
class MultiLemmatizeRemoveStopwordsTokenizer(BaseTokenizer):
def __call__(self, text):
return process_text(text, load_nlp_pipeline("xx"), return_lemma=True, filter_stopwords=True)
class HFTokenizer(BaseTokenizer):
def __init__(self, pretrained_model_name_or_path, **kwargs):
super().__init__()
from transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=kwargs.get("trust_remote_code", False),
)
# Some models (e.g. LLaMA) don't have a pad_token by default.
# Set it to eos_token to avoid NoneType errors in preprocessing.
if self.tokenizer.pad_token is None and self.tokenizer.eos_token is not None:
self.tokenizer.pad_token = self.tokenizer.eos_token
def __call__(self, text):
return self.tokenizer.encode(text, truncation=True)
def get_vocab(self):
return self.tokenizer.get_vocab()
def get_pad_token(self) -> str:
return self.tokenizer.pad_token
def get_unk_token(self) -> str:
return self.tokenizer.unk_token
def convert_token_to_id(self, token: str) -> int:
if token is None:
return 0
return self.tokenizer.convert_tokens_to_ids(token)
tokenizer_registry = {
# Torchscript-compatible tokenizers.
"space": SpaceStringToListTokenizer,
"space_punct": SpacePunctuationStringToListTokenizer,
# Tokenizers not compatible with torchscript
"characters": CharactersToListTokenizer,
"underscore": UnderscoreStringToListTokenizer,
"comma": CommaStringToListTokenizer,
"untokenized": UntokenizedStringToListTokenizer,
"stripped": StrippedStringToListTokenizer,
"english_tokenize": EnglishTokenizer,
"english_tokenize_filter": EnglishFilterTokenizer,
"english_tokenize_remove_stopwords": EnglishRemoveStopwordsTokenizer,
"english_lemmatize": EnglishLemmatizeTokenizer,
"english_lemmatize_filter": EnglishLemmatizeFilterTokenizer,
"english_lemmatize_remove_stopwords": EnglishLemmatizeRemoveStopwordsTokenizer,
"italian_tokenize": ItalianTokenizer,
"italian_tokenize_filter": ItalianFilterTokenizer,
"italian_tokenize_remove_stopwords": ItalianRemoveStopwordsTokenizer,
"italian_lemmatize": ItalianLemmatizeTokenizer,
"italian_lemmatize_filter": ItalianLemmatizeFilterTokenizer,
"italian_lemmatize_remove_stopwords": ItalianLemmatizeRemoveStopwordsTokenizer,
"spanish_tokenize": SpanishTokenizer,
"spanish_tokenize_filter": SpanishFilterTokenizer,
"spanish_tokenize_remove_stopwords": SpanishRemoveStopwordsTokenizer,
"spanish_lemmatize": SpanishLemmatizeTokenizer,
"spanish_lemmatize_filter": SpanishLemmatizeFilterTokenizer,
"spanish_lemmatize_remove_stopwords": SpanishLemmatizeRemoveStopwordsTokenizer,
"german_tokenize": GermanTokenizer,
"german_tokenize_filter": GermanFilterTokenizer,
"german_tokenize_remove_stopwords": GermanRemoveStopwordsTokenizer,
"german_lemmatize": GermanLemmatizeTokenizer,
"german_lemmatize_filter": GermanLemmatizeFilterTokenizer,
"german_lemmatize_remove_stopwords": GermanLemmatizeRemoveStopwordsTokenizer,
"french_tokenize": FrenchTokenizer,
"french_tokenize_filter": FrenchFilterTokenizer,
"french_tokenize_remove_stopwords": FrenchRemoveStopwordsTokenizer,
"french_lemmatize": FrenchLemmatizeTokenizer,
"french_lemmatize_filter": FrenchLemmatizeFilterTokenizer,
"french_lemmatize_remove_stopwords": FrenchLemmatizeRemoveStopwordsTokenizer,
"portuguese_tokenize": PortugueseTokenizer,
"portuguese_tokenize_filter": PortugueseFilterTokenizer,
"portuguese_tokenize_remove_stopwords": PortugueseRemoveStopwordsTokenizer,
"portuguese_lemmatize": PortugueseLemmatizeTokenizer,
"portuguese_lemmatize_filter": PortugueseLemmatizeFilterTokenizer,
"portuguese_lemmatize_remove_stopwords": PortugueseLemmatizeRemoveStopwordsTokenizer,
"dutch_tokenize": DutchTokenizer,
"dutch_tokenize_filter": DutchFilterTokenizer,
"dutch_tokenize_remove_stopwords": DutchRemoveStopwordsTokenizer,
"dutch_lemmatize": DutchLemmatizeTokenizer,
"dutch_lemmatize_filter": DutchLemmatizeFilterTokenizer,
"dutch_lemmatize_remove_stopwords": DutchLemmatizeRemoveStopwordsTokenizer,
"greek_tokenize": GreekTokenizer,
"greek_tokenize_filter": GreekFilterTokenizer,
"greek_tokenize_remove_stopwords": GreekRemoveStopwordsTokenizer,
"greek_lemmatize": GreekLemmatizeTokenizer,
"greek_lemmatize_filter": GreekLemmatizeFilterTokenizer,
"greek_lemmatize_remove_stopwords": GreekLemmatizeRemoveStopwordsFilterTokenizer,
"norwegian_tokenize": NorwegianTokenizer,
"norwegian_tokenize_filter": NorwegianFilterTokenizer,
"norwegian_tokenize_remove_stopwords": NorwegianRemoveStopwordsTokenizer,
"norwegian_lemmatize": NorwegianLemmatizeTokenizer,
"norwegian_lemmatize_filter": NorwegianLemmatizeFilterTokenizer,
"norwegian_lemmatize_remove_stopwords": NorwegianLemmatizeRemoveStopwordsFilterTokenizer,
"lithuanian_tokenize": LithuanianTokenizer,
"lithuanian_tokenize_filter": LithuanianFilterTokenizer,
"lithuanian_tokenize_remove_stopwords": LithuanianRemoveStopwordsTokenizer,
"lithuanian_lemmatize": LithuanianLemmatizeTokenizer,
"lithuanian_lemmatize_filter": LithuanianLemmatizeFilterTokenizer,
"lithuanian_lemmatize_remove_stopwords": LithuanianLemmatizeRemoveStopwordsFilterTokenizer,
"danish_tokenize": DanishTokenizer,
"danish_tokenize_filter": DanishFilterTokenizer,
"danish_tokenize_remove_stopwords": DanishRemoveStopwordsTokenizer,
"danish_lemmatize": DanishLemmatizeTokenizer,
"danish_lemmatize_filter": DanishLemmatizeFilterTokenizer,
"danish_lemmatize_remove_stopwords": DanishLemmatizeRemoveStopwordsFilterTokenizer,
"polish_tokenize": PolishTokenizer,
"polish_tokenize_filter": PolishFilterTokenizer,
"polish_tokenize_remove_stopwords": PolishRemoveStopwordsTokenizer,
"polish_lemmatize": PolishLemmatizeTokenizer,
"polish_lemmatize_filter": PolishLemmatizeFilterTokenizer,
"polish_lemmatize_remove_stopwords": PolishLemmatizeRemoveStopwordsFilterTokenizer,
"romanian_tokenize": RomanianTokenizer,
"romanian_tokenize_filter": RomanianFilterTokenizer,
"romanian_tokenize_remove_stopwords": RomanianRemoveStopwordsTokenizer,
"romanian_lemmatize": RomanianLemmatizeTokenizer,
"romanian_lemmatize_filter": RomanianLemmatizeFilterTokenizer,
"romanian_lemmatize_remove_stopwords": RomanianLemmatizeRemoveStopwordsFilterTokenizer,
"japanese_tokenize": JapaneseTokenizer,
"japanese_tokenize_filter": JapaneseFilterTokenizer,
"japanese_tokenize_remove_stopwords": JapaneseRemoveStopwordsTokenizer,
"japanese_lemmatize": JapaneseLemmatizeTokenizer,
"japanese_lemmatize_filter": JapaneseLemmatizeFilterTokenizer,
"japanese_lemmatize_remove_stopwords": JapaneseLemmatizeRemoveStopwordsFilterTokenizer,
"chinese_tokenize": ChineseTokenizer,
"chinese_tokenize_filter": ChineseFilterTokenizer,
"chinese_tokenize_remove_stopwords": ChineseRemoveStopwordsTokenizer,
"chinese_lemmatize": ChineseLemmatizeTokenizer,
"chinese_lemmatize_filter": ChineseLemmatizeFilterTokenizer,
"chinese_lemmatize_remove_stopwords": ChineseLemmatizeRemoveStopwordsFilterTokenizer,
"multi_tokenize": MultiTokenizer,
"multi_tokenize_filter": MultiFilterTokenizer,
"multi_tokenize_remove_stopwords": MultiRemoveStopwordsTokenizer,
"multi_lemmatize": MultiLemmatizeTokenizer,
"multi_lemmatize_filter": MultiLemmatizeFilterTokenizer,
"multi_lemmatize_remove_stopwords": MultiLemmatizeRemoveStopwordsTokenizer,
}
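# Illustrative sketch, not part of Ludwig: how a name-to-class registry such as
# tokenizer_registry is typically consumed. The class is looked up by its config
# name and instantiated with keyword arguments, which each tokenizer is free to
# ignore through **kwargs. The classes and build_tokenizer() below are
# hypothetical; the real lookup goes through get_tokenizer_from_registry(), whose
# body is not shown here.

```python
class SplitOnSpace:
    def __init__(self, **kwargs):
        pass  # unused keyword arguments are accepted and ignored

    def __call__(self, text):
        return text.split()


class SplitOnComma:
    def __init__(self, **kwargs):
        pass

    def __call__(self, text):
        return [part.strip() for part in text.split(",")]


sketch_registry = {"space": SplitOnSpace, "comma": SplitOnComma}


def build_tokenizer(name, **kwargs):
    """Look up a tokenizer class by name and instantiate it."""
    try:
        tokenizer_cls = sketch_registry[name]
    except KeyError:
        raise ValueError(f"Unknown tokenizer: {name!r}") from None
    return tokenizer_cls(**kwargs)


tokenizer = build_tokenizer("comma", vocab_file=None, pretrained_model_name_or_path=None)
tokens = tokenizer("red, green ,blue")
```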
class SentencePieceTokenizer(torch.nn.Module):
"""SentencePiece tokenizer using HuggingFace transformers (XLMR-based)."""
def __init__(self, pretrained_model_name_or_path: str | None = None, **kwargs):
super().__init__()
from transformers import AutoTokenizer
if pretrained_model_name_or_path is None:
pretrained_model_name_or_path = "xlm-roberta-base"
self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
def forward(self, v: str | list[str] | torch.Tensor):
if isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")
if isinstance(v, str):
return self.tokenizer.tokenize(v)
return [self.tokenizer.tokenize(s) for s in v]
class CLIPTokenizer(torch.nn.Module):
"""CLIP tokenizer using HuggingFace transformers."""
def __init__(self, pretrained_model_name_or_path: str | None = None, **kwargs):
super().__init__()
from transformers import CLIPTokenizer as HFCLIPTokenizer
if pretrained_model_name_or_path is None:
pretrained_model_name_or_path = "openai/clip-vit-base-patch32"
self.tokenizer = HFCLIPTokenizer.from_pretrained(pretrained_model_name_or_path)
def __call__(self, text):
if isinstance(text, str):
return self.tokenizer.tokenize(text)
return [self.tokenizer.tokenize(t) for t in text]
def get_vocab(self):
return self.tokenizer.get_vocab()
class GPT2BPETokenizer(torch.nn.Module):
"""GPT-2 BPE tokenizer using HuggingFace transformers."""
def __init__(self, pretrained_model_name_or_path: str | None = None, **kwargs):
super().__init__()
from transformers import GPT2Tokenizer
if pretrained_model_name_or_path is None:
pretrained_model_name_or_path = "gpt2"
self.tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path)
def __call__(self, text):
if isinstance(text, str):
return self.tokenizer.tokenize(text)
return [self.tokenizer.tokenize(t) for t in text]
def get_vocab(self):
return self.tokenizer.get_vocab()
class BERTTokenizer(torch.nn.Module):
"""BERT tokenizer using HuggingFace transformers."""
def __init__(
self,
vocab_file: str | None = None,
pretrained_model_name_or_path: str | None = None,
is_hf_tokenizer: bool | None = False,
do_lower_case: bool | None = None,
**kwargs,
):
super().__init__()
from transformers import BertTokenizer
if pretrained_model_name_or_path is None:
pretrained_model_name_or_path = "bert-base-uncased"
tokenizer_kwargs = {}
if do_lower_case is not None:
tokenizer_kwargs["do_lower_case"] = do_lower_case
self.tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path, **tokenizer_kwargs)
self.is_hf_tokenizer = is_hf_tokenizer
self.pad_token = self.tokenizer.pad_token
self.unk_token = self.tokenizer.unk_token
self.cls_token_id = self.tokenizer.cls_token_id
self.sep_token_id = self.tokenizer.sep_token_id
def __call__(self, text):
if isinstance(text, str):
texts = [text]
else:
texts = text
if self.is_hf_tokenizer:
results = [self.tokenizer.encode(t) for t in texts]
else:
results = [self.tokenizer.tokenize(t) for t in texts]
return results[0] if isinstance(text, str) else results
def get_vocab(self):
return self.tokenizer.get_vocab()
def get_pad_token(self) -> str:
return self.pad_token
def get_unk_token(self) -> str:
return self.unk_token
def convert_token_to_id(self, token: str) -> int:
if token is None:
return 0
return self.tokenizer.convert_tokens_to_ids(token)
tokenizer_registry.update(
{
"sentencepiece": SentencePieceTokenizer,
"clip": CLIPTokenizer,
"gpt2bpe": GPT2BPETokenizer,
"bert": BERTTokenizer,
}
)
def get_hf_tokenizer(pretrained_model_name_or_path, **kwargs):
"""Gets a HuggingFace-based tokenizer that follows HF convention.
Args:
pretrained_model_name_or_path: Name of the model in the HF repo. Example: "bert-base-uncased".
Returns:
A HF tokenizer.
"""
model_name_lower = pretrained_model_name_or_path.lower()
# Use BERTTokenizer only for actual BERT models, not for models like albert/roberta
# that have "bert" in their name but use different tokenization (SentencePiece, BPE, etc.)
if "bert" in model_name_lower and not any(x in model_name_lower for x in ("albert", "roberta", "distilbert")):
logger.info(f"Loading BERT tokenizer for {pretrained_model_name_or_path}")
return BERTTokenizer(pretrained_model_name_or_path=pretrained_model_name_or_path, is_hf_tokenizer=True)
logger.info(f"Loading HuggingFace tokenizer for {pretrained_model_name_or_path}")
return HFTokenizer(pretrained_model_name_or_path)
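# Illustrative sketch, not part of Ludwig: the substring check used by get_hf_tokenizer above,
# isolated in a hypothetical helper named `uses_bert_tokenizer` so the routing can be tried
# without downloading any model. The helper name and the sample model names are assumptions.

```python
def uses_bert_tokenizer(model_name: str) -> bool:
    """Mirror of the name check above; this helper is hypothetical."""
    name = model_name.lower()
    excluded = ("albert", "roberta", "distilbert")
    return "bert" in name and not any(x in name for x in excluded)


# Print which path each sample name would take under this check.
for candidate in ("bert-base-uncased", "roberta-base", "albert-base-v2", "distilbert-base-uncased", "gpt2"):
    route = "BERTTokenizer" if uses_bert_tokenizer(candidate) else "HFTokenizer"
    print(candidate, "->", route)
```

# Because the check is substring-based, any other name containing "bert" that is not in the
# exclusion tuple (for example "camembert-base") would also take the BERT path in this sketch.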
tokenizer_registry.update(
{
"hf_tokenizer": get_hf_tokenizer,
}
)
def get_tokenizer_from_registry(tokenizer_name: str) -> torch.nn.Module:
"""Returns the appropriate tokenizer from the tokenizer registry."""
if tokenizer_name in tokenizer_registry:
return tokenizer_registry[tokenizer_name]
raise KeyError(f"Invalid tokenizer name: '{tokenizer_name}'. Available tokenizers: {tokenizer_registry.keys()}")
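# Illustrative sketch, not part of Ludwig: the lookup above returns the registered class (or
# factory), not an instance, so the caller instantiates it. The two toy tokenizer classes and
# `demo_registry` below are hypothetical stand-ins for the real registry entries.

```python
class SpaceTokenizer:
    """Toy tokenizer that splits on single spaces."""

    def __call__(self, text: str) -> list[str]:
        return text.split(" ")


class CharactersTokenizer:
    """Toy tokenizer that splits into characters."""

    def __call__(self, text: str) -> list[str]:
        return list(text)


demo_registry = {"space": SpaceTokenizer, "characters": CharactersTokenizer}


def lookup(name: str):
    """Return the registered class, or raise KeyError listing the valid names."""
    if name in demo_registry:
        return demo_registry[name]
    raise KeyError(f"Invalid tokenizer name: '{name}'. Available tokenizers: {sorted(demo_registry)}")


tokenizer_cls = lookup("space")   # a class, not an instance
tokenizer = tokenizer_cls()       # instantiate before use
tokens = tokenizer("hello ludwig world")
```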
================================================
FILE: ludwig/utils/torch_utils.py
================================================
import math
import os
import warnings
from abc import abstractmethod
from functools import lru_cache
import torch
from torch import nn
from torch.nn import Module, ModuleDict
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import ENCODER_OUTPUT
from ludwig.utils.strings_utils import SpecialSymbol
_TORCH_INIT_PARAMS: tuple | None = None
@DeveloperAPI
def get_torch_device():
if torch.cuda.is_available() and torch.cuda.device_count() > 0:
# Use cublasLt for batched GEMM operations. The default cublas library has known
# bugs with cublasSgemmStridedBatched on certain GPU/driver combinations.
torch.backends.cuda.preferred_blas_library("cublaslt")
return "cuda"
if bool(os.environ.get("LUDWIG_ENABLE_MPS")):
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
if not bool(os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK")):
warnings.warn(
"LUDWIG_ENABLE_MPS is set and MPS is available, but PYTORCH_ENABLE_MPS_FALLBACK has not been set. "
"Depending on your model config, some operations may not be compatible. If errors occur, try "
"setting `PYTORCH_ENABLE_MPS_FALLBACK=1` and resubmitting."
)
return "mps"
else:
warnings.warn("LUDWIG_ENABLE_MPS is set but MPS is not available, falling back to CPU.")
return "cpu"
DEVICE = get_torch_device()
@DeveloperAPI
def place_on_device(x, device):
"""Recursively places the input on the specified device."""
if isinstance(x, list):
return [place_on_device(xi, device) for xi in x]
elif isinstance(x, dict):
return {k: place_on_device(v, device) for k, v in x.items()}
elif isinstance(x, set):
return {place_on_device(xi, device) for xi in x}
elif isinstance(x, tuple):
return tuple(place_on_device(xi, device) for xi in x)
elif isinstance(x, torch.Tensor):
return x.to(device)
else:
return x
@DeveloperAPI
def sequence_length_2D(sequence: torch.Tensor) -> torch.Tensor:
"""Returns the number of non-padding elements per sequence in batch.
:param sequence: (torch.Tensor) A 2D tensor of shape [batch size x max sequence length].
# Return
:returns: (torch.Tensor) The count of non-padding elements per sequence.
"""
used = (sequence != SpecialSymbol.PADDING.value).type(torch.int32)
length = torch.sum(used, 1)
return length
@DeveloperAPI
def sequence_length_3D(sequence: torch.Tensor) -> torch.Tensor:
"""Returns the number of non-zero elements per sequence in batch.
:param sequence: (torch.Tensor) A 3D tensor of shape [batch size x max sequence length x hidden size].
# Return
:returns: (torch.Tensor) The count of timesteps with a non-zero hidden vector per sequence.
"""
used = torch.sign(torch.amax(torch.abs(sequence), dim=2))
length = torch.sum(used, 1)
length = length.int()
return length
@DeveloperAPI
def sequence_mask(lengths: torch.Tensor, maxlen: int | None = None, dtype: torch.dtype = torch.bool):
"""Returns a mask of shape (batch_size x maxlen), where mask[i] is True for each element up to lengths[i],
otherwise False, i.e. if maxlen=5 and lengths[i] = 3, mask[i] = [True, True, True, False, False].
:param lengths: (torch.Tensor) A 1d integer tensor of shape [batch size].
:param maxlen: (Optional[int]) The maximum sequence length. If not specified, the max(lengths) is used.
:param dtype: (type) The type to output.
# Return
:returns: (torch.Tensor) A sequence mask tensor of shape (batch_size x maxlen).
"""
if maxlen is None:
maxlen = lengths.max()
matrix = torch.unsqueeze(lengths, dim=-1)
row_vector = torch.arange(0, maxlen, 1, device=lengths.device)
mask = row_vector < matrix
mask = mask.type(dtype)
return mask
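# Illustrative sketch of the broadcasting step used by sequence_mask above, written out with
# made-up lengths. It assumes PyTorch is installed and runs on CPU; the variable names are
# chosen for this example only.

```python
import torch

lengths = torch.tensor([3, 1, 5])          # one length per sequence in the batch
positions = torch.arange(0, 5, 1)          # shape [5]: candidate positions 0..4
column = lengths.unsqueeze(-1)             # shape [3, 1]
mask = positions < column                  # broadcasts to shape [3, 5]

# Summing the mask along the time axis recovers the original lengths.
recovered = mask.sum(dim=1)
```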
@DeveloperAPI
def periodic(inputs: torch.Tensor, period: int) -> torch.Tensor:
"""Returns periodic representation assuming 0 is start of period."""
return torch.cos(inputs * 2 * math.pi / period)
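# Worked example of the formula in periodic above, cos(2 * pi * x / period), using plain floats
# instead of tensors. `periodic_scalar` is a hypothetical scalar stand-in and the hour-of-day
# values are made up for illustration.

```python
import math


def periodic_scalar(value: float, period: int) -> float:
    """Scalar version of the cosine encoding, for illustration only."""
    return math.cos(value * 2 * math.pi / period)


hours = [0, 6, 12, 18, 24]
encoded = [periodic_scalar(h, 24) for h in hours]
for hour, value in zip(hours, encoded):
    print(f"hour={hour:2d} -> {value:+.3f}")
```

# Hours 0 and 24 map to the same value, which is the point of a periodic encoding. Note that a
# cosine alone also maps hours 6 and 18 to the same value, so it does not distinguish the two
# halves of the period.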
initializer_registry = {
"uniform": nn.init.uniform_,
"normal": nn.init.normal_,
"constant": nn.init.constant_,
"ones": nn.init.ones_,
"zeros": nn.init.zeros_,
"eye": nn.init.eye_,
"dirac": nn.init.dirac_,
"xavier_uniform": nn.init.xavier_uniform_,
"xavier_normal": nn.init.xavier_normal_,
"kaiming_uniform": nn.init.kaiming_uniform_,
"kaiming_normal": nn.init.kaiming_normal_,
"orthogonal": nn.init.orthogonal_,
"sparse": nn.init.sparse_,
"identity": nn.init.eye_,
}
activations = {
"elu": nn.ELU,
"leakyRelu": nn.LeakyReLU,
"logSigmoid": nn.LogSigmoid,
"relu": nn.ReLU,
"sigmoid": nn.Sigmoid,
"tanh": nn.Tanh,
"softmax": nn.Softmax,
None: nn.Identity,
}
@DeveloperAPI
def get_activation(activation):
return activations[activation]()
@DeveloperAPI
def reg_loss(model: nn.Module, regularizer: str, l1: float = 0.01, l2: float = 0.01):
"""Computes the regularization loss for a given model.
Parameters:
model: torch.nn.Module object to compute regularization loss for.
regularizer: regularizer to use (currently l1, l2 and l1_l2 supported).
l1: L1 regularization coefficient.
l2: L2 regularization coefficient.
Returns:
Regularization loss for the model (float).
"""
if regularizer == "l1":
l1_reg = l1 * sum(torch.abs(p).sum() for p in model.parameters())
return l1_reg
if regularizer == "l2":
l2_reg = l2 * sum(torch.square(p).sum() for p in model.parameters())
return l2_reg
if regularizer == "l1_l2":
l1_reg = l1 * sum(torch.abs(p).sum() for p in model.parameters())
l2_reg = l2 * sum(torch.square(p).sum() for p in model.parameters())
return l1_reg + l2_reg
@DeveloperAPI
class LudwigModule(Module):
def __init__(self):
super().__init__()
self._losses = {}
self.register_buffer("device_tensor", torch.zeros(0), persistent=False)
@property
def device(self):
return self.device_tensor.device
def prepare_for_training(self):
"""This is called from within the Trainer object to do any final instantiation before model training."""
def losses(self):
collected_losses = []
for loss in self._losses.values():
collected_losses.append(loss)
for child in self.children():
if isinstance(child, LudwigModule):
collected_losses.extend(child.losses())
elif isinstance(child, ModuleDict):
for c in child.values():
if hasattr(c, "losses"): # Some modules, i.e. SequenceReducers, don't have losses.
collected_losses.extend(c.losses())
elif isinstance(child, Module):
pass
else:
raise ValueError
return collected_losses
def update_loss(self, key: str, loss: torch.Tensor):
"""This should be called in the forward pass to add a custom loss term to the combined loss."""
self._losses[key] = loss
@property
def input_dtype(self):
return torch.float32
@property
@abstractmethod
def input_shape(self) -> torch.Size:
"""Returns size of the input tensor without the batch dimension."""
# raise NotImplementedError("Abstract class.")
@property
def output_shape(self) -> torch.Size:
"""Returns size of the output tensor without the batch dimension."""
return self._computed_output_shape()
@lru_cache(maxsize=1)
def _computed_output_shape(self) -> torch.Size:
dummy_input = torch.rand(2, *self.input_shape, device=self.device)
output_tensor = self.forward(dummy_input.type(self.input_dtype))
if isinstance(output_tensor, torch.Tensor):
return output_tensor.size()[1:]
elif isinstance(output_tensor, dict) and ENCODER_OUTPUT in output_tensor:
return output_tensor[ENCODER_OUTPUT].size()[1:]
else:
raise ValueError("Unknown output tensor type.")
def freeze_parameters(module: nn.Module):
"""Freezes the parameters of a torch module."""
for p in module.parameters():
p.requires_grad = False
@DeveloperAPI
class FreezeModule(nn.Module):
def __init__(self, module: nn.Module, frozen: bool):
super().__init__()
if frozen:
freeze_parameters(module)
module.eval()
else:
module.train()
self.module = module
self.frozen = frozen
def train(self, mode: bool = True):
if self.frozen:
# Ignores any attempt to set params trainable
return self
return super().train(mode)
@DeveloperAPI
class Dense(LudwigModule):
def __init__(
self,
input_size,
output_size,
use_bias=True,
weights_initializer="xavier_uniform",
bias_initializer="zeros",
):
super().__init__()
self.dense = nn.Linear(in_features=input_size, out_features=output_size, bias=use_bias)
weights_initializer = initializer_registry[weights_initializer]
weights_initializer(self.dense.weight)
if use_bias:
bias_initializer = initializer_registry[bias_initializer]
bias_initializer(self.dense.bias)
@property
def input_shape(self) -> torch.Size:
return torch.Size([self.dense.in_features])
def forward(self, input: torch.Tensor) -> torch.Tensor:
output = torch.squeeze(self.dense(input), dim=-1)
return output
@DeveloperAPI
def initialize_pytorch(
gpus: int | str | list[int] | None = None,
gpu_memory_limit: float | None = None,
allow_parallel_threads: bool = True,
):
param_tuple = (gpus, gpu_memory_limit, allow_parallel_threads)
if _TORCH_INIT_PARAMS is not None:
if _TORCH_INIT_PARAMS != param_tuple:
warnings.warn(
"PyTorch has already been initialized. Changes to `gpus`, "
"`gpu_memory_limit`, and `allow_parallel_threads` will be ignored. "
"Start a new Python process to modify these values."
)
return
# For reproducibility / determinism, set parallel threads to 1.
# For performance, leave unset to allow PyTorch to select the best value automatically.
if not allow_parallel_threads:
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
if torch.cuda.is_available() and torch.cuda.device_count() > 0:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
if isinstance(gpus, int):
gpus = [gpus]
elif isinstance(gpus, str):
gpus = gpus.strip()
gpus = [int(g) for g in gpus.split(",")]
if gpus and len(gpus) == 1 and gpus[0] == -1:
# CUDA_VISIBLE_DEVICES syntax for disabling all GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = ""
elif torch.cuda.is_available() and torch.cuda.device_count() > 0:
# Set visible devices so GPU utilization is isolated
# (no GPU contention between workers).
if gpus is not None:
if len(gpus) == 1:
torch.cuda.set_device(gpus[0])
elif len(gpus) > 1:
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpus)
# Limit the amount of memory that can be consumed per GPU
if gpu_memory_limit is not None:
for gpu in gpus or range(torch.cuda.device_count()):
torch.cuda.memory.set_per_process_memory_fraction(gpu_memory_limit, gpu)
_set_torch_init_params(param_tuple)
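# Illustrative sketch, not part of Ludwig: the `gpus` normalization performed inside
# initialize_pytorch above, pulled out into two hypothetical helpers so the accepted input
# forms can be checked without touching CUDA state or environment variables.

```python
def normalize_gpus(gpus):
    """Turn an int or a comma-separated string into a list of ints; pass lists and None through."""
    if isinstance(gpus, int):
        return [gpus]
    if isinstance(gpus, str):
        return [int(g) for g in gpus.strip().split(",")]
    return gpus


def disables_all_gpus(gpus) -> bool:
    """True when the normalized value is the single sentinel -1."""
    return bool(gpus) and len(gpus) == 1 and gpus[0] == -1


for raw in (0, "0,1", " 2, 3 ", -1, "-1", None, [1, 2]):
    parsed = normalize_gpus(raw)
    print(repr(raw), "->", parsed, "| disable all:", disables_all_gpus(parsed))
```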
def _set_torch_init_params(params: tuple | None):
global _TORCH_INIT_PARAMS
_TORCH_INIT_PARAMS = params
def _get_torch_init_params() -> tuple | None:
return _TORCH_INIT_PARAMS
@DeveloperAPI
def model_size(model: nn.Module):
"""Computes PyTorch model size in bytes."""
size = 0
size += sum(param.nelement() * param.element_size() for param in model.parameters())
size += sum(buffer.nelement() * buffer.element_size() for buffer in model.buffers())
return size
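# Worked example of the byte count computed by model_size above. It assumes PyTorch is
# installed and that the default dtype is float32 (4 bytes per element). `TinyModel` and
# `size_in_bytes` are hypothetical names used only for this sketch.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)                            # 10*5 weights + 5 biases = 55 elements
        self.register_buffer("running_offset", torch.zeros(8))    # 8 buffer elements


def size_in_bytes(model: nn.Module) -> int:
    """Sum of element count times element size over parameters and buffers."""
    size = sum(p.nelement() * p.element_size() for p in model.parameters())
    size += sum(b.nelement() * b.element_size() for b in model.buffers())
    return size


model = TinyModel()
total = size_in_bytes(model)   # (55 + 8) elements * 4 bytes each
```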
================================================
FILE: ludwig/utils/trainer_utils.py
================================================
import logging
import re
from collections import defaultdict
from typing import TYPE_CHECKING
from typing import Literal
from ludwig.api_annotations import DeveloperAPI
from ludwig.constants import AUTO, COMBINED, LOSS
from ludwig.models.base import BaseModel
from ludwig.models.ecd import ECD
from ludwig.models.llm import LLM
from ludwig.modules.metric_modules import get_best_function
from ludwig.schema.trainer import ECDTrainerConfig, FineTuneTrainerConfig
from ludwig.utils.data_utils import save_json
from ludwig.utils.metric_utils import TrainerMetric
if TYPE_CHECKING:
from ludwig.features.base_feature import OutputFeature
from ludwig.schema.trainer import BaseTrainerConfig
logger = logging.getLogger(__name__)
@DeveloperAPI
def initialize_trainer_metric_dict(output_features) -> dict[str, dict[str, list[TrainerMetric]]]:
"""Returns a dict of dict of metrics, output_feature_name -> metric_name -> List[TrainerMetric]."""
metrics = defaultdict(lambda: defaultdict(list))
return metrics
def get_latest_metrics_dict(
progress_tracker_metrics: dict[str, dict[str, list[TrainerMetric]]],
) -> dict[str, dict[str, float]]:
"""Returns a dict of field name -> metric name -> latest metric value."""
latest_metrics_dict = defaultdict(dict)
for feature_name, metrics_dict in progress_tracker_metrics.items():
for metric_name, metrics in metrics_dict.items():
if metrics:
# Metrics may be missing if computing them raised an exception, if the metrics are entirely empty
# due to a missing subset, or if evaluate_training_set is False.
latest_metrics_dict[feature_name][metric_name] = metrics[-1][-1]
return latest_metrics_dict
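# Illustrative sketch of why get_latest_metrics_dict above reads the value with `metrics[-1][-1]`
# rather than a named field: after a JSON round trip a namedtuple comes back as a plain list, so
# only positional indexing works in both cases. `Metric` is a hypothetical stand-in for
# TrainerMetric, and its (epoch, step, value) field order is an assumption.

```python
import json
from collections import namedtuple

Metric = namedtuple("Metric", ["epoch", "step", "value"])

history = {"label": {"accuracy": [Metric(0, 100, 0.71), Metric(1, 200, 0.78)]}}

# Serialize and deserialize, as happens when training resumes from a saved progress file.
restored = json.loads(json.dumps(history))

latest_before = history["label"]["accuracy"][-1][-1]    # namedtuple, indexed by position
latest_after = restored["label"]["accuracy"][-1][-1]    # plain list, indexed by position
restored_entry = restored["label"]["accuracy"][-1]

# Integer dictionary keys are also turned into strings by the round trip.
tokens_by_step = json.loads(json.dumps({100: 2048}))
```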
@DeveloperAPI
def get_new_progress_tracker(
batch_size: int,
best_eval_metric_value: float,
best_increase_batch_size_eval_metric: float,
learning_rate: float,
output_features: dict[str, "OutputFeature"],
):
"""Returns a new instance of a ProgressTracker with empty metrics."""
return ProgressTracker(
epoch=0,
batch_size=batch_size,
steps=0,
tune_checkpoint_num=0,
checkpoint_number=0,
best_eval_metric_steps=0,
best_eval_metric_epoch=0,
best_eval_metric_checkpoint_number=0,
last_learning_rate_reduction_steps=0,
last_increase_batch_size_steps=0,
last_improvement_steps=0,
best_eval_metric_value=best_eval_metric_value,
best_increase_batch_size_eval_metric=best_increase_batch_size_eval_metric,
last_increase_batch_size_eval_metric_improvement=0,
learning_rate=learning_rate,
num_reductions_learning_rate=0,
num_increases_batch_size=0,
train_metrics=initialize_trainer_metric_dict(output_features),
validation_metrics=initialize_trainer_metric_dict(output_features),
test_metrics=initialize_trainer_metric_dict(output_features),
last_learning_rate_reduction=0,
last_increase_batch_size=0,
best_eval_train_metrics={},
best_eval_validation_metrics={},
best_eval_test_metrics={},
llm_eval_examples={},
checkpoint_to_step={},
checkpoint_to_epoch={},
incremental_step_token_usage={},
cumulative_step_token_usage={},
incremental_checkpoint_token_usage={},
cumulative_checkpoint_token_usage={},
total_tokens_used=0,
)
@DeveloperAPI
class ProgressTracker:
def __init__(
self,
epoch: int,
batch_size: int,
steps: int,
tune_checkpoint_num: int,
checkpoint_number: int,
best_eval_metric_steps: int,
best_eval_metric_epoch: int,
best_eval_metric_checkpoint_number: int,
last_improvement_steps: int,
last_learning_rate_reduction_steps: int,
last_increase_batch_size_steps: int,
best_eval_metric_value: float,
best_increase_batch_size_eval_metric: float,
last_increase_batch_size_eval_metric_improvement: int,
learning_rate: float,
num_reductions_learning_rate: int,
num_increases_batch_size: int,
train_metrics: dict[str, dict[str, list[TrainerMetric]]],
validation_metrics: dict[str, dict[str, list[TrainerMetric]]],
test_metrics: dict[str, dict[str, list[TrainerMetric]]],
last_learning_rate_reduction: int,
last_increase_batch_size: int,
best_eval_train_metrics: dict[str, dict[str, float]],
best_eval_validation_metrics: dict[str, dict[str, float]],
best_eval_test_metrics: dict[str, dict[str, float]],
llm_eval_examples: dict[str, list[str]] | None = None,
checkpoint_to_step: dict[str, int] | None = None,
checkpoint_to_epoch: dict[str, int] | None = None,
incremental_step_token_usage: dict[str, int] | None = None,
cumulative_step_token_usage: dict[str, int] | None = None,
incremental_checkpoint_token_usage: dict[str, int] | None = None,
cumulative_checkpoint_token_usage: dict[str, int] | None = None,
total_tokens_used: int = 0,
):
"""JSON-serializable holder object that stores information related to training progress.
[train/validation/test]_metrics is a nested dictionary of TrainerMetrics: feature_name -> metric_name ->
List[TrainerMetrics], with one entry per training checkpoint.
When the model is saved, all of the progress tracker's attributes are serialized to JSON as
`training_progress.json` under the model output directory.
JSON serialization automatically converts all dictionary top-level keys to strings, and the string typing
is preserved when the progress tracker is deserialized from JSON when model resumes training from a checkpoint.
For this reason, all of the dictionary attributes of the progress tracker are keyed by strings to ensure a
consistent interface before or after deserialization. For example, the `tokens` dictionaries are keyed by steps,
as strings.
When the progress tracker is deserialized from JSON like when a model resumes training from a checkpoint, the
TrainerMetrics namedtuples are automatically converted into regular (epoch, steps, value) tuples, which is why
in trainer.py, we often use `[-1]` to index into the last element of the TrainerMetric namedtuple to get the
actual metric value instead of the named field.
Args:
epoch: The current epoch number.
steps: The current step of training.
batch_size: The current batch size.
tune_checkpoint_num: The hyperopt checkpoint number (Ray Tune).
checkpoint_number: The current checkpoint number.
best_eval_metric_steps: The step of training that has the best evaluation so far.
best_eval_metric_epoch: The epoch of training that has the best evaluation so far.
best_eval_metric_checkpoint_number: The checkpoint number that has the best evaluation so far.
last_improvement_steps: The number of steps since the last improvement.
last_learning_rate_reduction_steps: The training step of the last learning rate reduction.
last_increase_batch_size_steps: The training step of the last batch size increase.
best_eval_metric_value: The metric value of the best evaluation so far.
best_increase_batch_size_eval_metric:
The metric value of the best evaluation so far, for increasing the batch size.
last_learning_rate_reduction: The number of steps since the last learning rate reduction.
last_increase_batch_size: The number of steps since the last batch size increase.
last_increase_batch_size_eval_metric_improvement:
The number of checkpoints since the last batch size increase.
num_reductions_learning_rate: The number of total reductions in learning rate.
num_increases_batch_size: The number of total increases in batch size.
train_metrics: Training metrics.