Repository: m3dev/gokart
Branch: master
Commit: 0d0609000123
Files: 125
Total size: 466.8 KB
Directory structure:
gitextract_mhb_xs2d/
├── .github/
│ ├── CODEOWNERS
│ └── workflows/
│ ├── format.yml
│ ├── publish.yml
│ └── test.yml
├── .gitignore
├── .readthedocs.yaml
├── LICENSE
├── README.md
├── docs/
│ ├── Makefile
│ ├── conf.py
│ ├── efficient_run_on_multi_workers.rst
│ ├── for_pandas.rst
│ ├── gokart.rst
│ ├── index.rst
│ ├── intro_to_gokart.rst
│ ├── logging.rst
│ ├── make.bat
│ ├── mypy_plugin.rst
│ ├── polars.rst
│ ├── requirements.txt
│ ├── setting_task_parameters.rst
│ ├── slack_notification.rst
│ ├── task_information.rst
│ ├── task_on_kart.rst
│ ├── task_parameters.rst
│ ├── task_settings.rst
│ ├── tutorial.rst
│ └── using_task_task_conflict_prevention_lock.rst
├── examples/
│ ├── gokart_notebook_example.ipynb
│ ├── logging.ini
│ └── param.ini
├── gokart/
│ ├── __init__.py
│ ├── build.py
│ ├── config_params.py
│ ├── conflict_prevention_lock/
│ │ ├── task_lock.py
│ │ └── task_lock_wrappers.py
│ ├── errors/
│ │ └── __init__.py
│ ├── file_processor/
│ │ ├── __init__.py
│ │ ├── base.py
│ │ ├── pandas.py
│ │ └── polars.py
│ ├── file_processor.py
│ ├── gcs_config.py
│ ├── gcs_obj_metadata_client.py
│ ├── gcs_zip_client.py
│ ├── in_memory/
│ │ ├── __init__.py
│ │ ├── data.py
│ │ ├── repository.py
│ │ └── target.py
│ ├── info.py
│ ├── mypy.py
│ ├── object_storage.py
│ ├── pandas_type_config.py
│ ├── parameter.py
│ ├── py.typed
│ ├── required_task_output.py
│ ├── run.py
│ ├── s3_config.py
│ ├── s3_zip_client.py
│ ├── slack/
│ │ ├── __init__.py
│ │ ├── event_aggregator.py
│ │ ├── slack_api.py
│ │ └── slack_config.py
│ ├── target.py
│ ├── task.py
│ ├── task_complete_check.py
│ ├── testing/
│ │ ├── __init__.py
│ │ ├── check_if_run_with_empty_data_frame.py
│ │ └── pandas_assert.py
│ ├── tree/
│ │ ├── task_info.py
│ │ └── task_info_formatter.py
│ ├── utils.py
│ ├── worker.py
│ ├── workspace_management.py
│ ├── zip_client.py
│ └── zip_client_util.py
├── luigi.cfg
├── pyproject.toml
├── test/
│ ├── __init__.py
│ ├── config/
│ │ ├── __init__.py
│ │ ├── pyproject.toml
│ │ ├── pyproject_disallow_missing_parameters.toml
│ │ └── test_config.ini
│ ├── conflict_prevention_lock/
│ │ ├── __init__.py
│ │ ├── test_task_lock.py
│ │ └── test_task_lock_wrappers.py
│ ├── file_processor/
│ │ ├── __init__.py
│ │ ├── test_base.py
│ │ ├── test_factory.py
│ │ ├── test_pandas.py
│ │ └── test_polars.py
│ ├── in_memory/
│ │ ├── test_in_memory_target.py
│ │ └── test_repository.py
│ ├── slack/
│ │ ├── __init__.py
│ │ └── test_slack_api.py
│ ├── test_build.py
│ ├── test_cache_unique_id.py
│ ├── test_config_params.py
│ ├── test_explicit_bool_parameter.py
│ ├── test_gcs_config.py
│ ├── test_gcs_obj_metadata_client.py
│ ├── test_info.py
│ ├── test_large_data_fram_processor.py
│ ├── test_list_task_instance_parameter.py
│ ├── test_mypy.py
│ ├── test_pandas_type_check_framework.py
│ ├── test_pandas_type_config.py
│ ├── test_restore_task_by_id.py
│ ├── test_run.py
│ ├── test_s3_config.py
│ ├── test_s3_zip_client.py
│ ├── test_serializable_parameter.py
│ ├── test_target.py
│ ├── test_task_instance_parameter.py
│ ├── test_task_on_kart.py
│ ├── test_utils.py
│ ├── test_worker.py
│ ├── test_zoned_date_second_parameter.py
│ ├── testing/
│ │ ├── __init__.py
│ │ └── test_pandas_assert.py
│ ├── tree/
│ │ ├── __init__.py
│ │ ├── test_task_info.py
│ │ └── test_task_info_formatter.py
│ └── util.py
└── tox.ini
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/CODEOWNERS
================================================
* @Hi-king @yokomotod @hirosassa @mski-iksm @kitagry @ujiuji1259 @mamo3gr @hiro-o918
================================================
FILE: .github/workflows/format.yml
================================================
name: Lint
on:
push:
branches: [ master ]
pull_request:
jobs:
formatting-check:
name: Lint
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Set up the latest version of uv
uses: astral-sh/setup-uv@v7
with:
enable-cache: true
- name: Install dependencies
run: |
uv tool install --python-preference only-managed --python 3.13 tox --with tox-uv
- name: Run ruff and mypy
run: |
uvx --with tox-uv tox run -e ruff,mypy
================================================
FILE: .github/workflows/publish.yml
================================================
name: Publish
on:
push:
tags: '*'
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Set up the latest version of uv
uses: astral-sh/setup-uv@v7
with:
enable-cache: true
- name: Build and publish
env:
UV_PUBLISH_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
run: |
uv build
uv publish
================================================
FILE: .github/workflows/test.yml
================================================
name: Test
on:
push:
branches: [ master ]
pull_request:
jobs:
tests:
runs-on: ${{ matrix.platform }}
strategy:
max-parallel: 7
matrix:
platform: ["ubuntu-latest"]
tox-env: ["py310", "py311", "py312", "py313", "py314"]
include:
- platform: macos-15
tox-env: "py313"
- platform: macos-latest
tox-env: "py313"
steps:
- uses: actions/checkout@v6
- name: Set up the latest version of uv
uses: astral-sh/setup-uv@v7
with:
enable-cache: true
- name: Install dependencies
run: |
uv tool install --python-preference only-managed --python 3.13 tox --with tox-uv
- name: Test with tox
run: uvx --with tox-uv tox run -e ${{ matrix.tox-env }}
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
# pycharm
.idea
# gokart
resources
examples/resources
# poetry
dist
# temporary data
temporary.zip
================================================
FILE: .readthedocs.yaml
================================================
# Read the Docs configuration file for Sphinx projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Set the OS, Python version and other tools you might need
build:
os: ubuntu-24.04
tools:
python: "3.12"
# Build from the docs/ directory with Sphinx
sphinx:
configuration: docs/conf.py
# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
install:
- requirements: docs/requirements.txt
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2018 M3, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# gokart
<p align="center">
<img src="https://raw.githubusercontent.com/m3dev/gokart/master/docs/gokart_logo_side_isolation.svg" width="90%">
</p>
[](https://github.com/m3dev/gokart/actions?query=workflow%3ATest)
[](https://gokart.readthedocs.io/en/latest/)
[](https://pypi.org/project/gokart/)
[](https://pypi.org/project/gokart/)

Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipelines.
[Documentation](https://gokart.readthedocs.io/en/latest/) for the latest release is hosted on readthedocs.
# About gokart
Here are some good things about gokart.
- The following metadata for each Task is stored separately in a `pkl` file whose name includes a hash value
- task output data
- versions of all imported modules
- task processing time
- random seed in task
- displayed log
- all parameters set as class variables in the task
- Automatically rerun the pipeline if parameters of Tasks are changed.
- Support GCS and S3 as a data store for intermediate results of Tasks in the pipeline.
- The above output is exchanged between tasks as an intermediate file, which is memory-friendly
- `pandas.DataFrame` type and column checking during I/O
- Directory structure of saved files is automatically determined from structure of script
- Seeds for numpy and random are automatically fixed
- Can code while adhering to [SOLID](https://en.wikipedia.org/wiki/SOLID) principles as much as possible
- Tasks are locked via redis even if they run in parallel
**All the functions above were created for constructing Machine Learning batches, and together they provide an excellent environment for reproducibility and team development.**
Here are some non-goals / downsides of gokart.
- Batch execution in parallel is supported, but not parallel and concurrent execution of tasks in memory.
- Gokart is focused on reproducibility. So, I/O and capacity of data storage can become a bottleneck.
- No support for task visualization.
- Gokart is not an experiment management tool. The management of the execution result is cut out as [Thunderbolt](https://github.com/m3dev/thunderbolt).
- Gokart does not recommend writing pipelines in toml, yaml, json, and the like; it prefers writing them in Python.
# Getting Started
Within the activated Python environment, use the following command to install gokart.
```
pip install gokart
```
# Quickstart
## Minimal Example
A minimal gokart task looks something like this:
```python
import gokart
class Example(gokart.TaskOnKart):
def run(self):
self.dump('Hello, world!')
task = Example()
output = gokart.build(task)
print(output)
```
`gokart.build` returns the result dumped by the `gokart.TaskOnKart` task. The example will output the following.
```
Hello, world!
```
## Type-Safe Pipeline Example
We introduce type-annotations to make a gokart pipeline robust.
Please check the following example to see how to use type-annotations on gokart.
Before using this feature, enable the [mypy plugin](https://gokart.readthedocs.io/en/latest/mypy_plugin.html) in your project.
```python
import gokart
# `gokart.TaskOnKart[str]` means that the task dumps `str`
class StrDumpTask(gokart.TaskOnKart[str]):
def run(self):
self.dump('Hello, world!')
# `gokart.TaskOnKart[int]` means that the task dumps `int`
class OneDumpTask(gokart.TaskOnKart[int]):
def run(self):
self.dump(1)
# `gokart.TaskOnKart[int]` means that the task dumps `int`
class TwoDumpTask(gokart.TaskOnKart[int]):
def run(self):
self.dump(2)
class AddTask(gokart.TaskOnKart[int]):
# `a` requires a task to dump `int`
a: gokart.TaskInstanceParameter[gokart.TaskOnKart[int]] = gokart.TaskInstanceParameter()
# `b` requires a task to dump `int`
b: gokart.TaskInstanceParameter[gokart.TaskOnKart[int]] = gokart.TaskInstanceParameter()
def requires(self):
return dict(a=self.a, b=self.b)
def run(self):
# loading by instance parameter,
# `a` and `b` are treated as `int`
# because they are declared as `gokart.TaskOnKart[int]`
a = self.load(self.a)
b = self.load(self.b)
self.dump(a + b)
valid_task = AddTask(a=OneDumpTask(), b=TwoDumpTask())
# the next line will show type error by mypy
# because `StrDumpTask` dumps `str` and `AddTask` requires `int`
invalid_task = AddTask(a=OneDumpTask(), b=StrDumpTask())
```
This is an introduction to just some of gokart's features.
There are many more useful ones.
Please see the [Documentation](https://gokart.readthedocs.io/en/latest/).
Have a good gokart life.
# Achievements
Gokart is a proven product.
- It's actually been used by [m3.inc](https://corporate.m3.com/en) for over 3 years
- Natural Language Processing Competition by [Nishika.inc](https://nishika.com) 2nd prize : [Solution Repository](https://github.com/vaaaaanquish/nishika_akutagawa_2nd_prize)
# Thanks
gokart is a wrapper for luigi. Thanks to luigi and dependent projects!
- [luigi](https://github.com/spotify/luigi)
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/conf.py
================================================
# https://github.com/sphinx-doc/sphinx/issues/6211
import luigi
import gokart
luigi.task.Task.requires.__doc__ = gokart.task.TaskOnKart.requires.__doc__
luigi.task.Task.output.__doc__ = gokart.task.TaskOnKart.output.__doc__
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
# import os
# import sys
# sys.path.insert(0, os.path.abspath('../gokart/'))
# -- Project information -----------------------------------------------------
project = 'gokart'
copyright = '2019, Masahiro Nishiba'
author = 'Masahiro Nishiba'
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = ''
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.viewcode']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = []
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'gokartdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'gokart.tex', 'gokart Documentation', 'Masahiro Nishiba', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, 'gokart', 'gokart Documentation', [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'gokart', 'gokart Documentation', author, 'gokart', 'One line description of project.', 'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
================================================
FILE: docs/efficient_run_on_multi_workers.rst
================================================
How to improve efficiency when running on multiple workers
===========================================================
If multiple worker nodes are running similar gokart pipelines in parallel, it is possible that the exact same task may be executed by multiple workers.
(For example, when training multiple machine learning models with different parameters, the feature creation task in the first stage is expected to be exactly the same.)
It is inefficient to execute the same task on each of multiple worker nodes, so we want to avoid this.
Here we introduce the `should_lock_run` feature to address this inefficiency.
Suppress run() of the same task with `should_lock_run`
------------------------------------------------------
When `gokart.TaskOnKart.should_lock_run` is set to True, the task will fail fast if the same task is already being run() by another worker.
By failing the task, other tasks that can be executed at that time are given priority.
After that, the failed task is automatically re-executed.
.. code:: python
class SampleTask2(gokart.TaskOnKart):
should_lock_run = True
Additional Option
------------------
Skip completed tasks with `complete_check_at_run`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By setting `gokart.TaskOnKart.complete_check_at_run` to True, the existence of the cache can be rechecked at run() time.
The default is True; if the check takes too much time, set it to False to disable it.
.. code:: python
class SampleTask1(gokart.TaskOnKart):
complete_check_at_run = False
================================================
FILE: docs/for_pandas.rst
================================================
For Pandas
==========
Gokart has several features for Pandas.
Pandas Type Config
------------------
Pandas automatically converts the types of columns, which sometimes causes wrong results. To avoid unintentional type conversion by pandas, gokart lets us specify column names whose types are checked on Task input and output.
.. code:: python
from typing import Any, Dict
import pandas as pd
import gokart
# Please define a class which inherits `gokart.PandasTypeConfig`.
class SamplePandasTypeConfig(gokart.PandasTypeConfig):
@classmethod
def type_dict(cls) -> Dict[str, Any]:
return {'int_column': int}
class SampleTask(gokart.TaskOnKart[pd.DataFrame]):
def run(self):
# [PandasTypeError] because expected type is `int`, but `str` is passed.
df = pd.DataFrame(dict(int_column=['a']))
self.dump(df)
This is useful when a dataframe has nullable columns, because pandas' auto-conversion often fails in such cases.
Easy to Load DataFrame
----------------------
The :func:`~gokart.task.TaskOnKart.load` method is used to load input ``pandas.DataFrame``.
.. code:: python
def requires(self):
return MakeDataFrameTask()
def run(self):
df = self.load()
Please refer to :func:`~gokart.task.TaskOnKart.load`.
Fail on empty DataFrame
-----------------------
When the :attr:`~gokart.task.TaskOnKart.fail_on_empty_dump` parameter is true, the :func:`~gokart.task.TaskOnKart.dump()` method raises :class:`~gokart.errors.EmptyDumpError` on trying to dump empty ``pandas.DataFrame``.
.. code:: python
import gokart
class EmptyTask(gokart.TaskOnKart):
def run(self):
df = pd.DataFrame()
self.dump(df)
::
$ python main.py EmptyTask --fail-on-empty-dump true
# EmptyDumpError
$ python main.py EmptyTask
# Task will run and output an empty dataframe
Empty caches sometimes hide bugs and cost us much debugging time. This feature surfaces such bugs (including wrong data sources) at an early stage.
Please refer to :attr:`~gokart.task.TaskOnKart.fail_on_empty_dump`.
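The guard can be sketched without gokart itself: a dump helper that raises when the value looks like an empty DataFrame. The ``EmptyDumpError`` class and ``guarded_dump`` helper below are illustrative stand-ins, assuming only that DataFrame-like objects expose an ``empty`` attribute, as ``pandas.DataFrame`` does.

```python
class EmptyDumpError(Exception):
    """Stand-in for gokart.errors.EmptyDumpError."""


def guarded_dump(data, fail_on_empty_dump: bool = False):
    # pandas.DataFrame exposes an `empty` attribute; check it only when the
    # guard is enabled, mirroring the fail_on_empty_dump parameter.
    if fail_on_empty_dump and getattr(data, 'empty', False):
        raise EmptyDumpError('refusing to dump an empty DataFrame')
    return data  # a real task would serialize `data` to its target here
```

With the guard off, empty data passes through silently; with it on, the dump fails loudly at the point of the bug rather than downstream.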
================================================
FILE: docs/gokart.rst
================================================
gokart package
==============
Submodules
----------
gokart.file\_processor module
-----------------------------
.. automodule:: gokart.file_processor
:members:
:undoc-members:
:show-inheritance:
gokart.info module
------------------
.. automodule:: gokart.info
:members:
:undoc-members:
:show-inheritance:
gokart.parameter module
-----------------------
.. automodule:: gokart.parameter
:members:
:undoc-members:
:show-inheritance:
gokart.run module
-----------------
.. automodule:: gokart.run
:members:
:undoc-members:
:show-inheritance:
gokart.s3\_config module
------------------------
.. automodule:: gokart.s3_config
:members:
:undoc-members:
:show-inheritance:
gokart.target module
--------------------
.. automodule:: gokart.target
:members:
:undoc-members:
:show-inheritance:
gokart.task module
------------------
.. automodule:: gokart.task
:members:
:undoc-members:
:show-inheritance:
gokart.workspace\_management module
-----------------------------------
.. automodule:: gokart.workspace_management
:members:
:undoc-members:
:show-inheritance:
gokart.zip\_client module
-------------------------
.. automodule:: gokart.zip_client
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: gokart
:members:
:undoc-members:
:show-inheritance:
================================================
FILE: docs/index.rst
================================================
.. gokart documentation master file, created by
sphinx-quickstart on Fri Jan 11 07:59:25 2019.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to gokart's documentation!
==================================
Useful links: `GitHub <https://github.com/m3dev/gokart>`_ | `cookiecutter gokart <https://github.com/m3dev/cookiecutter-gokart>`_
`Gokart <https://github.com/m3dev/gokart>`_ is a wrapper of the data pipeline library `luigi <https://github.com/spotify/luigi>`_. Gokart solves "**reproducibility**", "**task dependencies**", "**constraints of good code**", and "**ease of use**" for Machine Learning Pipeline.
Good thing about gokart
-----------------------
Here are some good things about gokart.
- The following data for each Task is stored separately in a pkl file whose name includes a hash value
- task output data
- versions of all imported modules
- task processing time
- random seed in task
- displayed log
- all parameters set as class variables in the task
- If a Task's parameters change, it is rerun automatically.
- The above file will be generated with a different hash value
- The hash value of dependent task will also change and both will be rerun
- Support GCS or S3
- The above output is exchanged between tasks as an intermediate file, which is memory-friendly
- pandas.DataFrame type and column checking during I/O
- Directory structure of saved files is automatically determined from structure of script
- Seeds for numpy and random are automatically fixed
- Can code while adhering to SOLID principles as much as possible
- Tasks are locked via redis even if they run in parallel
**All of these functions were built for creating Machine Learning batches, and together they provide an excellent environment for reproducibility and team development.**
Getting started
-----------------
.. toctree::
:maxdepth: 2
intro_to_gokart
tutorial
User Guide
-----------------
.. toctree::
:maxdepth: 2
task_on_kart
task_parameters
setting_task_parameters
task_settings
task_information
logging
slack_notification
using_task_task_conflict_prevention_lock
efficient_run_on_multi_workers
for_pandas
polars
mypy_plugin
API References
--------------
.. toctree::
:maxdepth: 2
gokart
Indices and tables
-------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/intro_to_gokart.rst
================================================
Intro To Gokart
===============
Installation
------------
Within the activated Python environment, use the following command to install gokart.
.. code:: sh
pip install gokart
Quickstart
----------
A minimal gokart task looks something like this:
.. code:: python
import gokart
class Example(gokart.TaskOnKart[str]):
def run(self):
self.dump('Hello, world!')
task = Example()
output = gokart.build(task)
print(output)
``gokart.build`` returns the result dumped by the ``gokart.TaskOnKart`` task. The example will output the following.
.. code:: sh
Hello, world!
``gokart`` records all the information needed for Machine Learning. By default, ``resources`` will be generated in the same directory as the script.
.. code:: sh
$ tree resources/
resources/
├── __main__
│ └── Example_8441c59b5ce0113396d53509f19371fb.pkl
└── log
├── module_versions
│ └── Example_8441c59b5ce0113396d53509f19371fb.txt
├── processing_time
│ └── Example_8441c59b5ce0113396d53509f19371fb.pkl
├── random_seed
│ └── Example_8441c59b5ce0113396d53509f19371fb.pkl
├── task_log
│ └── Example_8441c59b5ce0113396d53509f19371fb.pkl
└── task_params
└── Example_8441c59b5ce0113396d53509f19371fb.pkl
The result of dumping the task is saved under a directory named after the module's ``__name__`` (here ``__main__``).
.. code:: python
import pickle
with open('resources/__main__/Example_8441c59b5ce0113396d53509f19371fb.pkl', 'rb') as f:
print(pickle.load(f)) # Hello, world!
The file is given a hash value that depends on the task's parameters. This means that if you change a parameter of the task, the hash value changes, and so does the output file. This is very useful when experimenting with different parameters. Please refer to the :doc:`task_parameters` section for task parameters. Also see the :doc:`task_on_kart` section for how to change this output destination.
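The idea of parameter-dependent file names can be mimicked with the standard library: serialize the task name and its parameters deterministically, hash them, and use the digest as a suffix. This is a sketch of the concept only; gokart's actual unique-id computation differs, and ``output_name`` below is a hypothetical helper.

```python
import hashlib
import json


def output_name(task_name: str, params: dict) -> str:
    # sort_keys makes the serialization deterministic, so identical parameters
    # always yield the same hash and any change yields a different file name.
    digest = hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()
    return f'{task_name}_{digest}.pkl'
```

For example, ``output_name('Example', {'param': 'hello'})`` and ``output_name('Example', {'param': 'world'})`` produce different file names, so both results can coexist in ``resources/``.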
In addition, the following files are automatically saved as ``log``.
- ``module_versions``: The versions of all modules that were imported when the script was executed, for reproducibility.
- ``processing_time``: The execution time of the task.
- ``random_seed``: The random seeds of python and numpy, for reproducibility in Machine Learning. Please refer to the :doc:`task_settings` section.
- ``task_log``: The output of the task logger.
- ``task_params``: The task's parameters. Please refer to the :doc:`task_parameters` section.
How to run tasks
-------------------
Gokart has ``run`` and ``build`` methods for running tasks. Each has a different purpose.
- ``gokart.run``: takes arguments from the shell; returns a return code.
- ``gokart.build``: is called inline from code such as a jupyter notebook or IPython; returns the task output.
.. note::
It is not recommended to use ``gokart.run`` and ``gokart.build`` together in the same script, because ``gokart.build`` clears the contents of ``luigi.register``. This is the only way to handle duplicate tasks.
gokart.run
~~~~~~~~~~
The :func:`~gokart.run` function is driven from the shell.
.. code:: python
import gokart
import luigi
class SampleTask(gokart.TaskOnKart[str]):
param = luigi.Parameter()
def run(self):
self.dump(self.param)
gokart.run()
.. code:: sh
python sample.py SampleTask --local-scheduler --param=hello
Written entirely in Python, this is equivalent to the following.
.. code:: python
gokart.run(['SampleTask', '--local-scheduler', '--param=hello'])
gokart.build
~~~~~~~~~~~~
The :func:`~gokart.build` function is called inline from Python code.
.. code:: python
import gokart
import luigi
class SampleTask(gokart.TaskOnKart[str]):
param: luigi.Parameter = luigi.Parameter()
def run(self):
self.dump(self.param)
gokart.build(SampleTask(param='hello'), return_value=False)
To output logs of each task, you can pass the ``log_level`` parameter to ``gokart.build`` as follows:
.. code:: python
gokart.build(SampleTask(param='hello'), return_value=False, log_level=logging.DEBUG)
This feature is very useful for running ``gokart`` on a jupyter notebook.
When some tasks fail, ``gokart.build`` raises ``GokartBuildError``. If you need tracebacks, set ``log_level`` to ``logging.DEBUG``.
================================================
FILE: docs/logging.rst
================================================
Logging
=======
How to set up a common logger for gokart.
Core settings
-------------
Please write a configuration file similar to the following:
::
# base.ini
[core]
logging_conf_file=./conf/logging.ini
.. code:: python
import gokart
gokart.add_config('base.ini')
Logger ini file
---------------
The format is the same as a general ``logging.ini`` file.
::
[loggers]
keys=root,luigi,luigi-interface,gokart,gokart.file_processor
[handlers]
keys=stderrHandler
[formatters]
keys=simpleFormatter
[logger_root]
level=INFO
handlers=stderrHandler
[logger_gokart]
level=INFO
handlers=stderrHandler
qualname=gokart
propagate=0
[logger_luigi]
level=INFO
handlers=stderrHandler
qualname=luigi
propagate=0
[logger_luigi-interface]
level=INFO
handlers=stderrHandler
qualname=luigi-interface
propagate=0
[logger_gokart.file_processor]
level=CRITICAL
handlers=stderrHandler
qualname=gokart.file_processor
[handler_stderrHandler]
class=StreamHandler
formatter=simpleFormatter
args=(sys.stdout,)
[formatter_simpleFormatter]
format=[%(asctime)s][%(name)s][%(levelname)s](%(filename)s:%(lineno)s) %(message)s
datefmt=%Y/%m/%d %H:%M:%S
Please refer to the `Python logging documentation <https://docs.python.org/3/library/logging.config.html>`_ for details.
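For reference, an equivalent configuration can be expressed programmatically with the standard library's ``logging.config.dictConfig``. This is a sketch that mirrors the ini file above, not a gokart API:

```python
import logging.config

# A dictConfig equivalent of the ini file above.
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simpleFormatter': {
            'format': '[%(asctime)s][%(name)s][%(levelname)s](%(filename)s:%(lineno)s) %(message)s',
            'datefmt': '%Y/%m/%d %H:%M:%S',
        }
    },
    'handlers': {
        'stderrHandler': {
            'class': 'logging.StreamHandler',
            'formatter': 'simpleFormatter',
            'stream': 'ext://sys.stdout',
        }
    },
    'loggers': {
        'gokart': {'level': 'INFO', 'handlers': ['stderrHandler'], 'propagate': False},
        'luigi': {'level': 'INFO', 'handlers': ['stderrHandler'], 'propagate': False},
        'luigi-interface': {'level': 'INFO', 'handlers': ['stderrHandler'], 'propagate': False},
        'gokart.file_processor': {'level': 'CRITICAL', 'handlers': ['stderrHandler']},
    },
    'root': {'level': 'INFO', 'handlers': ['stderrHandler']},
}

logging.config.dictConfig(LOGGING)
```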
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd
================================================
FILE: docs/mypy_plugin.rst
================================================
[Experimental] Mypy plugin
===========================
The Mypy plugin provides type checking for gokart tasks.
This feature is experimental.
How to use
--------------
Configure Mypy to use this plugin by adding the following to your ``mypy.ini`` file:
.. code:: ini
[mypy]
plugins = gokart.mypy:plugin
or by adding the following to your ``pyproject.toml`` file:
.. code:: toml
[tool.mypy]
plugins = ["gokart.mypy"]
Then, run Mypy as usual.
Examples
--------
For example, the following code is checked by Mypy:
.. code:: python
import gokart
import luigi
class Foo(gokart.TaskOnKart):
        # NOTE: all the parameters must be annotated
foo: int = luigi.IntParameter(default=1)
bar: str = luigi.Parameter()
Foo(foo=1, bar='2') # OK
Foo(foo='1') # NG because foo is not int and bar is missing
The Mypy plugin also checks ``TaskOnKart`` generic types.
.. code:: python
class SampleTask(gokart.TaskOnKart):
str_task: gokart.TaskOnKart[str] = gokart.TaskInstanceParameter()
int_task: gokart.TaskOnKart[int] = gokart.TaskInstanceParameter()
def requires(self):
return dict(str=self.str_task, int=self.int_task)
def run(self):
s = self.load(self.str_task) # This type is inferred with "str"
i = self.load(self.int_task) # This type is inferred with "int"
SampleTask(
str_task=StrTask(), # mypy ok
        int_task=StrTask(),  # mypy error: Argument "int_task" to "SampleTask" has incompatible type "StrTask"; expected "TaskOnKart[int]"
)
Configurations (only pyproject.toml)
--------------------------------------
You can configure the Mypy plugin using the ``pyproject.toml`` file.
The following options are available:
.. code:: toml
[tool.gokart-mypy]
# If true, Mypy will raise an error if a task is missing required parameters.
    # Note that this also causes an error when parameters are set via `luigi.Config()`.
# Default: false
disallow_missing_parameters = true
================================================
FILE: docs/polars.rst
================================================
Polars Support
==============
Gokart supports Polars DataFrames alongside pandas DataFrames for DataFrame-based file processors. This allows gradual migration from pandas to Polars or using both libraries simultaneously in your data pipelines.
Installation
------------
Polars support is optional. Install it with:
.. code:: bash
pip install gokart[polars]
Or install Polars separately:
.. code:: bash
pip install polars
Basic Usage
-----------
To use Polars DataFrames with gokart, specify ``dataframe_type='polars'`` when creating file processors:
.. code:: python
import polars as pl
from gokart import TaskOnKart
from gokart.file_processor import FeatherFileProcessor
class MyPolarsTask(TaskOnKart[pl.DataFrame]):
def output(self):
return self.make_target(
'path/to/target.feather',
processor=FeatherFileProcessor(
store_index_in_feather=False,
dataframe_type='polars'
)
)
def run(self):
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
self.dump(df)
Supported File Processors
--------------------------
The following file processors support the ``dataframe_type`` parameter:
CsvFileProcessor
^^^^^^^^^^^^^^^^
.. code:: python
from gokart.file_processor import CsvFileProcessor
# For Polars
processor = CsvFileProcessor(sep=',', encoding='utf-8', dataframe_type='polars')
# For pandas (default)
processor = CsvFileProcessor(sep=',', encoding='utf-8', dataframe_type='pandas')
# or simply
processor = CsvFileProcessor(sep=',', encoding='utf-8')
JsonFileProcessor
^^^^^^^^^^^^^^^^^
.. code:: python
from gokart.file_processor import JsonFileProcessor
# For Polars
processor = JsonFileProcessor(orient='records', dataframe_type='polars')
# For pandas (default)
processor = JsonFileProcessor(orient='records', dataframe_type='pandas')
ParquetFileProcessor
^^^^^^^^^^^^^^^^^^^^
.. code:: python
from gokart.file_processor import ParquetFileProcessor
# For Polars
processor = ParquetFileProcessor(
compression='gzip',
dataframe_type='polars'
)
# For pandas (default)
processor = ParquetFileProcessor(
compression='gzip',
dataframe_type='pandas'
)
FeatherFileProcessor
^^^^^^^^^^^^^^^^^^^^
.. code:: python
from gokart.file_processor import FeatherFileProcessor
# For Polars
processor = FeatherFileProcessor(
store_index_in_feather=False,
dataframe_type='polars'
)
# For pandas (default)
processor = FeatherFileProcessor(
store_index_in_feather=True,
dataframe_type='pandas'
)
.. note::
The ``store_index_in_feather`` parameter is pandas-specific and is ignored when using Polars.
Using Pandas and Polars Together
---------------------------------
Since projects often migrate from pandas gradually, gokart allows you to use both pandas and Polars simultaneously:
.. code:: python
import pandas as pd
import polars as pl
from gokart import TaskOnKart
from gokart.file_processor import FeatherFileProcessor
class PandasTask(TaskOnKart[pd.DataFrame]):
"""Task that outputs pandas DataFrame"""
def output(self):
return self.make_target(
'path/to/pandas_output.feather',
processor=FeatherFileProcessor(
store_index_in_feather=False,
dataframe_type='pandas'
)
)
def run(self):
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
self.dump(df)
class PolarsTask(TaskOnKart[pl.DataFrame]):
"""Task that outputs Polars DataFrame"""
def requires(self):
return PandasTask()
def output(self):
return self.make_target(
'path/to/polars_output.feather',
processor=FeatherFileProcessor(
store_index_in_feather=False,
dataframe_type='polars'
)
)
def run(self):
# Load pandas DataFrame and convert to Polars
pandas_df = self.load() # Returns pandas DataFrame
polars_df = pl.from_pandas(pandas_df)
# Process with Polars
result = polars_df.with_columns(
(pl.col('a') * 2).alias('a_doubled')
)
self.dump(result)
Default Behavior
----------------
When ``dataframe_type`` is not specified, file processors default to ``'pandas'`` for backward compatibility:
.. code:: python
# These are equivalent
processor = CsvFileProcessor(sep=',')
processor = CsvFileProcessor(sep=',', dataframe_type='pandas')
Important Notes
---------------
**File Format Compatibility**
Files created with Polars processors can be read by pandas processors and vice versa. The underlying file formats (CSV, JSON, Parquet, Feather) are library-agnostic.
**Pandas-specific Features**
Some pandas-specific features are not available with Polars:
- ``store_index_in_feather`` parameter in ``FeatherFileProcessor`` is ignored for Polars
- ``engine`` parameter in ``ParquetFileProcessor`` is ignored for Polars (uses Polars' default)
**Error Handling**
If you specify ``dataframe_type='polars'`` but Polars is not installed, you'll get an ``ImportError`` with installation instructions:
.. code:: text
ImportError: polars is required for dataframe_type='polars'. Install with: pip install polars
Migration Strategy
------------------
Recommended approach for migrating from pandas to Polars:
1. Install Polars: ``pip install gokart[polars]``
2. Create new tasks using ``dataframe_type='polars'``
3. Keep existing tasks with ``dataframe_type='pandas'`` or default behavior
4. Gradually migrate tasks as needed
5. Convert DataFrames between libraries using ``pl.from_pandas()`` and ``df.to_pandas()`` when necessary
================================================
FILE: docs/requirements.txt
================================================
Sphinx
gokart
sphinx-rtd-theme
================================================
FILE: docs/setting_task_parameters.rst
================================================
============================
Setting Task Parameters
============================
There are several ways to set task parameters.
- Set parameter from command line
- Set parameter at config file
- Set parameter at upstream task
- Inherit parameter from other task
Set parameter from command line
==================================
.. code:: sh
python main.py sample.SomeTask --SomeTask-param=Hello
Parameters of each task can be set as command-line arguments in the ``--[task name]-[parameter name]=[value]`` format.
Set parameter at config file
==================================
::
[sample.SomeTask]
param = Hello
The above config file (``config.ini``) must be read before ``gokart.run()``, as in the following code:
.. code:: python
if __name__ == '__main__':
gokart.add_config('./conf/config.ini')
gokart.run()
Parameters can also be loaded from environment variables, as in the following config:
::
[sample.SomeTask]
param=${PARAMS}
[TaskOnKart]
workspace_directory=${WORKSPACE_DIRECTORY}
The advantages of using environment variables are that 1) sensitive information is not logged, and 2) common settings can be shared across environments.
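As a rough illustration only: the ``${...}`` references behave like shell-style variable expansion, which can be sketched with the standard library's ``os.path.expandvars`` (luigi performs its own interpolation when it reads the config; this is not the actual mechanism):

```python
import os

# Hypothetical values, set here only for illustration.
os.environ['PARAMS'] = 'Hello'
os.environ['WORKSPACE_DIRECTORY'] = './resources/'

# Each ${VAR} reference in the config is replaced by the
# corresponding environment variable when the config is loaded.
expanded_param = os.path.expandvars('param=${PARAMS}')
expanded_dir = os.path.expandvars('workspace_directory=${WORKSPACE_DIRECTORY}')

print(expanded_param)  # param=Hello
print(expanded_dir)    # workspace_directory=./resources/
```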
Set parameter at upstream task
==================================
Parameters can be set at the upstream task, as in a typical pipeline.
.. code:: python
class UpstreamTask(gokart.TaskOnKart):
def requires(self):
return dict(sometask=SomeTask(param='Hello'))
Inherit parameter from other task
==================================
Parameter values can be inherited from another task using the ``@inherits_config_params`` decorator.
.. code:: python
class MasterConfig(luigi.Config):
param: luigi.Parameter = luigi.Parameter()
param2: luigi.Parameter = luigi.Parameter()
@inherits_config_params(MasterConfig)
class SomeTask(gokart.TaskOnKart):
param: luigi.Parameter = luigi.Parameter()
This is useful when multiple tasks have the same parameter. In the above example, the parameter settings of ``MasterConfig`` will be inherited by all tasks decorated with ``@inherits_config_params(MasterConfig)``, such as ``SomeTask``.
Note that only parameters which exist in both ``MasterConfig`` and ``SomeTask`` will be inherited.
In the above example, ``param2`` will not be available in ``SomeTask``, since ``SomeTask`` does not have ``param2`` parameter.
.. code:: python
class MasterConfig(luigi.Config):
param: luigi.Parameter = luigi.Parameter()
param2: luigi.Parameter = luigi.Parameter()
@inherits_config_params(MasterConfig, parameter_alias={'param2': 'param3'})
class SomeTask(gokart.TaskOnKart):
param3: luigi.Parameter = luigi.Parameter()
You may also set a parameter name alias by setting ``parameter_alias``.
``parameter_alias`` must be a dictionary mapping the config class's parameter name (key) to the decorated task's parameter name (value).
In the above example, ``SomeTask.param3`` will be set to the same value as ``MasterConfig.param2``.
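The aliasing semantics can be sketched in plain Python. This is a conceptual illustration only, not gokart's actual implementation; ``MasterConfig`` is replaced here by a plain dict of settings:

```python
def inherits_config_params(config_values, parameter_alias=None):
    """Copy values from `config_values` onto matching attributes of a class.

    `parameter_alias` maps a config parameter name to the decorated
    class's parameter name, mirroring the decorator described above.
    """
    alias = parameter_alias or {}

    def decorator(cls):
        for name, value in config_values.items():
            target = alias.get(name, name)
            # Only parameters that exist on the decorated class are inherited.
            if hasattr(cls, target):
                setattr(cls, target, value)
        return cls

    return decorator


master_config = {'param': 'Hello', 'param2': 'World'}


@inherits_config_params(master_config, parameter_alias={'param2': 'param3'})
class SomeTask:
    param3 = None  # corresponds to a luigi.Parameter() in the real code


print(SomeTask.param3)  # World
```

Note that ``param`` is not copied onto ``SomeTask`` because ``SomeTask`` has no attribute of that name, matching the "only parameters which exist in both" rule above.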
================================================
FILE: docs/slack_notification.rst
================================================
Slack notification
=========================
Prerequisites
-------------
Prepare the following environment variables:
.. code:: sh
    export SLACK_TOKEN=xoxb-your-token  # use a token that starts with "xoxb-" (a bot token is preferable)
    export SLACK_CHANNEL=channel-name  # not "#channel-name", just "channel-name"
A Slack bot token can be obtained from the `Slack apps page <https://api.slack.com/apps>`_.
A bot token needs the following scopes:
- `channels:read`
- `chat:write`
- `files:write`
For more about scopes, see the `Slack scopes documentation <https://api.slack.com/scopes>`_.
Implement Slack notification
----------------------------
Write the following code to pass these arguments to your gokart workflow.
.. code:: python
    import os
    import sys

    import gokart

    cmdline_args = sys.argv[1:]
if 'SLACK_CHANNEL' in os.environ:
cmdline_args.append(f'--SlackConfig-channel={os.environ["SLACK_CHANNEL"]}')
if 'SLACK_TO_USER' in os.environ:
cmdline_args.append(f'--SlackConfig-to-user={os.environ["SLACK_TO_USER"]}')
gokart.run(cmdline_args)
================================================
FILE: docs/task_information.rst
================================================
Task Information
================
There are 6 ways to print the significant parameters and state of the task and its dependencies.
* 1. Use the luigi module. See the `luigi.tools.deps_tree module <https://luigi.readthedocs.io/en/stable/api/luigi.tools.deps_tree.html>`_ for details.
* 2. ``task-info`` option of ``gokart.run()``.
* 3. ``make_task_info_as_tree_str()`` will return the significant parameters and dependency tree as a str.
* 4. ``make_task_info_as_table()`` will return the significant parameters and dependent tasks in ``pandas.DataFrame`` table format.
* 5. ``dump_task_info_table()`` will dump the result of ``make_task_info_as_table()`` to a file.
* 6. ``dump_task_info_tree()`` will dump the task tree object (``TaskInfo``) to a pickle file.
This document covers methods 2 to 6.
2. task-info option of gokart.run()
--------------------------------------------
On CLI
~~~~~~
An example implementation could be like:
.. code:: python
# main.py
import gokart
if __name__ == '__main__':
gokart.run()
.. code:: sh
$ python main.py \
TaskB \
--param=Hello \
--local-scheduler \
--tree-info-mode=all \
--tree-info-output-path=tree_all.txt
The ``--tree-info-mode`` option accepts "simple" and "all", and the task information is saved to ``--tree-info-output-path``.
When "simple" is passed, it outputs the states and the unique ids of tasks.
An example output is as follows:
.. code:: text
└─-(COMPLETE) TaskB[09fe5591ef2969ce7443c419a3b19e5d]
└─-(COMPLETE) TaskA[2549878535c070fb6c3cd4061bdbbcff]
When "all" is passed, it outputs the states, the unique ids, the significant parameters, the execution times and the task logs of tasks.
An example output is as follows:
.. code:: text
└─-(COMPLETE) TaskB[09fe5591ef2969ce7443c419a3b19e5d](parameter={'workspace_directory': './resources/', 'local_temporary_directory': './resources/tmp/', 'param': 'Hello'}, output=['./resources/output_of_task_b_09fe5591ef2969ce7443c419a3b19e5d.pkl'], time=0.002290010452270508s, task_log={})
└─-(COMPLETE) TaskA[2549878535c070fb6c3cd4061bdbbcff](parameter={'workspace_directory': './resources/', 'local_temporary_directory': './resources/tmp/', 'param': 'called by TaskB'}, output=['./resources/output_of_task_a_2549878535c070fb6c3cd4061bdbbcff.pkl'], time=0.0009829998016357422s, task_log={})
3. make_task_info_as_tree_str()
-----------------------------------------
``gokart.tree.task_info.make_task_info_as_tree_str()`` will return the task dependency tree as a str.
.. code:: python
from gokart.tree.task_info import make_task_info_as_tree_str
make_task_info_as_tree_str(task, ignore_task_names)
# Parameters
# ----------
# - task: TaskOnKart
# Root task.
# - details: bool
# Whether or not to output details.
# - abbr: bool
# Whether or not to simplify tasks information that has already appeared.
# - ignore_task_names: Optional[List[str]]
# List of task names to ignore.
# Returns
# -------
# - tree_info : str
# Formatted task dependency tree.
Example:
.. code:: python
import luigi
import gokart
class TaskA(gokart.TaskOnKart[str]):
param = luigi.Parameter()
def run(self):
self.dump(f'{self.param}')
class TaskB(gokart.TaskOnKart[str]):
task: gokart.TaskOnKart[str] = gokart.TaskInstanceParameter()
def run(self):
task = self.load('task')
self.dump(task + ' taskB')
class TaskC(gokart.TaskOnKart[str]):
task: gokart.TaskOnKart[str] = gokart.TaskInstanceParameter()
def run(self):
task = self.load('task')
self.dump(task + ' taskC')
class TaskD(gokart.TaskOnKart):
task1: gokart.TaskOnKart[str] = gokart.TaskInstanceParameter()
task2: gokart.TaskOnKart[str] = gokart.TaskInstanceParameter()
def run(self):
task = [self.load('task1'), self.load('task2')]
self.dump(','.join(task))
.. code:: python
task = TaskD(
task1=TaskD(
task1=TaskD(task1=TaskC(task=TaskA(param='foo')), task2=TaskC(task=TaskB(task=TaskA(param='bar')))), # same task
task2=TaskD(task1=TaskC(task=TaskA(param='foo')), task2=TaskC(task=TaskB(task=TaskA(param='bar')))) # same task
),
task2=TaskD(
task1=TaskD(task1=TaskC(task=TaskA(param='foo')), task2=TaskC(task=TaskB(task=TaskA(param='bar')))), # same task
task2=TaskD(task1=TaskC(task=TaskA(param='foo')), task2=TaskC(task=TaskB(task=TaskA(param='bar')))) # same task
)
)
print(gokart.make_task_info_as_tree_str(task))
.. code:: text
└─-(PENDING) TaskD[187ff82158671283e127e2e1f7c9c095]
|--(PENDING) TaskD[ca9e943ce049e992b371898c0578784e] # duplicated TaskD
| |--(PENDING) TaskD[1cc9f9fc54a56614f3adef74398684f4] # duplicated TaskD
| | |--(PENDING) TaskC[dce3d8e7acaf1bb9731fb4f2ae94e473]
| | | └─-(PENDING) TaskA[be65508b556dd3752359b4246791413d]
| | └─-(PENDING) TaskC[de39593d31490aba3cdca3c650432504]
| | └─-(PENDING) TaskB[bc2f7d6cdd6521cc116c35f0f144eed3]
| | └─-(PENDING) TaskA[5a824f7d232eb69d46f0ac6bbd93b565]
| └─-(PENDING) TaskD[1cc9f9fc54a56614f3adef74398684f4]
| └─- ...
└─-(PENDING) TaskD[ca9e943ce049e992b371898c0578784e]
└─- ...
In the above example, the sub-trees already shown are omitted.
This can be disabled by passing ``False`` to the ``abbr`` flag:
.. code:: python
print(make_task_info_as_tree_str(task, abbr=False))
4. make_task_info_as_table()
--------------------------------
``gokart.tree.task_info.make_task_info_as_table()`` will return a table containing the information of significant parameters and dependent tasks as a pandas DataFrame.
This table contains `task name`, `cache unique id`, `cache file path`, `task parameters`, `task processing time`, `completed flag`, and `task log`.
.. code:: python
from gokart.tree.task_info import make_task_info_as_table
make_task_info_as_table(task, ignore_task_names)
# """Return a table containing information about dependent tasks.
#
# Parameters
# ----------
# - task: TaskOnKart
# Root task.
# - ignore_task_names: Optional[List[str]]
# List of task names to ignore.
# Returns
# -------
# - task_info_table : pandas.DataFrame
# Formatted task dependency table.
# """
5. dump_task_info_table()
-----------------------------------------
``gokart.tree.task_info.dump_task_info_table()`` will dump the task_info table made at ``make_task_info_as_table()`` to a file.
.. code:: python
from gokart.tree.task_info import dump_task_info_table
dump_task_info_table(task, task_info_dump_path, ignore_task_names)
# Parameters
# ----------
# - task: TaskOnKart
# Root task.
# - task_info_dump_path: str
# Output target file path. Path destination can be `local`, `S3`, or `GCS`.
# File extension can be any type that gokart file processor accepts, including `csv`, `pickle`, or `txt`.
# See `TaskOnKart.make_target module <https://gokart.readthedocs.io/en/latest/task_on_kart.html#taskonkart-make-target>` for details.
# - ignore_task_names: Optional[List[str]]
# List of task names to ignore.
# Returns
# -------
# None
6. dump_task_info_tree()
-----------------------------------------
``gokart.tree.task_info.dump_task_info_tree()`` will dump the task tree object (TaskInfo) to a pickle file.
.. code:: python
from gokart.tree.task_info import dump_task_info_tree
dump_task_info_tree(task, task_info_dump_path, ignore_task_names, use_unique_id)
# Parameters
# ----------
# - task: TaskOnKart
# Root task.
# - task_info_dump_path: str
# Output target file path. Path destination can be `local`, `S3`, or `GCS`.
# File extension must be '.pkl'.
# - ignore_task_names: Optional[List[str]]
# List of task names to ignore.
# - use_unique_id: bool = True
# Whether to use unique id to dump target file. Default is True.
# Returns
# -------
# None
Task Logs
---------
To output extra information of tasks via ``tree-info``, the member variable :attr:`~gokart.task.TaskOnKart.task_log` of ``TaskOnKart`` can keep any information as a dictionary.
For instance, when the following code runs,
.. code:: python
import gokart
class SampleTaskLog(gokart.TaskOnKart):
def run(self):
# Add some logs.
self.task_log['sample key'] = 'sample value'
if __name__ == '__main__':
SampleTaskLog().run()
gokart.run([
'--tree-info-mode=all',
'--tree-info-output-path=sample_task_log.txt',
'SampleTaskLog',
'--local-scheduler'])
the output could be like:
.. code:: text
└─-(COMPLETE) SampleTaskLog[...](..., task_log={'sample key': 'sample value'})
Delete Unnecessary Output Files
--------------------------------
To delete output files which are not necessary to run a task, add the option ``--delete-unnecessary-output-files``. For now, this option is supported only when a task outputs files to local storage, not S3.
================================================
FILE: docs/task_on_kart.rst
================================================
TaskOnKart
==========
``TaskOnKart`` inherits ``luigi.Task``, and has functions to make it easy to define tasks.
Please see `luigi documentation <https://luigi.readthedocs.io/en/stable/index.html>`_ for details of ``luigi.Task``.
Please refer to :doc:`intro_to_gokart` section and :doc:`tutorial` section.
Outline
--------
An example of how ``TaskOnKart`` helps to define a task:
.. code:: python
import luigi
import gokart
class TaskA(gokart.TaskOnKart[str]):
param: luigi.Parameter = luigi.Parameter()
def output(self):
return self.make_target('output_of_task_a.pkl')
def run(self):
results = f'param={self.param}'
self.dump(results)
class TaskB(gokart.TaskOnKart[str]):
param: luigi.Parameter = luigi.Parameter()
def requires(self):
return TaskA(param='world')
def output(self):
# `make_target` makes an instance of `luigi.Target`.
# This infers the output format and the destination of an output objects.
# The target file path is
# '{self.workspace_directory}/output_of_task_b_{self.make_unique_id()}.pkl'.
return self.make_target('output_of_task_b.pkl')
def run(self):
# `load` loads input data. In this case, this loads the output of `TaskA`.
output_of_task_a = self.load()
results = f'Task A: {output_of_task_a}\nTaskB: param={self.param}'
# `dump` writes `results` to the file path of `self.output()`.
self.dump(results)
if __name__ == '__main__':
print(gokart.build([TaskB(param='Hello')]))
The result of this script will look like this:
.. code:: sh
Task A: param=world
Task B: param=Hello
The results are obtained as a pipeline by linking A and B.
TaskOnKart.make_target
----------------------
The :func:`~gokart.task.TaskOnKart.make_target` method is used to make an instance of ``luigi.Target``.
For instance, an example implementation could be as follows:
.. code:: python
def output(self):
return self.make_target('file_name.pkl')
The ``make_target`` method adds ``_{self.make_unique_id()}`` to the file name as a suffix.
In this case, the target file path is ``{self.workspace_directory}/file_name_{self.make_unique_id()}.pkl``.
It is also possible to specify a file format other than pkl. The supported file formats are as follows:
- .pkl
- .txt
- .csv
- .tsv
- .gz
- .json
- .xml
- .npz
- .parquet
- .feather
- .png
- .jpg
- .ini
To dump something other than the above formats, use :func:`~gokart.task.TaskOnKart.make_model_target`.
Please refer to :func:`~gokart.task.TaskOnKart.make_target` and the Advanced Features section described later.
.. note::
    By default, the file path is inferred from the ``__name__`` of the script, so the ``output`` method can be omitted.
    Please refer to the :doc:`tutorial` section.
.. note::
    When using ``.feather``, the index will be converted to a column on saving and restored to the index on loading.
    If you prefer not to save the index, set the ``store_index_in_feather=False`` parameter at ``gokart.target.make_target()``.
.. note::
    When you set ``serialized_task_definition_check=True``, the task will rerun when you modify the task's scripts.
    Please note that code outside the class is not considered.
TaskOnKart.load
----------------
The :func:`~gokart.task.TaskOnKart.load` method is used to load input data.
For instance, an example implementation could be as follows:
.. code:: python
def requires(self):
return TaskA(param='called by TaskB')
def run(self):
# `load` loads input data. In this case, this loads the output of `TaskA`.
output_of_task_a = self.load()
In the case that a task requires 2 or more tasks as input, the return value of this method has the same structure as the return value of ``requires``.
For instance, an example implementation where ``requires`` returns a dictionary of tasks could be as follows:
.. code:: python
def requires(self):
return dict(a=TaskA(), b=TaskB())
def run(self):
data = self.load() # returns dict(a=self.load('a'), b=self.load('b'))
The `load` method loads individual task input by passing a key of an input dictionary as follows:
.. code:: python
def run(self):
data_a = self.load('a')
data_b = self.load('b')
As an alternative, the `load` method loads individual task input by passing an instance of TaskOnKart as follows:
.. code:: python
def run(self):
data_a = self.load(TaskA())
data_b = self.load(TaskB())
We can also omit :func:`~gokart.task.TaskOnKart.requires` and specify the required task with :func:`~gokart.parameter.TaskInstanceParameter`.
Also please refer to :func:`~gokart.task.TaskOnKart.load`, :doc:`task_parameters`, and the Advanced Features section described later.
TaskOnKart.dump
----------------
The :func:`~gokart.task.TaskOnKart.dump` method is used to dump results of tasks.
For instance, an example implementation could be as follows:
.. code:: python
def output(self):
return self.make_target('output.pkl')
def run(self):
results = do_something(self.load())
self.dump(results)
In the case that a task has 2 or more outputs, it is possible to specify the output target by passing a dictionary key as follows:
.. code:: python
def output(self):
return dict(a=self.make_target('output_a.pkl'), b=self.make_target('output_b.pkl'))
def run(self):
a_data = do_something_a(self.load())
b_data = do_something_b(self.load())
self.dump(a_data, 'a')
self.dump(b_data, 'b')
Please refer to :func:`~gokart.task.TaskOnKart.dump`.
Advanced Features
---------------------
TaskOnKart.load_generator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :func:`~gokart.task.TaskOnKart.load_generator` method is used to load input data with generator.
For instance, an example implementation could be as follows:
.. code:: python
def requires(self):
return TaskA(param='called by TaskB')
def run(self):
for data in self.load_generator():
any_process(data)
Usage is the same as :func:`~gokart.task.TaskOnKart.load`.
``load_generator`` reads a divided file in iterations.
It is effective when all data cannot fit into memory at once, because ``load_generator`` doesn't load all files at once.
Please refer to :func:`~gokart.task.TaskOnKart.load_generator`.
TaskOnKart.make_model_target
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :func:`~gokart.task.TaskOnKart.make_model_target` method is used to dump objects in file types that are not supported by ``make_target``.
.. code:: python
import gensim
class TrainWord2Vec(gokart.TaskOnKart[Word2VecResult]):
def output(self):
# please use 'zip'.
return self.make_model_target(
'model.zip',
            save_function=gensim.models.Word2Vec.save,
            load_function=gensim.models.Word2Vec.load)
def run(self):
# -- train word2vec ---
word2vec = train_word2vec()
self.dump(word2vec)
The model is dumped and zipped with ``gensim.models.Word2Vec.save``.
Please refer to :func:`~gokart.task.TaskOnKart.make_model_target`.
TaskOnKart.fail_on_empty_dump
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Please refer to :doc:`for_pandas`.
TaskOnKart.should_dump_supplementary_log_files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to dump supplementary files (task_log, random_seed, task_params, processing_time, module_versions) or not. Default is True.
Note that when set to False, task_info functions (e.g. gokart.tree.task_info.make_task_info_as_tree_str()) cannot be used.
Dump csv with encoding
~~~~~~~~~~~~~~~~~~~~~~~
You can dump a csv file by implementing the ``output()`` method as follows:
.. code:: python
def output(self):
return self.make_target('file_name.csv')
By default, a csv file is dumped with ``utf-8`` encoding.
If you want to dump a csv file with another encoding, you can use the ``encoding`` parameter as follows:
.. code:: python
from gokart.file_processor import CsvFileProcessor
def output(self):
return self.make_target('file_name.csv', processor=CsvFileProcessor(encoding='cp932'))
# This will dump csv as 'cp932' which is used in Windows.
Cache output in memory instead of dumping to files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can use :class:`~InMemoryTarget` to cache output in memory instead of dumping to files by calling :func:`~gokart.target.make_in_memory_target`.
Please note that :class:`~InMemoryTarget` is an experimental feature.
.. code:: python
    from gokart.conflict_prevention_lock.task_lock import make_task_lock_params
    from gokart.in_memory.target import make_in_memory_target
def output(self):
unique_id = self.make_unique_id() if use_unique_id else None
# TaskLock is not supported in InMemoryTarget, so it's dummy
task_lock_params = make_task_lock_params(
file_path='dummy_path',
unique_id=unique_id,
redis_host=None,
redis_port=None,
redis_timeout=self.redis_timeout,
raise_task_lock_exception_on_collision=False,
)
return make_in_memory_target('dummy_path', task_lock_params, unique_id)
================================================
FILE: docs/task_parameters.rst
================================================
=================
Task Parameters
=================
Luigi Parameter
================
We can set parameters for tasks.
Also please refer to :doc:`task_settings` section.
.. code:: python
class Task(gokart.TaskOnKart):
param_a: luigi.Parameter = luigi.Parameter()
param_c: luigi.ListParameter = luigi.ListParameter()
param_d: luigi.IntParameter = luigi.IntParameter(default=1)
Please refer to the `luigi documentation <https://luigi.readthedocs.io/en/stable/api/luigi.parameter.html>`_ for a list of parameter types.
Gokart Parameter
================
There are also parameters provided by gokart.
- gokart.TaskInstanceParameter
- gokart.ListTaskInstanceParameter
- gokart.ExplicitBoolParameter
gokart.TaskInstanceParameter
--------------------------------
The :func:`~gokart.parameter.TaskInstanceParameter` takes a task instance as a parameter, so that a task's dependencies can be given dynamically.
.. code:: python
class TaskA(gokart.TaskOnKart[str]):
def run(self):
self.dump('Hello')
class TaskB(gokart.TaskOnKart[str]):
require_task: gokart.TaskInstanceParameter = gokart.TaskInstanceParameter()
def requires(self):
return self.require_task
def run(self):
task_a = self.load()
self.dump(','.join([task_a, 'world']))
task = TaskB(require_task=TaskA())
print(gokart.build(task)) # Hello,world
This helps to create flexible pipelines.
gokart.ListTaskInstanceParameter
-------------------------------------
The :func:`~gokart.parameter.ListTaskInstanceParameter` is a list of ``TaskInstanceParameter``.
gokart.ExplicitBoolParameter
-----------------------------------
The :func:`~gokart.parameter.ExplicitBoolParameter` is a parameter that requires its value to be specified explicitly.
``luigi.BoolParameter`` already has an "explicit parsing" feature, but it still has implicit behavior, as follows:
::
$ python main.py Task --param
# param will be set as True
$ python main.py Task
# param will be set as False
``ExplicitBoolParameter`` avoids this implicit behavior for parameters given on the command line.
gokart.SerializableParameter
----------------------------
The :func:`~gokart.parameter.SerializableParameter` is a parameter for any object that can be serialized and deserialized.
This parameter is particularly useful when you want to pass a complex object or a set of parameters to a task.
The object must implement the following methods:
- ``gokart_serialize``: Serialize the object to a string. This serialized string must uniquely identify the object to enable task caching.
Note that it is not required for deserialization.
- ``gokart_deserialize``: Deserialize the object from a string, typically used for CLI arguments.
Example
^^^^^^^
.. code-block:: python
import json
from dataclasses import dataclass
import gokart
@dataclass(frozen=True)
class Config:
foo: int
# The `bar` field does not affect the result of the task.
# Similar to `luigi.Parameter(significant=False)`.
bar: str
def gokart_serialize(self) -> str:
# Serialize only the `foo` field since `bar` is irrelevant for caching.
return json.dumps({'foo': self.foo})
@classmethod
def gokart_deserialize(cls, s: str) -> 'Config':
# Deserialize the object from the provided string.
return cls(**json.loads(s))
class DummyTask(gokart.TaskOnKart):
config: gokart.SerializableParameter[Config] = gokart.SerializableParameter(object_type=Config)
def run(self):
# Save the `config` object as part of the task result.
self.dump(self.config)
================================================
FILE: docs/task_settings.rst
================================================
Task Settings
=============
Task settings. Also please refer to :doc:`task_parameters` section.
Directory to Save Outputs
-------------------------
We can use either a local directory or cloud storage such as S3 or GCS to save outputs.
If you would like to use a local directory, please set a local directory path to :attr:`~gokart.task.TaskOnKart.workspace_directory`. Please refer to :doc:`task_parameters` for how to set it up.
It is recommended to set this in the config file, since it rarely changes.
::
# base.ini
[TaskOnKart]
workspace_directory=${TASK_WORKSPACE_DIRECTORY}
.. code:: python
# main.py
import gokart
gokart.add_config('base.ini')
To use the S3 or GCS repository, please set the bucket path as ``s3://{YOUR_REPOSITORY_NAME}`` or ``gs://{YOUR_REPOSITORY_NAME}`` respectively.
If you use S3 or GCS, please set credentials via environment variables.
.. code:: sh
# S3
export AWS_ACCESS_KEY_ID='~~~' # AWS access key
export AWS_SECRET_ACCESS_KEY='~~~' # AWS secret access key
# GCS
export GCS_CREDENTIAL='~~~' # GCS credential
export DISCOVER_CACHE_LOCAL_PATH='~~~' # The local file path of discover api cache.
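For example, a workspace on S3 might be configured like this (the bucket name is a placeholder):

```ini
# base.ini
[TaskOnKart]
workspace_directory=s3://my-bucket/gokart-workspace
```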
Rerun task
----------
There are times when we want to rerun a task, such as after changing a script or as part of a batch job. Please use the ``rerun`` parameter or add an arbitrary parameter.
Set ``rerun`` as follows:
.. code:: python
# rerun TaskA
gokart.build(Task(rerun=True))
Or set it from a command-line argument as follows:
.. code:: python
# main.py
class Task(gokart.TaskOnKart[str]):
def run(self):
self.dump('hello')
.. code:: sh
python main.py Task --local-scheduler --rerun
The ``rerun`` parameter only looks at dependent tasks up to one level.
Example: Suppose we have a straight line pipeline composed of TaskA, TaskB and TaskC, and TaskC is an endpoint of this pipeline. We also suppose that all the tasks have already been executed.
- TaskA(rerun=True) -> TaskB -> TaskC # not rerunning
- TaskA -> TaskB(rerun=True) -> TaskC # rerunning TaskB and TaskC
This is due to the way intermediate files are handled. The ``rerun`` parameter is ``significant=False``, so it does not affect the hash value. It is very important to understand this difference.
If you want to change the parameters of TaskA and rerun TaskB and TaskC, we recommend adding an arbitrary parameter.
.. code:: python
class TaskA(gokart.TaskOnKart):
__version: luigi.IntParameter = luigi.IntParameter(default=1)
If the hash value of TaskA changes, the dependent tasks (in this case, TaskB and TaskC) will also rerun.
Fix random seed
---------------
Every task has a parameter named :attr:`~gokart.task.TaskOnKart.fix_random_seed_methods` and :attr:`~gokart.task.TaskOnKart.fix_random_seed_value`. This can be used to fix the random seed.
.. code:: python
import random
from typing import Any

import gokart
import numpy as np
import torch
class Task(gokart.TaskOnKart[dict[str, Any]]):
def run(self):
x = [random.randint(0, 100) for _ in range(0, 10)]
y = [np.random.randint(0, 100) for _ in range(0, 10)]
z = [torch.randn(1).tolist()[0] for _ in range(0, 5)]
self.dump({'random': x, 'numpy': y, 'torch': z})
gokart.build(
Task(
fix_random_seed_methods=[
"random.seed",
"numpy.random.seed",
"torch.random.manual_seed"],
fix_random_seed_value=57))
::
# //--- The output is as follows every time. ---
# {'random': [65, 41, 61, 37, 55, 81, 48, 2, 94, 21],
#  'numpy': [79, 86, 5, 22, 79, 98, 56, 40, 81, 37],
#  'torch': [0.14460121095180511, -0.11649507284164429,
#            0.6928958296775818, -0.916053831577301, 0.7317505478858948]}
This is useful when working with machine learning libraries.
================================================
FILE: docs/tutorial.rst
================================================
Tutorial
========
Also please refer to :doc:`intro_to_gokart` section.
1, Make gokart project
----------------------
Create a project using `cookiecutter-gokart <https://github.com/m3dev/cookiecutter-gokart>`_.
.. code:: sh
cookiecutter https://github.com/m3dev/cookiecutter-gokart
# project_name [project_name]: example
# package_name [package_name]: gokart_example
# python_version [3.7.0]:
# author [your name]: m3dev
# package_description [What's this project?]: gokart example
# license [MIT License]:
You will have a directory tree like the following:
.. code:: sh
tree example/
example/
├── Dockerfile
├── README.md
├── conf
│ ├── logging.ini
│ └── param.ini
├── gokart_example
│ ├── __init__.py
│ ├── model
│ │ ├── __init__.py
│ │ └── sample.py
│ └── utils
│ └── template.py
├── main.py
├── pyproject.toml
└── test
├── __init__.py
└── unit_test
└── test_sample.py
2, Running sample task
----------------------
Let's run the first task.
.. code:: sh
python main.py gokart_example.Sample --local-scheduler
The results are stored in the ``resources`` directory.
.. code:: sh
tree resources
resources/
├── gokart_example
│ └── model
│ └── sample
│ └── Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl
└── log
├── module_versions
│ └── Sample_cdf55a3d6c255d8c191f5f472da61f99.txt
├── processing_time
│ └── Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl
├── random_seed
│ └── Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl
├── task_log
│ └── Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl
└── task_params
└── Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl
Please refer to :doc:`intro_to_gokart` for details of the output.
.. note::
It is better to use poetry to pin module versions. Please refer to the `poetry document <https://python-poetry.org/docs/>`_
.. code:: sh
poetry lock
poetry run python main.py gokart_example.Sample --local-scheduler
If you want to stabilize it further, please use docker.
.. code:: sh
docker build -t sample .
docker run -it sample "python main.py gokart_example.Sample --local-scheduler"
3, Check result
---------------
Check the output.
.. code:: python
with open('resources/gokart_example/model/sample/Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl', 'rb') as f:
print(pickle.load(f)) # sample output
4, Run unittest
------------------
It is important to run the unit tests before and after modifying the code.
.. code:: sh
python -m unittest discover -s ./test/unit_test/
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
5, Create Task
--------------
Let's write gokart-style tasks.
Modify ``example/gokart_example/model/sample.py`` as follows:
.. code:: python
from logging import getLogger
import gokart
from gokart_example.utils.template import GokartTask
logger = getLogger(__name__)
class Sample(GokartTask):
def run(self):
self.dump('sample output')
class StringToSplit(GokartTask):
"""Like the function to divide received data by spaces."""
task: gokart.TaskInstanceParameter = gokart.TaskInstanceParameter()
def run(self):
sample = self.load('task')
self.dump(sample.split(' '))
class Main(GokartTask):
"""Endpoint task."""
def requires(self):
return StringToSplit(task=Sample())
We added ``Main`` and ``StringToSplit``. ``StringToSplit`` is a function-like task that loads the result of an arbitrary task and splits it by spaces. ``Main`` injects ``Sample`` into ``StringToSplit`` and acts as the endpoint of the pipeline.
Let’s run the ``Main`` task.
.. code:: sh
python main.py gokart_example.Main --local-scheduler
Please take a look at the logger output at this time.
::
===== Luigi Execution Summary =====
Scheduled 3 tasks of which:
* 1 complete ones were encountered:
- 1 gokart_example.Sample(...)
* 2 ran successfully:
- 1 gokart_example.Main(...)
- 1 gokart_example.StringToSplit(...)
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
As the log shows, ``Sample`` had already been executed once, so its cache was used.
Only ``Main`` and ``StringToSplit`` actually ran.
The output will look like the following, with the result in ``StringToSplit_b8a0ce6c972acbd77eae30f35da4307e.pkl``.
::
tree resources/
resources/
├── gokart_example
│ └── model
│ └── sample
│ ├── Sample_cdf55a3d6c255d8c191f5f472da61f99.pkl
│ └── StringToSplit_b8a0ce6c972acbd77eae30f35da4307e.pkl
...
.. code:: python
with open('resources/gokart_example/model/sample/StringToSplit_b8a0ce6c972acbd77eae30f35da4307e.pkl', 'rb') as f:
print(pickle.load(f)) # ['sample', 'output']
The added tasks ran as expected.
6, Rerun Task
-------------
Finally, let's rerun the task.
There are two ways to rerun a task.
Change the ``rerun`` parameter, or change the parameters of the dependent tasks.
``gokart.TaskOnKart`` can set the ``rerun`` parameter for each task like the following:
.. code:: python
class Main(GokartTask):
rerun=True
def requires(self):
return StringToSplit(task=Sample(rerun=True), rerun=True)
OR
Add new parameter on dependent tasks like following:
.. code:: python
class Sample(GokartTask):
version: luigi.IntParameter = luigi.IntParameter(default=1)
def run(self):
self.dump(f'sample output version {self.version}')
In both cases, all tasks will be rerun.
The difference is the hash value given to the output files.
The ``rerun`` parameter has no effect on the hash value,
so the task will be rerun with the same hash value.
In the second method, ``version parameter`` is added to the ``Sample`` task.
This parameter will change the hash value of ``Sample`` and generate another output file.
And the dependent task, ``StringToSplit``, will also have a different hash value, and rerun.
Please refer to :doc:`task_settings` for details.
Please try rerunning tasks yourself :)
Further Features
----------------
This is the end of the gokart tutorial.
This tutorial introduced only some of gokart's features.
There are still many more useful ones.
Please see the :doc:`task_on_kart`, :doc:`for_pandas` and :doc:`task_parameters` sections for more useful features of tasks.
Have a good gokart life.
================================================
FILE: docs/using_task_task_conflict_prevention_lock.rst
================================================
Task conflict prevention lock
=============================
If there is a possibility of multiple worker nodes executing the same task, task cache conflict may happen.
Specifically, while node A is loading the cache of a task, node B may be writing to it.
This can lead to reading corrupted data and other unwanted behaviors.
The redis lock introduced in this page is a feature to prevent such cache collisions.
Requirements
------------
You need to install `redis <https://redis.io/topics/quickstart>`_ to use this advanced feature.
How to use
-----------
1. Set up a redis server somewhere accessible from gokart/luigi jobs.
e.g. The following command will run redis on your localhost.
.. code:: bash
$ redis-server
2. Set the redis server hostname and port number as parameters of gokart.TaskOnKart().
You can set them by adding ``--redis-host=[your-redis-hostname] --redis-port=[redis-port-number]`` options to the gokart python script.
e.g.
.. code:: bash
python main.py sample.SomeTask --local-scheduler --redis-host=localhost --redis-port=6379
Alternatively, you may set the parameters in a config file.
e.g.
.. code::
[TaskOnKart]
redis_host=localhost
redis_port=6379
3. Done
With the above configuration, every task that inherits gokart.TaskOnKart will check with the redis server that no other node is accessing the same cache file whenever it dumps or loads that file.
================================================
FILE: examples/gokart_notebook_example.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gokart 1.0.2\n"
]
}
],
"source": [
"!pip list | grep gokart"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import gokart\n",
"import luigi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples of using gokart at jupyter notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage\n",
"This is a very basic usage, just to dump a run result of ExampleTaskA."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"example_2\n"
]
}
],
"source": [
"class ExampleTaskA(gokart.TaskOnKart):\n",
" param = luigi.Parameter()\n",
" int_param = luigi.IntParameter(default=2)\n",
"\n",
" def run(self):\n",
" self.dump(f'DONE {self.param}_{self.int_param}')\n",
"\n",
" \n",
"task_a = ExampleTaskA(param='example')\n",
"output = gokart.build(task=task_a)\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make tasks dependencies with `requires()`\n",
"ExampleTaskB is dependent on ExampleTaskC and ExampleTaskD. They are defined in `ExampleTaskB.requires()`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DONE example_TASKC_TASKD\n"
]
}
],
"source": [
"class ExampleTaskC(gokart.TaskOnKart):\n",
" def run(self):\n",
" self.dump('TASKC')\n",
" \n",
"class ExampleTaskD(gokart.TaskOnKart):\n",
" def run(self):\n",
" self.dump('TASKD')\n",
"\n",
"class ExampleTaskB(gokart.TaskOnKart):\n",
" param = luigi.Parameter()\n",
"\n",
" def requires(self):\n",
" return dict(task_c=ExampleTaskC(), task_d=ExampleTaskD())\n",
"\n",
" def run(self):\n",
" task_c = self.load('task_c')\n",
" task_d = self.load('task_d')\n",
" self.dump(f'DONE {self.param}_{task_c}_{task_d}')\n",
" \n",
"task_b = ExampleTaskB(param='example')\n",
"output = gokart.build(task=task_b)\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make tasks dependencies with TaskInstanceParameter\n",
"The dependencies are the same as in the previous example, but they are defined outside the task instead of in `ExampleTaskB.requires()`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DONE example_TASKC_TASKD\n"
]
}
],
"source": [
"class ExampleTaskC(gokart.TaskOnKart):\n",
" def run(self):\n",
" self.dump('TASKC')\n",
" \n",
"class ExampleTaskD(gokart.TaskOnKart):\n",
" def run(self):\n",
" self.dump('TASKD')\n",
"\n",
"class ExampleTaskB(gokart.TaskOnKart):\n",
" param = luigi.Parameter()\n",
" task_1 = gokart.TaskInstanceParameter()\n",
" task_2 = gokart.TaskInstanceParameter()\n",
"\n",
" def requires(self):\n",
" return dict(task_1=self.task_1, task_2=self.task_2) # required tasks are decided from the task parameters `task_1` and `task_2`\n",
"\n",
" def run(self):\n",
" task_1 = self.load('task_1')\n",
" task_2 = self.load('task_2')\n",
" self.dump(f'DONE {self.param}_{task_1}_{task_2}')\n",
" \n",
"task_b = ExampleTaskB(param='example', task_1=ExampleTaskC(), task_2=ExampleTaskD()) # Dependent tasks are defined here\n",
"output = gokart.build(task=task_b)\n",
"print(output)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.8 64-bit ('3.8.8': pyenv)",
"name": "python388jvsc74a57bd026997db2bf0f03e18da4e606f276befe0d6bf7cab2a6bb74742969d5bbde02ca"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"metadata": {
"interpreter": {
"hash": "26997db2bf0f03e18da4e606f276befe0d6bf7cab2a6bb74742969d5bbde02ca"
}
},
"orig_nbformat": 3
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: examples/logging.ini
================================================
[loggers]
keys=root,luigi,luigi-interface,gokart
[handlers]
keys=stderrHandler
[formatters]
keys=simpleFormatter
[logger_root]
level=INFO
handlers=stderrHandler
[logger_gokart]
level=INFO
handlers=stderrHandler
qualname=gokart
propagate=0
[logger_luigi]
level=INFO
handlers=stderrHandler
qualname=luigi
propagate=0
[logger_luigi-interface]
level=INFO
handlers=stderrHandler
qualname=luigi-interface
propagate=0
[handler_stderrHandler]
class=StreamHandler
formatter=simpleFormatter
args=(sys.stdout,)
[formatter_simpleFormatter]
format=level=%(levelname)s time=%(asctime)s name=%(name)s file=%(filename)s line=%(lineno)d message=%(message)s
datefmt=%Y/%m/%d %H:%M:%S
class=logging.Formatter
================================================
FILE: examples/param.ini
================================================
[TaskOnKart]
workspace_directory=./resource
local_temporary_directory=./resource/tmp
[core]
logging_conf_file=logging.ini
================================================
FILE: gokart/__init__.py
================================================
__all__ = [
'build',
'WorkerSchedulerFactory',
'make_tree_info',
'tree_info',
'PandasTypeConfig',
'ExplicitBoolParameter',
'ListTaskInstanceParameter',
'SerializableParameter',
'TaskInstanceParameter',
'ZonedDateSecondParameter',
'run',
'TaskOnKart',
'test_run',
'make_task_info_as_tree_str',
'add_config',
'delete_local_unnecessary_outputs',
]
from gokart.build import WorkerSchedulerFactory, build
from gokart.info import make_tree_info, tree_info
from gokart.pandas_type_config import PandasTypeConfig
from gokart.parameter import (
ExplicitBoolParameter,
ListTaskInstanceParameter,
SerializableParameter,
TaskInstanceParameter,
ZonedDateSecondParameter,
)
from gokart.run import run
from gokart.task import TaskOnKart
from gokart.testing import test_run
from gokart.tree.task_info import make_task_info_as_tree_str
from gokart.utils import add_config
from gokart.workspace_management import delete_local_unnecessary_outputs
================================================
FILE: gokart/build.py
================================================
from __future__ import annotations
import enum
import io
import logging
from dataclasses import dataclass
from functools import partial
from logging import getLogger
from typing import Any, Literal, Protocol, TypeVar, cast, overload
import backoff
import luigi
from luigi import LuigiStatusCode, rpc, scheduler
import gokart
import gokart.tree.task_info
from gokart import worker
from gokart.conflict_prevention_lock.task_lock import TaskLockException
from gokart.target import TargetOnKart
from gokart.task import TaskOnKart
T = TypeVar('T')
logger: logging.Logger = logging.getLogger(__name__)
class LoggerConfig:
def __init__(self, level: int):
self.logger = getLogger(__name__)
self.default_level = self.logger.level
self.level = level
def __enter__(self):
logging.disable(self.level - 10) # subtract 10 to disable below self.level
self.logger.setLevel(self.level)
return self
def __exit__(self, exception_type, exception_value, traceback):
logging.disable(self.default_level - 10) # subtract 10 to disable below self.level
self.logger.setLevel(self.default_level)
class GokartBuildError(Exception):
"""Raised when ``gokart.build`` failed. This exception contains raised exceptions in the task execution."""
def __init__(self, message: str, raised_exceptions: dict[str, list[Exception]]) -> None:
super().__init__(message)
self.raised_exceptions = raised_exceptions
class HasLockedTaskException(Exception):
"""Raised when the task failed to acquire the lock in the task execution."""
class TaskLockExceptionRaisedFlag:
def __init__(self):
self.flag: bool = False
class WorkerProtocol(Protocol):
"""Protocol for Worker.
This protocol is determined by luigi.worker.Worker.
"""
def add(self, task: TaskOnKart[Any]) -> bool: ...
def run(self) -> bool: ...
def __enter__(self) -> WorkerProtocol: ...
def __exit__(self, type: Any, value: Any, traceback: Any) -> Literal[False]: ...
class WorkerSchedulerFactory:
def create_local_scheduler(self) -> scheduler.Scheduler:
return scheduler.Scheduler(prune_on_get_work=True, record_task_history=False)
def create_remote_scheduler(self, url: str) -> rpc.RemoteScheduler:
return rpc.RemoteScheduler(url)
def create_worker(self, scheduler: scheduler.Scheduler, worker_processes: int, assistant: bool = False) -> WorkerProtocol:
return worker.Worker(scheduler=scheduler, worker_processes=worker_processes, assistant=assistant)
def _get_output(task: TaskOnKart[T]) -> T:
output = task.output()
# FIXME: currently, nested output is not supported
if isinstance(output, list) or isinstance(output, tuple):
return cast(T, [t.load() for t in output if isinstance(t, TargetOnKart)])
if isinstance(output, dict):
return cast(T, {k: t.load() for k, t in output.items() if isinstance(t, TargetOnKart)})
if isinstance(output, TargetOnKart):
return cast(T, output.load())
raise ValueError(f'output type is not supported: {type(output)}')
def _reset_register(keep={'gokart', 'luigi'}):
"""reset luigi.task_register.Register._reg everytime gokart.build called to avoid TaskClassAmbigiousException"""
luigi.task_register.Register._reg = [
x
for x in luigi.task_register.Register._reg
if (
(x.__module__.split('.')[0] in keep) # keep luigi and gokart
or (issubclass(x, gokart.PandasTypeConfig))
) # PandasTypeConfig should be kept
]
class TaskDumpMode(enum.Enum):
TREE = 'tree'
TABLE = 'table'
NONE = 'none'
class TaskDumpOutputType(enum.Enum):
PRINT = 'print'
DUMP = 'dump'
NONE = 'none'
@dataclass
class TaskDumpConfig:
mode: TaskDumpMode = TaskDumpMode.NONE
output_type: TaskDumpOutputType = TaskDumpOutputType.NONE
def process_task_info(task: TaskOnKart[Any], task_dump_config: TaskDumpConfig = TaskDumpConfig()) -> None:
match task_dump_config:
case TaskDumpConfig(mode=TaskDumpMode.NONE, output_type=TaskDumpOutputType.NONE):
pass
case TaskDumpConfig(mode=TaskDumpMode.TREE, output_type=TaskDumpOutputType.PRINT):
tree = gokart.make_tree_info(task)
logger.info(tree)
case TaskDumpConfig(mode=TaskDumpMode.TABLE, output_type=TaskDumpOutputType.PRINT):
table = gokart.tree.task_info.make_task_info_as_table(task)
output = io.StringIO()
table.to_csv(output, index=False, sep='\t')
output.seek(0)
logger.info(output.read())
case TaskDumpConfig(mode=TaskDumpMode.TREE, output_type=TaskDumpOutputType.DUMP):
tree = gokart.make_tree_info(task)
gokart.TaskOnKart().make_target(f'log/task_info/{type(task).__name__}.txt').dump(tree)
case TaskDumpConfig(mode=TaskDumpMode.TABLE, output_type=TaskDumpOutputType.DUMP):
table = gokart.tree.task_info.make_task_info_as_table(task)
gokart.TaskOnKart().make_target(f'log/task_info/{type(task).__name__}.pkl').dump(table)
case _:
raise ValueError(f'Unsupported TaskDumpConfig: {task_dump_config}')
@overload
def build(
task: TaskOnKart[T],
return_value: Literal[True] = True,
reset_register: bool = True,
log_level: int = logging.ERROR,
task_lock_exception_max_tries: int = 10,
task_lock_exception_max_wait_seconds: int = 600,
**env_params: Any,
) -> T: ...
@overload
def build(
task: TaskOnKart[T],
return_value: Literal[False],
reset_register: bool = True,
log_level: int = logging.ERROR,
task_lock_exception_max_tries: int = 10,
task_lock_exception_max_wait_seconds: int = 600,
**env_params: Any,
) -> None: ...
def build(
task: TaskOnKart[T],
return_value: bool = True,
reset_register: bool = True,
log_level: int = logging.ERROR,
task_lock_exception_max_tries: int = 10,
task_lock_exception_max_wait_seconds: int = 600,
task_dump_config: TaskDumpConfig = TaskDumpConfig(),
**env_params: Any,
) -> T | None:
"""
Run gokart task for local interpreter.
Shares most of its parameters with luigi.build (see https://luigi.readthedocs.io/en/stable/api/luigi.html?highlight=build#luigi.build)
"""
if reset_register:
_reset_register()
with LoggerConfig(level=log_level):
log_handler_before_run = logging.StreamHandler()
logger.addHandler(log_handler_before_run)
process_task_info(task, task_dump_config)
logger.removeHandler(log_handler_before_run)
log_handler_before_run.close()
task_lock_exception_raised = TaskLockExceptionRaisedFlag()
raised_exceptions: dict[str, list[Exception]] = dict()
@TaskOnKart.event_handler(luigi.Event.FAILURE)
def when_failure(task, exception):
if isinstance(exception, TaskLockException):
task_lock_exception_raised.flag = True
else:
raised_exceptions.setdefault(task.make_unique_id(), []).append(exception)
@backoff.on_exception(
partial(backoff.expo, max_value=task_lock_exception_max_wait_seconds), HasLockedTaskException, max_tries=task_lock_exception_max_tries
)
def _build_task():
task_lock_exception_raised.flag = False
result = luigi.build(
[task],
worker_scheduler_factory=WorkerSchedulerFactory(),
local_scheduler=True,
detailed_summary=True,
log_level=logging.getLevelName(log_level),
**env_params,
)
if task_lock_exception_raised.flag:
raise HasLockedTaskException()
if result.status in (LuigiStatusCode.FAILED, LuigiStatusCode.FAILED_AND_SCHEDULING_FAILED, LuigiStatusCode.SCHEDULING_FAILED):
raise GokartBuildError(result.summary_text, raised_exceptions=raised_exceptions)
return _get_output(task) if return_value else None
return cast(T | None, _build_task())
================================================
FILE: gokart/config_params.py
================================================
from __future__ import annotations
from typing import Any
import luigi
import gokart
class inherits_config_params:
def __init__(self, config_class: type[luigi.Config], parameter_alias: dict[str, str] | None = None):
"""
Decorates task to inherit parameter value of `config_class`.
* config_class: Inherit parameter values of this task to the decorated task. Only parameter values that exist in both tasks are inherited.
* parameter_alias: Dictionary to map parameter names between the config_class task and the decorated task.
key: config_class's parameter name. value: decorated task's parameter name.
"""
self._config_class: type[luigi.Config] = config_class
self._parameter_alias: dict[str, str] = parameter_alias if parameter_alias is not None else {}
def __call__(self, task_class: type[gokart.TaskOnKart[Any]]) -> type[gokart.TaskOnKart[Any]]:
# wrap task to prevent task name from being changed
@luigi.task._task_wraps(task_class)
class Wrapped(task_class): # type: ignore
@classmethod
def get_param_values(cls, params, args, kwargs):
for param_key, param_value in self._config_class().param_kwargs.items():
task_param_key = self._parameter_alias.get(param_key, param_key)
if hasattr(cls, task_param_key) and task_param_key not in kwargs:
kwargs[task_param_key] = param_value
return super().get_param_values(params, args, kwargs)
return Wrapped
================================================
FILE: gokart/conflict_prevention_lock/task_lock.py
================================================
from __future__ import annotations
import functools
import os
from logging import getLogger
from typing import Any, NamedTuple
import redis
from apscheduler.schedulers.background import BackgroundScheduler
logger = getLogger(__name__)
class TaskLockParams(NamedTuple):
redis_host: str | None
redis_port: int | None
redis_timeout: int | None
redis_key: str
should_task_lock: bool
raise_task_lock_exception_on_collision: bool
lock_extend_seconds: int
class TaskLockException(Exception):
    """Raised when the task failed to acquire the lock in the task execution. Only used internally."""
class RedisClient:
_instances: dict[Any, Any] = {}
def __new__(cls, *args, **kwargs):
key = (args, tuple(sorted(kwargs.items())))
if cls not in cls._instances:
cls._instances[cls] = {}
if key not in cls._instances[cls]:
cls._instances[cls][key] = super().__new__(cls)
return cls._instances[cls][key]
def __init__(self, host: str | None, port: int | None) -> None:
if not hasattr(self, '_redis_client'):
host = host or 'localhost'
port = port or 6379
self._redis_client = redis.Redis(host=host, port=port)
def get_redis_client(self):
return self._redis_client
def _extend_lock(task_lock: redis.lock.Lock, redis_timeout: int) -> None:
task_lock.extend(additional_time=redis_timeout, replace_ttl=True)
def set_task_lock(task_lock_params: TaskLockParams) -> redis.lock.Lock:
redis_client = RedisClient(host=task_lock_params.redis_host, port=task_lock_params.redis_port).get_redis_client()
blocking = not task_lock_params.raise_task_lock_exception_on_collision
task_lock = redis.lock.Lock(redis=redis_client, name=task_lock_params.redis_key, timeout=task_lock_params.redis_timeout, thread_local=False)
if not task_lock.acquire(blocking=blocking):
raise TaskLockException('Lock already taken by other task.')
return task_lock
def set_lock_scheduler(task_lock: redis.lock.Lock, task_lock_params: TaskLockParams) -> BackgroundScheduler:
scheduler = BackgroundScheduler()
extend_lock = functools.partial(_extend_lock, task_lock=task_lock, redis_timeout=task_lock_params.redis_timeout or 0)
scheduler.add_job(
extend_lock,
'interval',
seconds=task_lock_params.lock_extend_seconds,
max_instances=999999999,
misfire_grace_time=task_lock_params.redis_timeout,
coalesce=False,
)
scheduler.start()
return scheduler
def make_task_lock_key(file_path: str, unique_id: str | None) -> str:
basename_without_ext = os.path.splitext(os.path.basename(file_path))[0]
return f'{basename_without_ext}_{unique_id}'
def make_task_lock_params(
file_path: str,
unique_id: str | None,
redis_host: str | None = None,
redis_port: int | None = None,
redis_timeout: int | None = None,
raise_task_lock_exception_on_collision: bool = False,
lock_extend_seconds: int = 10,
) -> TaskLockParams:
redis_key = make_task_lock_key(file_path, unique_id)
should_task_lock = redis_host is not None and redis_port is not None
if redis_timeout is not None:
assert redis_timeout > lock_extend_seconds, f'`redis_timeout` must be set greater than lock_extend_seconds:{lock_extend_seconds}, not {redis_timeout}.'
task_lock_params = TaskLockParams(
redis_host=redis_host,
redis_port=redis_port,
redis_key=redis_key,
should_task_lock=should_task_lock,
redis_timeout=redis_timeout,
raise_task_lock_exception_on_collision=raise_task_lock_exception_on_collision,
lock_extend_seconds=lock_extend_seconds,
)
return task_lock_params
def make_task_lock_params_for_run(task_self: Any, lock_extend_seconds: int = 10) -> TaskLockParams:
task_path_name = os.path.join(task_self.__module__.replace('.', '/'), f'{type(task_self).__name__}')
unique_id = task_self.make_unique_id() + '-run'
task_lock_key = make_task_lock_key(file_path=task_path_name, unique_id=unique_id)
should_task_lock = task_self.redis_host is not None and task_self.redis_port is not None
return TaskLockParams(
redis_host=task_self.redis_host,
redis_port=task_self.redis_port,
redis_key=task_lock_key,
should_task_lock=should_task_lock,
redis_timeout=task_self.redis_timeout,
raise_task_lock_exception_on_collision=True,
lock_extend_seconds=lock_extend_seconds,
)
================================================
FILE: gokart/conflict_prevention_lock/task_lock_wrappers.py
================================================
from __future__ import annotations
import functools
from collections.abc import Callable
from logging import getLogger
from typing import ParamSpec, TypeVar
from gokart.conflict_prevention_lock.task_lock import TaskLockParams, set_lock_scheduler, set_task_lock
logger = getLogger(__name__)
P = ParamSpec('P')
R = TypeVar('R')
def wrap_dump_with_lock(func: Callable[P, R], task_lock_params: TaskLockParams, exist_check: Callable[..., bool]) -> Callable[P, R | None]:
"""Redis lock wrapper function for TargetOnKart.dump().
When TargetOnKart.dump() is called, dump() is wrapped with a Redis lock and a cache existence check.
https://github.com/m3dev/gokart/issues/265
"""
if not task_lock_params.should_task_lock:
return func
def wrapper(*args: P.args, **kwargs: P.kwargs) -> R | None:
task_lock = set_task_lock(task_lock_params=task_lock_params)
scheduler = set_lock_scheduler(task_lock=task_lock, task_lock_params=task_lock_params)
try:
logger.debug(f'Task DUMP lock of {task_lock_params.redis_key} locked.')
if not exist_check():
return func(*args, **kwargs)
return None
finally:
logger.debug(f'Task DUMP lock of {task_lock_params.redis_key} released.')
task_lock.release()
scheduler.shutdown()
return wrapper
def wrap_load_with_lock(func: Callable[P, R], task_lock_params: TaskLockParams) -> Callable[P, R]:
"""Redis lock wrapper function for TargetOnKart.load().
When TargetOnKart.load() is called, the Redis lock is acquired and released before load() runs.
https://github.com/m3dev/gokart/issues/265
"""
if not task_lock_params.should_task_lock:
return func
def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
task_lock = set_task_lock(task_lock_params=task_lock_params)
scheduler = set_lock_scheduler(task_lock=task_lock, task_lock_params=task_lock_params)
logger.debug(f'Task LOAD lock of {task_lock_params.redis_key} locked.')
task_lock.release()
logger.debug(f'Task LOAD lock of {task_lock_params.redis_key} released.')
scheduler.shutdown()
result = func(*args, **kwargs)
return result
return wrapper
def wrap_remove_with_lock(func: Callable[P, R], task_lock_params: TaskLockParams) -> Callable[P, R]:
"""Redis lock wrapper function for TargetOnKart.remove().
When TargetOnKart.remove() is called, remove() is simply wrapped with a Redis lock.
https://github.com/m3dev/gokart/issues/265
"""
if not task_lock_params.should_task_lock:
return func
def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
task_lock = set_task_lock(task_lock_params=task_lock_params)
scheduler = set_lock_scheduler(task_lock=task_lock, task_lock_params=task_lock_params)
try:
logger.debug(f'Task REMOVE lock of {task_lock_params.redis_key} locked.')
result = func(*args, **kwargs)
task_lock.release()
logger.debug(f'Task REMOVE lock of {task_lock_params.redis_key} released.')
scheduler.shutdown()
return result
except BaseException as e:
logger.debug(f'Task REMOVE lock of {task_lock_params.redis_key} released with BaseException.')
task_lock.release()
scheduler.shutdown()
raise e
return wrapper
def wrap_run_with_lock(run_func: Callable[[], R], task_lock_params: TaskLockParams) -> Callable[[], R]:
@functools.wraps(run_func)
def wrapped():
task_lock = set_task_lock(task_lock_params=task_lock_params)
scheduler = set_lock_scheduler(task_lock=task_lock, task_lock_params=task_lock_params)
try:
logger.debug(f'Task RUN lock of {task_lock_params.redis_key} locked.')
result = run_func()
task_lock.release()
logger.debug(f'Task RUN lock of {task_lock_params.redis_key} released.')
scheduler.shutdown()
return result
except BaseException as e:
logger.debug(f'Task RUN lock of {task_lock_params.redis_key} released with BaseException.')
task_lock.release()
scheduler.shutdown()
raise e
return wrapped
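The dump wrapper above follows a check-then-write pattern: acquire the lock, skip the write when the target already exists, and always release the lock. A minimal sketch of that pattern with a dummy in-process lock (no Redis or scheduler involved):

```python
def wrap_dump_with_lock(func, lock, exist_check):
    # Acquire the lock, run func only if the target does not exist yet,
    # and always release the lock afterwards.
    def wrapper(*args, **kwargs):
        lock.acquire()
        try:
            if not exist_check():
                return func(*args, **kwargs)
            return None
        finally:
            lock.release()
    return wrapper

class DummyLock:
    def __init__(self):
        self.held = False
    def acquire(self):
        self.held = True
    def release(self):
        self.held = False

lock = DummyLock()
dump = wrap_dump_with_lock(lambda x: x * 2, lock, exist_check=lambda: False)
print(dump(21))  # 42
```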
================================================
FILE: gokart/errors/__init__.py
================================================
from gokart.build import GokartBuildError, HasLockedTaskException
from gokart.pandas_type_config import PandasTypeError
from gokart.task import EmptyDumpError
__all__ = [
'GokartBuildError',
'HasLockedTaskException',
'PandasTypeError',
'EmptyDumpError',
]
================================================
FILE: gokart/file_processor/__init__.py
================================================
"""File processor module with support for multiple DataFrame backends."""
from __future__ import annotations
import os
from typing import Any, Literal
# Export common processors and types from base
from gokart.file_processor.base import (
BinaryFileProcessor,
DataFrameType,
FileProcessor,
GzipFileProcessor,
NpzFileProcessor,
PickleFileProcessor,
TextFileProcessor,
XmlFileProcessor,
)
# Import backend-specific implementations
from gokart.file_processor.pandas import (
CsvFileProcessorPandas,
FeatherFileProcessorPandas,
JsonFileProcessorPandas,
ParquetFileProcessorPandas,
)
from gokart.file_processor.polars import (
CsvFileProcessorPolars,
FeatherFileProcessorPolars,
JsonFileProcessorPolars,
ParquetFileProcessorPolars,
)
class CsvFileProcessor(FileProcessor):
"""CSV file processor with automatic backend selection based on dataframe_type."""
def __init__(self, sep: str = ',', encoding: str = 'utf-8', dataframe_type: DataFrameType = 'pandas') -> None:
"""
CSV file processor with support for both pandas and polars DataFrames.
Args:
sep: CSV delimiter (default: ',')
encoding: File encoding (default: 'utf-8')
dataframe_type: DataFrame library to use for load() - 'pandas', 'polars', or 'polars-lazy' (default: 'pandas')
"""
self._sep = sep
self._encoding = encoding
self._dataframe_type = dataframe_type # Store for tests
if dataframe_type == 'polars-lazy':
self._impl: FileProcessor = CsvFileProcessorPolars(sep=sep, encoding=encoding, lazy=True)
elif dataframe_type == 'polars':
self._impl = CsvFileProcessorPolars(sep=sep, encoding=encoding, lazy=False)
else:
self._impl = CsvFileProcessorPandas(sep=sep, encoding=encoding)
def format(self):
return self._impl.format()
def load(self, file):
return self._impl.load(file)
def dump(self, obj, file):
return self._impl.dump(obj, file)
class JsonFileProcessor(FileProcessor):
"""JSON file processor with automatic backend selection based on dataframe_type."""
def __init__(self, orient: Literal['split', 'records', 'index', 'table', 'columns', 'values'] | None = None, dataframe_type: DataFrameType = 'pandas'):
"""
JSON file processor with support for both pandas and polars DataFrames.
Args:
orient: JSON orientation. 'records' for newline-delimited JSON.
dataframe_type: DataFrame library to use for load() - 'pandas', 'polars', or 'polars-lazy' (default: 'pandas')
"""
self._orient = orient
self._dataframe_type = dataframe_type # Store for tests
if dataframe_type == 'polars-lazy':
self._impl: FileProcessor = JsonFileProcessorPolars(orient=orient, lazy=True)
elif dataframe_type == 'polars':
self._impl = JsonFileProcessorPolars(orient=orient, lazy=False)
else:
self._impl = JsonFileProcessorPandas(orient=orient)
def format(self):
return self._impl.format()
def load(self, file):
return self._impl.load(file)
def dump(self, obj, file):
return self._impl.dump(obj, file)
class ParquetFileProcessor(FileProcessor):
"""Parquet file processor with automatic backend selection based on dataframe_type."""
def __init__(self, engine: Any = 'pyarrow', compression: Any = None, dataframe_type: DataFrameType = 'pandas') -> None:
"""
Parquet file processor with support for both pandas and polars DataFrames.
Args:
engine: Parquet engine (pandas-specific, ignored for polars).
compression: Compression type.
dataframe_type: DataFrame library to use for load() - 'pandas', 'polars', or 'polars-lazy' (default: 'pandas')
"""
self._engine = engine
self._compression = compression
self._dataframe_type = dataframe_type # Store for tests
if dataframe_type == 'polars-lazy':
self._impl: FileProcessor = ParquetFileProcessorPolars(engine=engine, compression=compression, lazy=True)
elif dataframe_type == 'polars':
self._impl = ParquetFileProcessorPolars(engine=engine, compression=compression, lazy=False)
else:
self._impl = ParquetFileProcessorPandas(engine=engine, compression=compression)
def format(self):
return self._impl.format()
def load(self, file):
return self._impl.load(file)
def dump(self, obj, file):
# Use the configured implementation (pandas by default)
return self._impl.dump(obj, file)
class FeatherFileProcessor(FileProcessor):
"""Feather file processor with automatic backend selection based on dataframe_type."""
def __init__(self, store_index_in_feather: bool, dataframe_type: DataFrameType = 'pandas'):
"""
Feather file processor with support for both pandas and polars DataFrames.
Args:
store_index_in_feather: Whether to store pandas index (pandas-only feature).
dataframe_type: DataFrame library to use for load() - 'pandas', 'polars', or 'polars-lazy' (default: 'pandas')
"""
self._store_index_in_feather = store_index_in_feather
self._dataframe_type = dataframe_type # Store for tests
if dataframe_type == 'polars-lazy':
self._impl: FileProcessor = FeatherFileProcessorPolars(store_index_in_feather=store_index_in_feather, lazy=True)
elif dataframe_type == 'polars':
self._impl = FeatherFileProcessorPolars(store_index_in_feather=store_index_in_feather, lazy=False)
else:
self._impl = FeatherFileProcessorPandas(store_index_in_feather=store_index_in_feather)
def format(self):
return self._impl.format()
def load(self, file):
return self._impl.load(file)
def dump(self, obj, file):
# Use the configured implementation (pandas by default)
return self._impl.dump(obj, file)
def make_file_processor(file_path: str, store_index_in_feather: bool = True, *, dataframe_type: DataFrameType = 'pandas') -> FileProcessor:
"""Create a file processor based on file extension with default parameters."""
extension2processor = {
'.txt': TextFileProcessor(),
'.ini': TextFileProcessor(),
'.csv': CsvFileProcessor(sep=',', dataframe_type=dataframe_type),
'.tsv': CsvFileProcessor(sep='\t', dataframe_type=dataframe_type),
'.pkl': PickleFileProcessor(),
'.gz': GzipFileProcessor(),
'.json': JsonFileProcessor(dataframe_type=dataframe_type),
'.ndjson': JsonFileProcessor(dataframe_type=dataframe_type, orient='records'),
'.xml': XmlFileProcessor(),
'.npz': NpzFileProcessor(),
'.parquet': ParquetFileProcessor(compression='gzip', dataframe_type=dataframe_type),
'.feather': FeatherFileProcessor(store_index_in_feather=store_index_in_feather, dataframe_type=dataframe_type),
'.png': BinaryFileProcessor(),
'.jpg': BinaryFileProcessor(),
}
extension = os.path.splitext(file_path)[1]
assert extension in extension2processor, f'{extension} is not supported. The supported extensions are {list(extension2processor.keys())}.'
return extension2processor[extension]
__all__ = [
# Base classes and types
'FileProcessor',
'DataFrameType',
# Common processors
'BinaryFileProcessor',
'PickleFileProcessor',
'TextFileProcessor',
'GzipFileProcessor',
'XmlFileProcessor',
'NpzFileProcessor',
# DataFrame processors (with factory pattern)
'CsvFileProcessor',
'JsonFileProcessor',
'ParquetFileProcessor',
'FeatherFileProcessor',
# Utility functions
'make_file_processor',
]
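`make_file_processor` above dispatches on the file extension. The dispatch itself can be sketched standalone; the mapping below is a simplified stand-in for the real processor table, not the actual processor objects:

```python
import os

# Simplified stand-in for the extension -> processor table in make_file_processor.
extension2processor = {
    '.csv': ('csv', ','),
    '.tsv': ('csv', '\t'),
    '.pkl': ('pickle', None),
    '.parquet': ('parquet', None),
}

def pick_processor(file_path):
    extension = os.path.splitext(file_path)[1]
    assert extension in extension2processor, f'{extension} is not supported.'
    return extension2processor[extension]

print(pick_processor('data/output.tsv'))  # ('csv', '\t')
```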
================================================
FILE: gokart/file_processor/base.py
================================================
from __future__ import annotations
import xml.etree.ElementTree as ET
from abc import abstractmethod
from io import BytesIO
from logging import getLogger
from typing import Any, Literal, cast
import dill
import luigi
import luigi.format
import numpy as np
from gokart.utils import load_dill_with_pandas_backward_compatibility
logger = getLogger(__name__)
# Type alias for DataFrame library return type
DataFrameType = Literal['pandas', 'polars', 'polars-lazy']
class FileProcessor:
@abstractmethod
def format(self) -> Any: ...
@abstractmethod
def load(self, file: Any) -> Any: ...
@abstractmethod
def dump(self, obj: Any, file: Any) -> None: ...
class BinaryFileProcessor(FileProcessor):
"""
Pass bytes to this processor
```
figure_binary = io.BytesIO()
plt.savefig(figure_binary)
figure_binary.seek(0)
BinaryFileProcessor().dump(figure_binary.read())
```
"""
def format(self):
return luigi.format.Nop
def load(self, file):
return file.read()
def dump(self, obj, file):
file.write(obj)
class _ChunkedLargeFileReader:
def __init__(self, file: Any) -> None:
self._file = file
def __getattr__(self, item):
return getattr(self._file, item)
def read(self, n: int) -> bytes:
if n >= (1 << 31):
logger.info(f'reading a large file with total_bytes={n}.')
buffer = bytearray(n)
idx = 0
while idx < n:
batch_size = min(n - idx, (1 << 31) - 1)
logger.info(f'reading bytes [{idx}, {idx + batch_size})...')
buffer[idx : idx + batch_size] = self._file.read(batch_size)
idx += batch_size
logger.info('done.')
return bytes(buffer)
return cast(bytes, self._file.read(n))
def readline(self) -> bytes:
return cast(bytes, self._file.readline())
def seek(self, offset: int) -> None:
self._file.seek(offset)
def seekable(self) -> bool:
return cast(bool, self._file.seekable())
class PickleFileProcessor(FileProcessor):
def format(self):
return luigi.format.Nop
def load(self, file):
if not file.seekable():
# load_dill_with_pandas_backward_compatibility() requires a file with seek() and readlines() implemented.
# Therefore, we wrap the file with BytesIO, which makes it seekable and readable line by line.
# For example, ReadableS3File is not a seekable file.
return load_dill_with_pandas_backward_compatibility(BytesIO(file.read()))
return load_dill_with_pandas_backward_compatibility(_ChunkedLargeFileReader(file))
def dump(self, obj, file):
self._write(dill.dumps(obj, protocol=4), file)
@staticmethod
def _write(buffer, file):
n = len(buffer)
idx = 0
while idx < n:
logger.info(f'writing a file with total_bytes={n}...')
batch_size = min(n - idx, (1 << 31) - 1)
logger.info(f'writing bytes [{idx}, {idx + batch_size})')
file.write(buffer[idx : idx + batch_size])
idx += batch_size
logger.info('done')
class TextFileProcessor(FileProcessor):
def format(self):
return None
def load(self, file):
return [s.rstrip() for s in file.readlines()]
def dump(self, obj, file):
if isinstance(obj, list):
for x in obj:
file.write(str(x) + '\n')
else:
file.write(str(obj))
class GzipFileProcessor(FileProcessor):
def format(self):
return luigi.format.Gzip
def load(self, file):
return [s.rstrip().decode() for s in file.readlines()]
def dump(self, obj, file):
if isinstance(obj, list):
for x in obj:
file.write((str(x) + '\n').encode())
else:
file.write(str(obj).encode())
class XmlFileProcessor(FileProcessor):
def format(self):
return None
def load(self, file):
try:
return ET.parse(file)
except ET.ParseError:
return ET.ElementTree()
def dump(self, obj, file):
assert isinstance(obj, ET.ElementTree), f'requires ET.ElementTree, but {type(obj)} is passed.'
obj.write(file)
class NpzFileProcessor(FileProcessor):
def format(self):
return luigi.format.Nop
def load(self, file):
return np.load(file)['data']
def dump(self, obj, file):
assert isinstance(obj, np.ndarray), f'requires np.ndarray, but {type(obj)} is passed.'
np.savez_compressed(file, data=obj)
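`_ChunkedLargeFileReader` above works around the roughly 2 GiB per-call read limit of some file objects by filling a preallocated buffer in batches of at most `2**31 - 1` bytes. The batching loop can be demonstrated with a small batch limit:

```python
from io import BytesIO

def read_in_batches(file, n, batch_limit):
    # Fill a preallocated buffer with reads of at most batch_limit bytes,
    # mirroring _ChunkedLargeFileReader.read for very large n.
    buffer = bytearray(n)
    idx = 0
    while idx < n:
        batch_size = min(n - idx, batch_limit)
        buffer[idx : idx + batch_size] = file.read(batch_size)
        idx += batch_size
    return bytes(buffer)

data = bytes(range(10)) * 10  # 100 bytes
result = read_in_batches(BytesIO(data), len(data), batch_limit=7)
print(result == data)  # True
```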
================================================
FILE: gokart/file_processor/pandas.py
================================================
"""Pandas-specific file processor implementations."""
from __future__ import annotations
from io import BytesIO
from typing import Literal
import luigi
import luigi.format
import pandas as pd
from luigi.format import TextFormat
from gokart.file_processor.base import FileProcessor
from gokart.object_storage import ObjectStorage
class CsvFileProcessorPandas(FileProcessor):
"""CSV file processor for pandas DataFrames."""
def __init__(self, sep: str = ',', encoding: str = 'utf-8') -> None:
self._sep = sep
self._encoding = encoding
super().__init__()
def format(self):
return TextFormat(encoding=self._encoding)
def load(self, file):
try:
return pd.read_csv(file, sep=self._sep, encoding=self._encoding)
except pd.errors.EmptyDataError:
return pd.DataFrame()
def dump(self, obj, file):
if not isinstance(obj, pd.DataFrame | pd.Series):
raise TypeError(f'requires pd.DataFrame or pd.Series, but {type(obj)} is passed.')
obj.to_csv(file, mode='wt', index=False, sep=self._sep, header=True, encoding=self._encoding)
_JsonOrient = Literal['split', 'records', 'index', 'table', 'columns', 'values']
class JsonFileProcessorPandas(FileProcessor):
"""JSON file processor for pandas DataFrames."""
def __init__(self, orient: _JsonOrient | None = None):
self._orient: _JsonOrient | None = orient
def format(self):
return luigi.format.Nop
def load(self, file):
try:
return pd.read_json(file, orient=self._orient, lines=True if self._orient == 'records' else False)
except pd.errors.EmptyDataError:
return pd.DataFrame()
def dump(self, obj, file):
if isinstance(obj, dict):
obj = pd.DataFrame.from_dict(obj)
if not isinstance(obj, pd.DataFrame | pd.Series):
raise TypeError(f'requires pd.DataFrame or pd.Series or dict, but {type(obj)} is passed.')
obj.to_json(file, orient=self._orient, lines=True if self._orient == 'records' else False)
class ParquetFileProcessorPandas(FileProcessor):
"""Parquet file processor for pandas DataFrames."""
def __init__(self, engine: Literal['auto', 'pyarrow', 'fastparquet'] = 'pyarrow', compression: str | None = None) -> None:
self._engine: Literal['auto', 'pyarrow', 'fastparquet'] = engine
self._compression = compression
super().__init__()
def format(self):
return luigi.format.Nop
def load(self, file):
# FIXME(mamo3gr): enable streaming (chunked) read with S3.
# pandas.read_parquet accepts a file-like object,
# but the file (luigi.contrib.s3.ReadableS3File) would need a 'tell' method,
# which pandas requires to read a file in chunks.
if ObjectStorage.is_buffered_reader(file):
return pd.read_parquet(file.name)
else:
return pd.read_parquet(BytesIO(file.read()))
def dump(self, obj, file):
if not isinstance(obj, pd.DataFrame):
raise TypeError(f'requires pd.DataFrame, but {type(obj)} is passed.')
# MEMO: to_parquet only supports a filepath as string (not a file handle)
obj.to_parquet(file.name, index=False, engine=self._engine, compression=self._compression)
class FeatherFileProcessorPandas(FileProcessor):
"""Feather file processor for pandas DataFrames."""
def __init__(self, store_index_in_feather: bool):
super().__init__()
self._store_index_in_feather = store_index_in_feather
self.INDEX_COLUMN_PREFIX = '__feather_gokart_index__'
def format(self):
return luigi.format.Nop
def load(self, file):
# FIXME(mamo3gr): enable streaming (chunked) read with S3.
# pandas.read_feather accepts a file-like object,
# but the file (luigi.contrib.s3.ReadableS3File) would need a 'tell' method,
# which pandas requires to read a file in chunks.
if ObjectStorage.is_buffered_reader(file):
loaded_df = pd.read_feather(file.name)
else:
loaded_df = pd.read_feather(BytesIO(file.read()))
if self._store_index_in_feather:
if any(col.startswith(self.INDEX_COLUMN_PREFIX) for col in loaded_df.columns):
index_columns = [col_name for col_name in loaded_df.columns[::-1] if col_name[: len(self.INDEX_COLUMN_PREFIX)] == self.INDEX_COLUMN_PREFIX]
index_column = index_columns[0]
index_name = index_column[len(self.INDEX_COLUMN_PREFIX) :]
if index_name == 'None':
index_name = None
loaded_df.index = pd.Index(loaded_df[index_column].values, name=index_name)
loaded_df = loaded_df.drop(columns=[index_column])
return loaded_df
def dump(self, obj, file):
if not isinstance(obj, pd.DataFrame):
raise TypeError(f'requires pd.DataFrame, but {type(obj)} is passed.')
dump_obj = obj.copy()
if self._store_index_in_feather:
index_column_name = f'{self.INDEX_COLUMN_PREFIX}{dump_obj.index.name}'
assert index_column_name not in dump_obj.columns, (
f'column name {index_column_name} already exists in dump_obj. \nConsider not saving index by setting store_index_in_feather=False.'
)
assert dump_obj.index.name != 'None', 'index name is "None", which is not allowed in gokart. Consider setting another index name.'
dump_obj[index_column_name] = dump_obj.index
dump_obj = dump_obj.reset_index(drop=True)
# to_feather accepts a binary file-like object, but the `file` variable here is text-mode, so write via the file path instead.
dump_obj.to_feather(file.name)
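The feather processor above round-trips the pandas index through a regular column whose name is a fixed prefix plus the index name, with the literal string `'None'` marking a nameless index. The naming convention itself can be sketched standalone (the helper names below are hypothetical, not part of gokart):

```python
INDEX_COLUMN_PREFIX = '__feather_gokart_index__'

def encode_index_column(index_name):
    # f-string renders a missing index name (None) as the string 'None'.
    return f'{INDEX_COLUMN_PREFIX}{index_name}'

def decode_index_name(column_name):
    name = column_name[len(INDEX_COLUMN_PREFIX):]
    return None if name == 'None' else name

print(decode_index_name(encode_index_column('date')))  # date
print(decode_index_name(encode_index_column(None)))  # None
```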
================================================
FILE: gokart/file_processor/polars.py
================================================
"""Polars-specific file processor implementations."""
from __future__ import annotations
from io import BytesIO
from typing import TYPE_CHECKING, Literal
import luigi
import luigi.format
from luigi.format import TextFormat
from gokart.file_processor.base import FileProcessor
from gokart.object_storage import ObjectStorage
_CsvEncoding = Literal['utf8', 'utf8-lossy']
_ParquetCompression = Literal['lz4', 'uncompressed', 'snappy', 'gzip', 'brotli', 'zstd']
try:
import polars as pl
HAS_POLARS = True
except ImportError:
HAS_POLARS = False
if TYPE_CHECKING:
import polars as pl
class CsvFileProcessorPolars(FileProcessor):
"""CSV file processor for polars DataFrames."""
def __init__(self, sep: str = ',', encoding: str = 'utf-8', lazy: bool = False) -> None:
if not HAS_POLARS:
raise ImportError("polars is required for polars-based dataframe types ('polars' or 'polars-lazy'). Install with: pip install polars")
self._sep = sep
self._encoding = encoding
self._lazy = lazy
super().__init__()
def format(self):
return TextFormat(encoding=self._encoding)
def load(self, file):
try:
# scan_csv/read_csv only support 'utf8' and 'utf8-lossy'
encoding: _CsvEncoding = 'utf8' if self._encoding in ('utf-8', 'utf8') else 'utf8-lossy'
if self._lazy:
# scan_csv requires a file path, not a file object
return pl.scan_csv(file.name, separator=self._sep, encoding=encoding)
return pl.read_csv(file, separator=self._sep, encoding=encoding)
except Exception as e:
# Handle empty data gracefully
if 'empty' in str(e).lower() or 'no data' in str(e).lower():
return pl.LazyFrame() if self._lazy else pl.DataFrame()
raise
def dump(self, obj, file):
if isinstance(obj, pl.LazyFrame):
obj = obj.collect()
if not isinstance(obj, pl.DataFrame):
raise TypeError(f'requires pl.DataFrame or pl.LazyFrame, but {type(obj)} is passed.')
obj.write_csv(file, separator=self._sep, include_header=True)
class JsonFileProcessorPolars(FileProcessor):
"""JSON file processor for polars DataFrames."""
def __init__(self, orient: str | None = None, lazy: bool = False):
if not HAS_POLARS:
raise ImportError("polars is required for polars-based dataframe types ('polars' or 'polars-lazy'). Install with: pip install polars")
self._orient = orient
self._lazy = lazy
def format(self):
return luigi.format.Nop
def load(self, file):
try:
if self._orient == 'records':
if self._lazy:
return pl.scan_ndjson(file)
return pl.read_ndjson(file)
else:
# polars doesn't have scan_json, so we read and convert if lazy
df = pl.read_json(file)
return df.lazy() if self._lazy else df
except Exception as e:
# Handle empty files
if 'empty' in str(e).lower() or 'no data' in str(e).lower():
return pl.LazyFrame() if self._lazy else pl.DataFrame()
raise
def dump(self, obj, file):
if isinstance(obj, pl.LazyFrame):
obj = obj.collect()
if not isinstance(obj, pl.DataFrame):
raise TypeError(f'requires pl.DataFrame or pl.LazyFrame, but {type(obj)} is passed.')
if self._orient == 'records':
obj.write_ndjson(file)
else:
obj.write_json(file)
class ParquetFileProcessorPolars(FileProcessor):
"""Parquet file processor for polars DataFrames."""
def __init__(self, engine: str = 'pyarrow', compression: _ParquetCompression | None = None, lazy: bool = False) -> None:
if not HAS_POLARS:
raise ImportError("polars is required for polars-based dataframe types ('polars' or 'polars-lazy'). Install with: pip install polars")
self._engine = engine # Ignored for polars
self._compression: _ParquetCompression | None = compression
self._lazy = lazy
super().__init__()
def format(self):
return luigi.format.Nop
def load(self, file):
# polars.read_parquet can handle file paths or file-like objects
if ObjectStorage.is_buffered_reader(file):
if self._lazy:
return pl.scan_parquet(file.name)
return pl.read_parquet(file.name)
else:
data = BytesIO(file.read())
if self._lazy:
# scan_parquet doesn't work with BytesIO, so read and convert
return pl.read_parquet(data).lazy()
return pl.read_parquet(data)
def dump(self, obj, file):
if isinstance(obj, pl.LazyFrame):
obj = obj.collect()
if not isinstance(obj, pl.DataFrame):
raise TypeError(f'requires pl.DataFrame or pl.LazyFrame, but {type(obj)} is passed.')
# polars write_parquet requires a file path; default to 'zstd' when compression is None
obj.write_parquet(file.name, compression=self._compression or 'zstd')
class FeatherFileProcessorPolars(FileProcessor):
"""Feather file processor for polars DataFrames."""
def __init__(self, store_index_in_feather: bool, lazy: bool = False):
if not HAS_POLARS:
raise ImportError("polars is required for polars-based dataframe types ('polars' or 'polars-lazy'). Install with: pip install polars")
super().__init__()
self._store_index_in_feather = store_index_in_feather # Ignored for polars
self._lazy = lazy
def format(self):
return luigi.format.Nop
def load(self, file):
# polars uses read_ipc for feather format
if ObjectStorage.is_buffered_reader(file):
if self._lazy:
return pl.scan_ipc(file.name)
return pl.read_ipc(file.name)
else:
data = BytesIO(file.read())
if self._lazy:
# scan_ipc doesn't work with BytesIO, so read and convert
return pl.read_ipc(data).lazy()
return pl.read_ipc(data)
def dump(self, obj, file):
if isinstance(obj, pl.LazyFrame):
obj = obj.collect()
if not isinstance(obj, pl.DataFrame):
raise TypeError(f'requires pl.DataFrame or pl.LazyFrame, but {type(obj)} is passed.')
# polars uses write_ipc for feather format
# Note: store_index_in_feather is ignored for polars as it's pandas-specific
obj.write_ipc(file.name)
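The polars processors above share an empty-data fallback: errors whose message mentions "empty" or "no data" are mapped to an empty frame, while everything else is re-raised. That pattern can be sketched standalone (no polars required; `list` stands in for the empty-frame constructor):

```python
def load_or_empty(load_fn, make_empty):
    # Mirror the polars processors' fallback: treat "empty"/"no data" errors
    # as an empty result, re-raise anything else.
    try:
        return load_fn()
    except Exception as e:
        if 'empty' in str(e).lower() or 'no data' in str(e).lower():
            return make_empty()
        raise

def raise_no_data():
    raise ValueError('no data in file')

print(load_or_empty(raise_no_data, make_empty=list))  # []
```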
================================================
FILE: gokart/file_processor.py
================================================
================================================
FILE: gokart/gcs_config.py
================================================
from __future__ import annotations
import json
import os
from typing import cast
import luigi
import luigi.contrib.gcs
from google.oauth2.service_account import Credentials
class GCSConfig(luigi.Config):
gcs_credential_name: luigi.StrParameter = luigi.StrParameter(default='GCS_CREDENTIAL', description='GCS credential environment variable.')
_client = None
def get_gcs_client(self) -> luigi.contrib.gcs.GCSClient:
if self._client is None: # use cache as like singleton object
self._client = self._get_gcs_client()
return self._client
def _get_gcs_client(self) -> luigi.contrib.gcs.GCSClient:
return luigi.contrib.gcs.GCSClient(oauth_credentials=self._load_oauth_credentials())
def _load_oauth_credentials(self) -> Credentials | None:
json_str = os.environ.get(self.gcs_credential_name)
if not json_str:
return None
if os.path.isfile(json_str):
return cast(Credentials, Credentials.from_service_account_file(json_str))
return cast(Credentials, Credentials.from_service_account_info(json.loads(json_str)))
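`_load_oauth_credentials` above accepts the credential environment variable in two forms: a path to a service-account JSON file, or the JSON content inlined into the variable. The resolution logic can be sketched without the Google libraries (the return tuples below are illustrative markers, not real credential objects):

```python
import json
import os

def resolve_credential(value):
    # Mirrors GCSConfig._load_oauth_credentials: the env var may hold either
    # a path to a service-account JSON file or the JSON content itself.
    if not value:
        return None
    if os.path.isfile(value):
        return ('from_file', value)
    return ('from_info', json.loads(value))

print(resolve_credential('{"client_email": "svc@example.com"}'))
```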
================================================
FILE: gokart/gcs_obj_metadata_client.py
================================================
from __future__ import annotations
import copy
import functools
import json
import re
from collections.abc import Iterable
from logging import getLogger
from typing import Any, Final
from urllib.parse import urlsplit
from googleapiclient.model import makepatch
from gokart.gcs_config import GCSConfig
from gokart.required_task_output import RequiredTaskOutput
from gokart.utils import FlattenableItems
logger = getLogger(__name__)
class GCSObjectMetadataClient:
"""
This is a utility class and should not be instantiated.
It is used to add task metadata to GCS objects as labels.
"""
# Maximum metadata size for GCS objects (8 KiB)
MAX_GCS_METADATA_SIZE: Final[int] = 8 * 1024
@staticmethod
def _is_log_related_path(path: str) -> bool:
return re.match(r'^gs://.+?/log/(processing_time/|task_info/|task_log/|module_versions/|random_seed/|task_params/).+', path) is not None
# This is the copied method of luigi.gcs._path_to_bucket_and_key(path).
@staticmethod
def _path_to_bucket_and_key(path: str) -> tuple[str, str]:
(scheme, netloc, path, _, _) = urlsplit(path)
assert scheme == 'gs'
path_without_initial_slash = path[1:]
return netloc, path_without_initial_slash
@staticmethod
def add_task_state_labels(
path: str,
task_params: dict[str, str] | None = None,
custom_labels: dict[str, str] | None = None,
required_task_outputs: FlattenableItems[RequiredTaskOutput] | None = None,
) -> None:
if GCSObjectMetadataClient._is_log_related_path(path):
return
# The same call can be found in gokart/object_storage.get_time_stamp.
# _path_to_bucket_and_key is a private method, so relying on it here might not be acceptable.
bucket, obj = GCSObjectMetadataClient._path_to_bucket_and_key(path)
_response = GCSConfig().get_gcs_client().client.objects().get(bucket=bucket, object=obj).execute()
if _response is None:
logger.error(f'failed to get object from GCS bucket {bucket} and object {obj}.')
return
response: dict[str, Any] = dict(_response)
original_metadata: dict[Any, Any] = {}
if 'metadata' in response.keys():
_metadata = response.get('metadata')
if _metadata is not None:
original_metadata = dict(_metadata)
patched_metadata = GCSObjectMetadataClient._get_patched_obj_metadata(
copy.deepcopy(original_metadata),
task_params,
custom_labels,
required_task_outputs,
)
if original_metadata != patched_metadata:
# The update API removes existing object metadata, so we use the patch API instead.
# See the official documentation:
# [Link] https://cloud.google.com/storage/docs/viewing-editing-metadata?hl=ja#rest-set-object-metadata
update_response = (
GCSConfig()
.get_gcs_client()
.client.objects()
.patch(
bucket=bucket,
object=obj,
body=makepatch({'metadata': original_metadata}, {'metadata': patched_metadata}),
)
.execute()
)
if update_response is None:
logger.error(f'failed to patch object {obj} in bucket {bucket}.')
@staticmethod
def _normalize_labels(labels: dict[str, Any] | None) -> dict[str, str]:
return {str(key): str(value) for key, value in labels.items()} if labels else {}
@staticmethod
def _get_patched_obj_metadata(
metadata: Any,
task_params: dict[str, str] | None = None,
custom_labels: dict[str, str] | None = None,
required_task_outputs: FlattenableItems[RequiredTaskOutput] | None = None,
) -> dict[str, Any] | Any:
# If the metadata in the response from the bucket/object lookup is not a dictionary,
# something has likely gone wrong, so return the original metadata unpatched.
if not isinstance(metadata, dict):
logger.warning(f'metadata is not a dict: {metadata}; something likely went wrong while getting the bucket and object information.')
return metadata
# Maximum size of metadata for each object is 8 KiB.
# [Link]: https://cloud.google.com/storage/quotas#objects
normalized_task_params_labels = GCSObjectMetadataClient._normalize_labels(task_params)
normalized_custom_labels = GCSObjectMetadataClient._normalize_labels(custom_labels)
# The keys of user-provided labels (custom_labels) may conflict with those generated from task parameters (task_params_labels).
# However, users who utilize custom_labels are no longer expected to search using the labels generated from task parameters.
# Instead, users are expected to search using the labels they provided.
# Therefore, in the event of a key conflict, the value registered by the user-provided labels will take precedence.
normalized_labels = [normalized_custom_labels, normalized_task_params_labels]
if required_task_outputs:
normalized_labels.append({'__required_task_outputs': json.dumps(GCSObjectMetadataClient._get_serialized_string(required_task_outputs))})
_merged_labels = GCSObjectMetadataClient._merge_custom_labels_and_task_params_labels(normalized_labels)
return GCSObjectMetadataClient._adjust_gcs_metadata_limit_size(dict(metadata) | _merged_labels)
@staticmethod
def _get_serialized_string(required_task_outputs: FlattenableItems[RequiredTaskOutput]) -> FlattenableItems[str]:
if isinstance(required_task_outputs, RequiredTaskOutput):
return required_task_outputs.serialize()
elif isinstance(required_task_outputs, dict):
return {k: GCSObjectMetadataClient._get_serialized_string(v) for k, v in required_task_outputs.items()}
elif isinstance(required_task_outputs, Iterable):
return [GCSObjectMetadataClient._get_serialized_string(ro) for ro in required_task_outputs]
else:
raise TypeError(
f'Unsupported type for required_task_outputs: {type(required_task_outputs)}. '
'It should be RequiredTaskOutput, dict, or iterable of RequiredTaskOutput.'
)
@staticmethod
def _merge_custom_labels_and_task_params_labels(
normalized_labels_list: list[dict[str, str]],
) -> dict[str, str]:
def __merge_two_dicts_helper(merged: dict[str, str], current_labels: dict[str, str]) -> dict[str, str]:
next_merged = copy.deepcopy(merged)
for label_name, label_value in current_labels.items():
if len(label_value) == 0:
logger.warning(f'value of label_name={label_name} is empty, so skipping it as metadata.')
continue
if label_name in next_merged:
logger.warning(f'label_name={label_name} has already been set, so skipping it as metadata.')
continue
next_merged[label_name] = label_value
return next_merged
return functools.reduce(__merge_two_dicts_helper, normalized_labels_list, {})
# Google Cloud Storage (GCS) limits per-object metadata to 8 KiB,
# so the metadata size must be adjusted to fit within that limit.
@staticmethod
def _adjust_gcs_metadata_limit_size(_labels: dict[str, str]) -> dict[str, str]:
def _get_label_size(label_name: str, label_value: str) -> int:
return len(label_name.encode('utf-8')) + len(label_value.encode('utf-8'))
labels = copy.deepcopy(_labels)
max_gcs_metadata_size, current_total_metadata_size = (
GCSObjectMetadataClient.MAX_GCS_METADATA_SIZE,
sum(_get_label_size(label_name, label_value) for label_name, label_value in labels.items()),
)
if current_total_metadata_size <= max_gcs_metadata_size:
return labels
# NOTE: remove labels to stay within max metadata size.
to_remove = []
for label_name, label_value in reversed(tuple(labels.items())):
size = _get_label_size(label_name, label_value)
to_remove.append(label_name)
current_total_metadata_size -= size
if current_total_metadata_size <= max_gcs_metadata_size:
break
for key in to_remove:
del labels[key]
return labels
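The merge above is order-sensitive: the reduce starts from the custom labels, so on a key conflict the user-provided value wins, and empty values are dropped. A standalone sketch of the same precedence rule, assuming nothing from gokart itself (the function and label names here are hypothetical):

```python
import functools


def merge_labels(label_dicts: list[dict[str, str]]) -> dict[str, str]:
    # Earlier dicts win on key conflicts; empty values are skipped,
    # mirroring _merge_custom_labels_and_task_params_labels above.
    def merge(acc: dict[str, str], cur: dict[str, str]) -> dict[str, str]:
        out = dict(acc)
        for key, value in cur.items():
            if value and key not in out:
                out[key] = value
        return out

    return functools.reduce(merge, label_dicts, {})


custom = {'owner': 'team-a', 'run_id': '42'}
task_params = {'run_id': '7', 'date': '2024-01-01'}
# custom labels come first, so 'run_id' keeps the user-provided value '42'
merged = merge_labels([custom, task_params])
```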
================================================
FILE: gokart/gcs_zip_client.py
================================================
from __future__ import annotations
import os
import shutil
from typing import cast
from gokart.gcs_config import GCSConfig
from gokart.zip_client import ZipClient, _unzip_file
class GCSZipClient(ZipClient):
def __init__(self, file_path: str, temporary_directory: str) -> None:
self._file_path = file_path
self._temporary_directory = temporary_directory
self._client = GCSConfig().get_gcs_client()
def exists(self) -> bool:
return cast(bool, self._client.exists(self._file_path))
def make_archive(self) -> None:
extension = os.path.splitext(self._file_path)[1]
shutil.make_archive(base_name=self._temporary_directory, format=extension[1:], root_dir=self._temporary_directory)
self._client.put(self._temporary_file_path(), self._file_path)
def unpack_archive(self) -> None:
os.makedirs(self._temporary_directory, exist_ok=True)
file_pointer = self._client.download(self._file_path)
_unzip_file(fp=file_pointer, extract_dir=self._temporary_directory)
def remove(self) -> None:
self._client.remove(self._file_path)
@property
def path(self) -> str:
return self._file_path
def _temporary_file_path(self):
extension = os.path.splitext(self._file_path)[1]
base_name = self._temporary_directory
if base_name.endswith('/'):
base_name = base_name[:-1]
return base_name + extension
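`_temporary_file_path` derives the local archive path from the temporary directory name plus the remote file's extension. A self-contained sketch of that path logic (the free function here is a hypothetical stand-in for the method):

```python
import os


def temporary_file_path(file_path: str, temporary_directory: str) -> str:
    # Mirrors GCSZipClient._temporary_file_path: the local archive path is the
    # temporary directory name (minus any trailing slash) plus the remote extension.
    extension = os.path.splitext(file_path)[1]
    base_name = temporary_directory
    if base_name.endswith('/'):
        base_name = base_name[:-1]
    return base_name + extension


local_archive = temporary_file_path('gs://bucket/archive/data.zip', '/tmp/workdir/')
```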
================================================
FILE: gokart/in_memory/__init__.py
================================================
__all__ = [
'InMemoryCacheRepository',
'InMemoryTarget',
'make_in_memory_target',
]
from .repository import InMemoryCacheRepository
from .target import InMemoryTarget, make_in_memory_target
================================================
FILE: gokart/in_memory/data.py
================================================
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime
from typing import Any
@dataclass
class InMemoryData:
value: Any
last_modification_time: datetime
@classmethod
def create_data(cls, value: Any) -> InMemoryData:
return cls(value=value, last_modification_time=datetime.now())
================================================
FILE: gokart/in_memory/repository.py
================================================
from __future__ import annotations
from collections.abc import Iterator
from datetime import datetime
from typing import Any
from .data import InMemoryData
class InMemoryCacheRepository:
_cache: dict[str, InMemoryData] = {}
def __init__(self):
pass
def get_value(self, key: str) -> Any:
return self._get_data(key).value
def get_last_modification_time(self, key: str) -> datetime:
return self._get_data(key).last_modification_time
def _get_data(self, key: str) -> InMemoryData:
return self._cache[key]
def set_value(self, key: str, obj: Any) -> None:
data = InMemoryData.create_data(obj)
self._cache[key] = data
def has(self, key: str) -> bool:
return key in self._cache
def remove(self, key: str) -> None:
assert self.has(key), f'{key} does not exist.'
del self._cache[key]
def empty(self) -> bool:
return not self._cache
def clear(self) -> None:
self._cache.clear()
def get_gen(self) -> Iterator[tuple[str, Any]]:
for key, data in self._cache.items():
yield key, data.value
@property
def size(self) -> int:
return len(self._cache)
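`InMemoryCacheRepository` keeps its store in a class-level dict, so every instance shares the same cache. A minimal standalone sketch of that sharing behavior (`MiniCache` is a hypothetical stand-in, not the gokart class):

```python
from datetime import datetime


class MiniCache:
    # Class-level dict: every instance shares one store, just like
    # InMemoryCacheRepository._cache in the module above.
    _cache: dict = {}

    def set_value(self, key, obj) -> None:
        self._cache[key] = (obj, datetime.now())

    def get_value(self, key):
        return self._cache[key][0]

    def has(self, key) -> bool:
        return key in self._cache


writer, reader = MiniCache(), MiniCache()
writer.set_value('model', {'weights': [1, 2, 3]})
value = reader.get_value('model')  # visible through a different instance
```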
================================================
FILE: gokart/in_memory/target.py
================================================
from __future__ import annotations
from datetime import datetime
from typing import Any
from gokart.in_memory.repository import InMemoryCacheRepository
from gokart.required_task_output import RequiredTaskOutput
from gokart.target import TargetOnKart, TaskLockParams
from gokart.utils import FlattenableItems
_repository = InMemoryCacheRepository()
class InMemoryTarget(TargetOnKart):
def __init__(self, data_key: str, task_lock_param: TaskLockParams):
if task_lock_param.should_task_lock:
raise ValueError('Redis with `InMemoryTarget` is not currently supported.')
self._data_key = data_key
self._task_lock_params = task_lock_param
def _exists(self) -> bool:
return _repository.has(self._data_key)
def _get_task_lock_params(self) -> TaskLockParams:
return self._task_lock_params
def _load(self) -> Any:
return _repository.get_value(self._data_key)
def _dump(
self,
obj: Any,
task_params: dict[str, str] | None = None,
custom_labels: dict[str, str] | None = None,
required_task_outputs: FlattenableItems[RequiredTaskOutput] | None = None,
) -> None:
return _repository.set_value(self._data_key, obj)
def _remove(self) -> None:
_repository.remove(self._data_key)
def _last_modification_time(self) -> datetime:
if not _repository.has(self._data_key):
raise ValueError(f'No object with id {self._data_key} has been stored.')
time = _repository.get_last_modification_time(self._data_key)
return time
def _path(self) -> str:
# TODO: this method name `_path` might not be appropriate
return self._data_key
def make_in_memory_target(target_key: str, task_lock_params: TaskLockParams) -> InMemoryTarget:
return InMemoryTarget(target_key, task_lock_params)
================================================
FILE: gokart/info.py
================================================
from __future__ import annotations
from logging import getLogger
from typing import Any
import luigi
from gokart.task import TaskOnKart
from gokart.tree.task_info import make_task_info_as_tree_str
logger = getLogger(__name__)
def make_tree_info(
task: TaskOnKart[Any],
indent: str = '',
last: bool = True,
details: bool = False,
abbr: bool = True,
visited_tasks: set[str] | None = None,
ignore_task_names: list[str] | None = None,
) -> str:
"""
Return a string representation of the tasks and their statuses/parameters in a dependency tree format.
This function has moved to `gokart.tree.task_info.make_task_info_as_tree_str`.
This code is retained for backward compatibility.
Parameters
----------
- task: TaskOnKart
Root task.
- details: bool
Whether or not to output details.
- abbr: bool
Whether or not to simplify tasks information that has already appeared.
- ignore_task_names: list[str] | None
List of task names to ignore.
Returns
-------
- tree_info : str
Formatted task dependency tree.
"""
return make_task_info_as_tree_str(task=task, details=details, abbr=abbr, ignore_task_names=ignore_task_names)
class tree_info(TaskOnKart[Any]):
mode: luigi.StrParameter = luigi.StrParameter(default='', description='This must be in ["simple", "all"].')
output_path: luigi.StrParameter = luigi.StrParameter(default='tree.txt', description='Output file path.')
def output(self):
return self.make_target(self.output_path, use_unique_id=False)
================================================
FILE: gokart/mypy.py
================================================
"""Plugin that provides support for gokart.TaskOnKart.
This code reuses code from mypy.plugins.dataclasses
https://github.com/python/mypy/blob/0753e2a82dad35034e000609b6e8daa37238bfaa/mypy/plugins/dataclasses.py
"""
from __future__ import annotations
import re
import sys
import warnings
from collections.abc import Callable, Iterator
from dataclasses import dataclass
from enum import Enum
from typing import Any, Final, Literal
import luigi
from mypy.expandtype import expand_type
from mypy.nodes import (
ARG_NAMED,
ARG_NAMED_OPT,
ArgKind,
Argument,
AssignmentStmt,
Block,
CallExpr,
ClassDef,
EllipsisExpr,
Expression,
IfStmt,
JsonDict,
MemberExpr,
NameExpr,
PlaceholderNode,
RefExpr,
Statement,
TempNode,
TypeInfo,
Var,
)
from mypy.options import Options
from mypy.plugin import ClassDefContext, FunctionContext, Plugin, SemanticAnalyzerPluginInterface
from mypy.plugins.common import (
add_method_to_class,
deserialize_and_fixup_type,
)
from mypy.server.trigger import make_wildcard_trigger
from mypy.state import state
from mypy.typeops import map_type_from_supertype
from mypy.types import (
AnyType,
Instance,
NoneType,
Type,
TypeOfAny,
UnionType,
)
from mypy.typevars import fill_typevars
METADATA_TAG: Final[str] = 'task_on_kart'
PARAMETER_FULLNAME_MATCHER: Final = re.compile(r'^(gokart|luigi)(\.parameter)?\.\w*Parameter$')
PARAMETER_TMP_MATCHER: Final = re.compile(r'^\w*Parameter$')
class PluginOptions(Enum):
DISALLOW_MISSING_PARAMETERS = 'disallow_missing_parameters'
@dataclass
class TaskOnKartPluginOptions:
# Whether to error on missing parameters in the constructor.
# Some projects use luigi.Config to set parameters, which does not require parameters to be explicitly passed to the constructor.
disallow_missing_parameters: bool = False
@classmethod
def _parse_toml(cls, config_file: str) -> dict[str, Any]:
if sys.version_info >= (3, 11):
import tomllib as toml_
else:
try:
import tomli as toml_
except ImportError: # pragma: no cover
warnings.warn('install tomli to parse pyproject.toml under Python 3.10', stacklevel=1)
return {}
with open(config_file, 'rb') as f:
return toml_.load(f)
@classmethod
def parse_config_file(cls, config_file: str) -> TaskOnKartPluginOptions:
# TODO: support other configuration file formats if necessary.
if not config_file.endswith('.toml'):
warnings.warn('gokart mypy plugin can be configured by pyproject.toml', stacklevel=1)
return cls()
config = cls._parse_toml(config_file)
gokart_plugin_config = config.get('tool', {}).get('gokart-mypy', {})
disallow_missing_parameters = gokart_plugin_config.get(PluginOptions.DISALLOW_MISSING_PARAMETERS.value, False)
if not isinstance(disallow_missing_parameters, bool):
raise ValueError(f'{PluginOptions.DISALLOW_MISSING_PARAMETERS.value} must be a boolean value')
return cls(disallow_missing_parameters=disallow_missing_parameters)
class TaskOnKartPlugin(Plugin):
def __init__(self, options: Options) -> None:
super().__init__(options)
if options.config_file is not None:
self._options = TaskOnKartPluginOptions.parse_config_file(options.config_file)
else:
self._options = TaskOnKartPluginOptions()
def get_base_class_hook(self, fullname: str) -> Callable[[ClassDefContext], None] | None:
# The following gathers attributes from gokart.TaskOnKart such as `workspace_directory`;
# the transformation has no effect because the class already has the `__init__` method of `gokart.TaskOnKart`.
#
# NOTE: `gokart.task.luigi.Task` condition is required for the release of luigi versions without py.typed
if fullname in {'gokart.task.luigi.Task', 'luigi.task.Task'}:
return self._task_on_kart_class_maker_callback
sym = self.lookup_fully_qualified(fullname)
if sym and isinstance(sym.node, TypeInfo):
if any(base.fullname == 'gokart.task.TaskOnKart' for base in sym.node.mro):
return self._task_on_kart_class_maker_callback
return None
def get_function_hook(self, fullname: str) -> Callable[[FunctionContext], Type] | None:
"""Adjust the return type of the `Parameters` function."""
if PARAMETER_FULLNAME_MATCHER.match(fullname):
return self._task_on_kart_parameter_field_callback
return None
def _task_on_kart_class_maker_callback(self, ctx: ClassDefContext) -> None:
transformer = TaskOnKartTransformer(ctx.cls, ctx.reason, ctx.api, self._options)
transformer.transform()
def _task_on_kart_parameter_field_callback(self, ctx: FunctionContext) -> Type:
"""Extract the type of the `default` argument from the Field function, and use it as the return type.
In particular:
* Retrieve the type of the argument which is specified, and use it as return type for the function.
* If no default argument is specified, return AnyType with unannotated type instead of parameter types like `luigi.Parameter()`.
This makes mypy avoid conflicts between the type annotation and the parameter type.
e.g.
```python
foo: int = luigi.IntParameter()
```
"""
try:
default_idx = ctx.callee_arg_names.index('default')
# if no `default` argument is found, return AnyType with unannotated type.
except ValueError:
return AnyType(TypeOfAny.unannotated)
default_args = ctx.args[default_idx]
if default_args:
default_type = ctx.arg_types[0][0]
default_arg = default_args[0]
# Fallback to default Any type if the field is required
if not isinstance(default_arg, EllipsisExpr):
return default_type
# NOTE: This is a workaround to avoid the error between type annotation and parameter type.
# As the following code snippet, the type of `foo` is `int` but the assigned value is `luigi.IntParameter()`.
# foo: int = luigi.IntParameter()
# TODO: infer mypy type from the parameter type.
return AnyType(TypeOfAny.unannotated)
class TaskOnKartAttribute:
def __init__(
self,
name: str,
has_default: bool,
line: int,
column: int,
type: Type | None,
info: TypeInfo,
api: SemanticAnalyzerPluginInterface,
options: TaskOnKartPluginOptions,
) -> None:
self.name = name
self.has_default = has_default
self.line = line
self.column = column
self.type = type # Type as __init__ argument
self.info = info
self._api = api
self._options = options
def to_argument(self, current_info: TypeInfo, *, of: Literal['__init__',]) -> Argument:
if of == '__init__':
arg_kind = self._get_arg_kind_by_options()
return Argument(
variable=self.to_var(current_info),
type_annotation=self.expand_type(current_info),
initializer=EllipsisExpr() if self.has_default else None, # Only used by stubgen
kind=arg_kind,
)
def expand_type(self, current_info: TypeInfo) -> Type | None:
if self.type is not None and self.info.self_type is not None:
# In general, it is not safe to call `expand_type()` during semantic analysis,
# however this plugin is called very late, so all types should be fully ready.
# Also, it is tricky to avoid eager expansion of Self types here (e.g. because
# we serialize attributes).
with state.strict_optional_set(self._api.options.strict_optional):
return expand_type(self.type, {self.info.self_type.id: fill_typevars(current_info)})
return self.type
def to_var(self, current_info: TypeInfo) -> Var:
return Var(self.name, self.expand_type(current_info))
def serialize(self) -> JsonDict:
assert self.type
return {
'name': self.name,
'has_default': self.has_default,
'line': self.line,
'column': self.column,
'type': self.type.serialize(),
}
@classmethod
def deserialize(cls, info: TypeInfo, data: JsonDict, api: SemanticAnalyzerPluginInterface, options: TaskOnKartPluginOptions) -> TaskOnKartAttribute:
data = data.copy()
typ = deserialize_and_fixup_type(data.pop('type'), api)
return cls(type=typ, info=info, **data, api=api, options=options)
def expand_typevar_from_subtype(self, sub_type: TypeInfo) -> None:
"""Expands type vars in the context of a subtype when an attribute is inherited
from a generic super type."""
if self.type is not None:
with state.strict_optional_set(self._api.options.strict_optional):
self.type = map_type_from_supertype(self.type, sub_type, self.info)
def _get_arg_kind_by_options(self) -> Literal[ArgKind.ARG_NAMED, ArgKind.ARG_NAMED_OPT]:
"""Set the argument kind based on the options.
if `disallow_missing_parameters` is True, the argument kind is `ARG_NAMED` when the attribute has no default value.
This means that all parameters are passed to the constructor as keyword-only arguments.
Returns:
Literal[ArgKind.ARG_NAMED, ArgKind.ARG_NAMED_OPT]: The argument kind.
"""
if not self._options.disallow_missing_parameters:
return ARG_NAMED_OPT
if self.has_default:
return ARG_NAMED_OPT
# required parameter
return ARG_NAMED
class TaskOnKartTransformer:
"""Implement the behavior of gokart.TaskOnKart."""
def __init__(
self,
cls: ClassDef,
reason: Expression | Statement,
api: SemanticAnalyzerPluginInterface,
options: TaskOnKartPluginOptions,
) -> None:
self._cls = cls
self._reason = reason
self._api = api
self._options = options
def transform(self) -> bool:
"""Apply all the necessary transformations to the underlying gokart.TaskOnKart"""
info = self._cls.info
attributes = self.collect_attributes()
if attributes is None:
# Some definitions are not ready. We need another pass.
return False
for attr in attributes:
if attr.type is None:
return False
# If there are no attributes, it may be that the semantic analyzer has not
# processed them yet. In order to work around this, we can simply skip generating
# __init__ if there are no attributes, because if the user truly did not define any,
# then the object default __init__ with an empty signature will be present anyway.
if ('__init__' not in info.names or info.names['__init__'].plugin_generated) and attributes:
args = [attr.to_argument(info, of='__init__') for attr in attributes]
add_method_to_class(self._api, self._cls, '__init__', args=args, return_type=NoneType())
info.metadata[METADATA_TAG] = {
'attributes': [attr.serialize() for attr in attributes],
}
return True
def _get_assignment_statements_from_if_statement(self, stmt: IfStmt) -> Iterator[AssignmentStmt]:
for body in stmt.body:
if not body.is_unreachable:
yield from self._get_assignment_statements_from_block(body)
if stmt.else_body is not None and not stmt.else_body.is_unreachable:
yield from self._get_assignment_statements_from_block(stmt.else_body)
def _get_assignment_statements_from_block(self, block: Block) -> Iterator[AssignmentStmt]:
for stmt in block.body:
if isinstance(stmt, AssignmentStmt):
yield stmt
elif isinstance(stmt, IfStmt):
yield from self._get_assignment_statements_from_if_statement(stmt)
def collect_attributes(self) -> list[TaskOnKartAttribute] | None:
"""Collect all attributes declared in the task and its parents.
All assignments of the form
a: SomeType
b: SomeOtherType = ...
are collected.
Return None if some base class hasn't been processed
yet and thus we'll need to ask for another pass.
"""
cls = self._cls
# First, collect attributes belonging to any class in the MRO, ignoring duplicates.
#
# We iterate through the MRO in reverse because attrs defined in the parent must appear
# earlier in the attributes list than attrs defined in the child.
#
# However, we also want attributes defined in the subtype to override ones defined
# in the parent. We can implement this via a dict without disrupting the attr order
# because dicts preserve insertion order in Python 3.7+.
found_attrs: dict[str, TaskOnKartAttribute] = {}
for info in reversed(cls.info.mro[1:-1]):
if METADATA_TAG not in info.metadata:
continue
# Each class depends on the set of attributes in its task_on_kart ancestors.
self._api.add_plugin_dependency(make_wildcard_trigger(info.fullname))
for data in info.metadata[METADATA_TAG]['attributes']:
name: str = data['name']
attr = TaskOnKartAttribute.deserialize(info, data, self._api, self._options)
# TODO: We shouldn't be performing type operations during the main
# semantic analysis pass, since some TypeInfo attributes might
# still be in flux. This should be performed in a later phase.
attr.expand_typevar_from_subtype(cls.info)
found_attrs[name] = attr
sym_node = cls.info.names.get(name)
if sym_node and sym_node.node and not isinstance(sym_node.node, Var):
self._api.fail(
'TaskOnKart attribute may only be overridden by another attribute',
sym_node.node,
)
# Second, collect attributes belonging to the current class.
current_attr_names: set[str] = set()
for stmt in self._get_assignment_statements_from_block(cls.defs):
if not is_parameter_call(stmt.rvalue):
continue
# a: int, b: str = 1, 'foo' is not supported syntax so we
# don't have to worry about it.
lhs = stmt.lvalues[0]
if not isinstance(lhs, NameExpr):
continue
sym = cls.info.names.get(lhs.name)
if sym is None:
# There was probably a semantic analysis error.
continue
node = sym.node
assert not isinstance(node, PlaceholderNode)
assert isinstance(node, Var)
has_parameter_call, parameter_args = self._collect_parameter_args(stmt.rvalue)
has_default = False
# Ensure that something like x: int = field() is rejected
# after an attribute with a default.
if has_parameter_call:
has_default = 'default' in parameter_args
# All other assignments are already type checked.
elif not isinstance(stmt.rvalue, TempNode):
has_default = True
if not has_default:
# Make all non-default task_on_kart attributes implicit because they are de-facto
# set on self in the generated __init__(), not in the class body. On the other
# hand, we don't know how custom task_on_kart transforms initialize attributes,
# so we don't treat them as implicit. This is required to support descriptors
# (https://github.com/python/mypy/issues/14868).
sym.implicit = True
current_attr_names.add(lhs.name)
with state.strict_optional_set(self._api.options.strict_optional):
init_type = sym.type
# infer Parameter type
if init_type is None:
init_type = self._infer_type_from_parameters(stmt.rvalue)
found_attrs[lhs.name] = TaskOnKartAttribute(
name=lhs.name,
has_default=has_default,
line=stmt.line,
column=stmt.column,
type=init_type,
info=cls.info,
api=self._api,
options=self._options,
)
return list(found_attrs.values())
def _collect_parameter_args(self, expr: Expression) -> tuple[bool, dict[str, Expression]]:
"""Returns a tuple where the first value represents whether or not
the expression is a call to luigi.Parameter() or gokart.TaskInstanceParameter()
and the second value is a dictionary of the keyword arguments that luigi.Parameter() or gokart.TaskInstanceParameter() was called with.
"""
if isinstance(expr, CallExpr) and isinstance(expr.callee, RefExpr):
args = {}
for name, arg in zip(expr.arg_names, expr.args, strict=False):
if name is None:
# NOTE: this is a workaround to get default value from a parameter
self._api.fail(
'Positional arguments are not allowed for parameters when using the mypy plugin. '
"Update your code to use named arguments, like luigi.Parameter(default='foo') instead of luigi.Parameter('foo')",
expr,
)
continue
args[name] = arg
return True, args
return False, {}
def _infer_type_from_parameters(self, parameter: Expression) -> Type | None:
"""
Generate the default type from a Parameter.
For example, when the parameter is `luigi.parameter.Parameter`, this method should return the `str` type.
"""
parameter_name = _extract_parameter_name(parameter)
if parameter_name is None:
return None
underlying_type: Type | None = None
if parameter_name in ['luigi.parameter.Parameter', 'luigi.parameter.OptionalParameter']:
underlying_type = self._api.named_type('builtins.str', [])
elif parameter_name in ['luigi.parameter.IntParameter', 'luigi.parameter.OptionalIntParameter']:
underlying_type = self._api.named_type('builtins.int', [])
elif parameter_name in ['luigi.parameter.FloatParameter', 'luigi.parameter.OptionalFloatParameter']:
underlying_type = self._api.named_type('builtins.float', [])
elif parameter_name in ['luigi.parameter.BoolParameter', 'luigi.parameter.OptionalBoolParameter']:
underlying_type = self._api.named_type('builtins.bool', [])
elif parameter_name in ['luigi.parameter.DateParameter', 'luigi.parameter.MonthParameter', 'luigi.parameter.YearParameter']:
underlying_type = self._api.named_type('datetime.date', [])
elif parameter_name in ['luigi.parameter.DateHourParameter', 'luigi.parameter.DateMinuteParameter', 'luigi.parameter.DateSecondParameter']:
underlying_type = self._api.named_type('datetime.datetime', [])
elif parameter_name in ['luigi.parameter.TimeDeltaParameter']:
underlying_type = self._api.named_type('datetime.timedelta', [])
elif parameter_name in ['luigi.parameter.DictParameter', 'luigi.parameter.OptionalDictParameter']:
underlying_type = self._api.named_type('builtins.dict', [AnyType(TypeOfAny.unannotated), AnyType(TypeOfAny.unannotated)])
elif parameter_name in ['luigi.parameter.ListParameter', 'luigi.parameter.OptionalListParameter']:
underlying_type = self._api.named_type('builtins.tuple', [AnyType(TypeOfAny.unannotated)])
elif parameter_name in ['luigi.parameter.TupleParameter', 'luigi.parameter.OptionalTupleParameter']:
underlying_type = self._api.named_type('builtins.tuple', [AnyType(TypeOfAny.unannotated)])
elif parameter_name in ['luigi.parameter.PathParameter', 'luigi.parameter.OptionalPathParameter']:
underlying_type = self._api.named_type('pathlib.Path', [])
elif parameter_name in ['gokart.parameter.TaskInstanceParameter']:
underlying_type = self._api.named_type('gokart.task.TaskOnKart', [AnyType(TypeOfAny.unannotated)])
elif parameter_name in ['gokart.parameter.ListTaskInstanceParameter']:
underlying_type = self._api.named_type('builtins.list', [self._api.named_type('gokart.task.TaskOnKart', [AnyType(TypeOfAny.unannotated)])])
elif parameter_name in ['gokart.parameter.ExplicitBoolParameter']:
underlying_type = self._api.named_type('builtins.bool', [])
elif parameter_name in ['luigi.parameter.NumericalParameter']:
underlying_type = self._get_type_from_args(parameter, 'var_type')
elif parameter_name in ['luigi.parameter.ChoiceParameter']:
underlying_type = self._get_type_from_args(parameter, 'var_type')
elif parameter_name in ['luigi.parameter.ChoiceListParameter']:
base_type = self._get_type_from_args(parameter, 'var_type')
if base_type is not None:
underlying_type = self._api.named_type('builtins.tuple', [base_type])
elif parameter_name in ['luigi.parameter.EnumParameter']:
underlying_type = self._get_type_from_args(parameter, 'enum')
elif parameter_name in ['luigi.parameter.EnumListParameter']:
base_type = self._get_type_from_args(parameter, 'enum')
if base_type is not None:
underlying_type = self._api.named_type('builtins.tuple', [base_type])
if underlying_type is None:
return None
# When the parameter is Optional, it can be a None value.
if 'Optional' in parameter_name:
return UnionType([underlying_type, NoneType()])
return underlying_type
def _get_type_from_args(self, parameter: Expression, arg_key: str) -> Type | None:
"""
Get a type from the parameter's arguments.
For example, when the parameter is `luigi.ChoiceParameter(var_type=int)`, this method should return the `int` type.
"""
ok, args = self._collect_parameter_args(parameter)
if not ok:
return None
if arg_key not in args:
return None
arg = args[arg_key]
if not isinstance(arg, NameExpr):
return None
if not isinstance(arg.node, TypeInfo):
return None
return Instance(arg.node, [])
def is_parameter_call(expr: Expression) -> bool:
"""Checks if the expression is a call to luigi.Parameter()"""
parameter_name = _extract_parameter_name(expr)
if parameter_name is None:
return False
return PARAMETER_FULLNAME_MATCHER.match(parameter_name) is not None
def _extract_parameter_name(expr: Expression) -> str | None:
"""Extract name if the expression is a call to luigi.Parameter()"""
if not isinstance(expr, CallExpr):
return None
callee = expr.callee
if isinstance(callee, MemberExpr):
type_info = callee.node
if type_info is None and isinstance(callee.expr, NameExpr):
return f'{callee.expr.name}.{callee.name}'
elif isinstance(callee, NameExpr):
type_info = callee.node
else:
return None
if isinstance(type_info, TypeInfo):
return type_info.fullname
# Currently, luigi doesn't provide py.typed; it will be included in the release after 3.5.1.
# https://github.com/spotify/luigi/pull/3297
# With code like the following, we can't resolve the parameter type correctly.
#
# from luigi import Parameter
# class MyTask(gokart.TaskOnKart):
# param = Parameter()
if isinstance(type_info, Var) and luigi.__version__ <= '3.5.1':
return type_info.name
return None
def plugin(version: str) -> type[Plugin]:
return TaskOnKartPlugin
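The plugin reads its options from `pyproject.toml` (`parse_config_file` above looks up the `[tool.gokart-mypy]` table). A minimal sketch of an enabling configuration; the `plugins` entry assumes the plugin is exposed as the `gokart.mypy` module, as the `plugin()` hook above suggests:

```toml
# Hypothetical pyproject.toml snippet enabling the plugin and its option.
[tool.mypy]
plugins = ["gokart.mypy"]

[tool.gokart-mypy]
# Error on required parameters missing from the constructor call (default: false).
disallow_missing_parameters = true
```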
================================================
FILE: gokart/object_storage.py
================================================
from __future__ import annotations
from datetime import datetime
from typing import cast
import luigi
import luigi.contrib.gcs
import luigi.contrib.s3
from luigi.format import Format
from gokart.gcs_config import GCSConfig
from gokart.gcs_zip_client import GCSZipClient
from gokart.s3_config import S3Config
from gokart.s3_zip_client import S3ZipClient
from gokart.zip_client import ZipClient
object_storage_path_prefix = ['s3://', 'gs://']
class ObjectStorage:
@staticmethod
def if_object_storage_path(path: str) -> bool:
for prefix in object_storage_path_prefix:
if path.startswith(prefix):
return True
return False
@staticmethod
def get_object_storage_target(path: str, format: Format) -> luigi.target.FileSystemTarget:
if path.startswith('s3://'):
return luigi.contrib.s3.S3Target(path, client=S3Config().get_s3_client(), format=format)
elif path.startswith('gs://'):
return luigi.contrib.gcs.GCSTarget(path, client=GCSConfig().get_gcs_client(), format=format)
else:
raise ValueError(f'Unsupported object storage path: {path}')
@staticmethod
def exists(path: str) -> bool:
if path.startswith('s3://'):
return cast(bool, S3Config().get_s3_client().exists(path))
elif path.startswith('gs://'):
return cast(bool, GCSConfig().get_gcs_client().exists(path))
else:
raise ValueError(f'Unsupported object storage path: {path}')
@staticmethod
def get_timestamp(path: str) -> datetime:
if path.startswith('s3://'):
return cast(datetime, S3Config().get_s3_client().get_key(path).last_modified)
elif path.startswith('gs://'):
# for gcs object
# should PR to luigi
bucket, obj = GCSConfig().get_gcs_client()._path_to_bucket_and_key(path)
result = GCSConfig().get_gcs_client().client.objects().get(bucket=bucket, object=obj).execute()
return cast(datetime, result['updated'])
else:
raise ValueError(f'Unsupported object storage path: {path}')
@staticmethod
def get_zip_client(file_path: str, temporary_directory: str) -> ZipClient:
if file_path.startswith('s3://'):
return S3ZipClient(file_path=file_path, temporary_directory=temporary_directory)
elif file_path.startswith('gs://'):
return GCSZipClient(file_path=file_path, temporary_directory=temporary_directory)
else:
raise ValueError(f'Unsupported object storage path: {file_path}')
@staticmethod
def is_buffered_reader(file: object) -> bool:
return not isinstance(file, luigi.contrib.s3.ReadableS3File)
================================================
FILE: gokart/pandas_type_config.py
================================================
from __future__ import annotations
from abc import abstractmethod
from logging import getLogger
from typing import Any
import luigi
import numpy as np
import pandas as pd
from luigi.task_register import Register
logger = getLogger(__name__)
class PandasTypeError(Exception):
"""Raised when the type of the pandas DataFrame column is not as expected."""
class PandasTypeConfig(luigi.Config):
@classmethod
@abstractmethod
def type_dict(cls) -> dict[str, Any]:
pass
@classmethod
def check(cls, df: pd.DataFrame) -> None:
for column_name, column_type in cls.type_dict().items():
cls._check_column(df, column_name, column_type)
@classmethod
def _check_column(cls, df: pd.DataFrame, column_name: str, column_type: type) -> None:
if column_name not in df.columns:
return
if not np.all(list(map(lambda x: isinstance(x, column_type), df[column_name]))):
not_match = next(filter(lambda x: not isinstance(x, column_type), df[column_name]))
raise PandasTypeError(f'expected type is "{column_type}", but "{type(not_match)}" is passed in column "{column_name}".')
class PandasTypeConfigMap(luigi.Config):
"""To initialize this class only once, this inherits luigi.Config."""
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)
task_names = Register.task_names()
task_classes = [Register.get_task_cls(task_name) for task_name in task_names]
self._map = {
task_class.task_namespace: task_class for task_class in task_classes if issubclass(task_class, PandasTypeConfig) and task_class != PandasTypeConfig
}
def check(self, obj: Any, task_namespace: str) -> None:
if isinstance(obj, pd.DataFrame) and task_namespace in self._map:
self._map[task_namespace].check(obj)
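The column check in `PandasTypeConfig` boils down to an `isinstance` scan over each configured column, skipping columns the frame does not have. A minimal sketch without pandas, using a plain dict of lists as a stand-in for a DataFrame (`check_column` is a hypothetical helper):

```python
def check_column(columns: dict, column_name: str, column_type: type) -> None:
    # Missing columns are skipped, as in PandasTypeConfig._check_column.
    if column_name not in columns:
        return
    for value in columns[column_name]:
        if not isinstance(value, column_type):
            raise TypeError(f'expected type is "{column_type}", but "{type(value)}" is passed in column "{column_name}".')
```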
================================================
FILE: gokart/parameter.py
================================================
from __future__ import annotations
import bz2
import datetime
import json
import sys
from logging import getLogger
from typing import Any, Generic, Protocol, TypeVar
if sys.version_info >= (3, 11):
from typing import Unpack
else:
from typing_extensions import Unpack
from warnings import warn
import luigi
from luigi import task_register
try:
from luigi.parameter import _no_value, _NoValueType, _ParameterKwargs
except ImportError:
_no_value = None # type: ignore[assignment]
_NoValueType = type(None) # type: ignore[assignment,misc]
_ParameterKwargs = dict # type: ignore[assignment,misc]
import gokart
logger = getLogger(__name__)
TASK_ON_KART_TYPE = TypeVar('TASK_ON_KART_TYPE', bound='gokart.TaskOnKart') # type: ignore
class TaskInstanceParameter(luigi.Parameter[TASK_ON_KART_TYPE], Generic[TASK_ON_KART_TYPE]):
def __init__(
self,
expected_type: type[TASK_ON_KART_TYPE] | None = None,
default: TASK_ON_KART_TYPE | _NoValueType = _no_value,
**kwargs: Unpack[_ParameterKwargs],
):
if expected_type is None:
self.expected_type: type = gokart.TaskOnKart
elif isinstance(expected_type, type):
self.expected_type = expected_type
else:
raise TypeError(f'expected_type must be a type, not {type(expected_type)}')
super().__init__(default=default, **kwargs)
@staticmethod
def _recursive(param_dict):
params = param_dict['params']
task_cls = task_register.Register.get_task_cls(param_dict['type'])
for key, value in task_cls.get_params():
if key in params:
params[key] = value.parse(params[key])
return task_cls(**params)
@staticmethod
def _recursive_decompress(s):
s = dict(luigi.DictParameter().parse(s))
if 'params' in s:
s['params'] = TaskInstanceParameter._recursive_decompress(bz2.decompress(bytes.fromhex(s['params'])).decode())
return s
def parse(self, s):
if isinstance(s, str):
s = self._recursive_decompress(s)
return self._recursive(s)
def serialize(self, x):
params = bz2.compress(json.dumps(x.to_str_params(only_significant=True)).encode()).hex()
values = dict(type=x.get_task_family(), params=params)
return luigi.DictParameter().serialize(values)
def _warn_on_wrong_param_type(self, param_name, param_value):
if not isinstance(param_value, self.expected_type):
raise TypeError(f'{param_value} is not an instance of {self.expected_type}')
class _TaskInstanceEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, luigi.Task):
return TaskInstanceParameter().serialize(obj)
# Let the base class default method raise the TypeError
return json.JSONEncoder.default(self, obj)
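`TaskInstanceParameter.serialize` embeds task parameters as JSON that is bz2-compressed and hex-encoded, so nested task instances stay representable inside a `luigi.DictParameter`. A standalone sketch of that round trip (helper names are made up):

```python
import bz2
import json


def compress_params(params: dict) -> str:
    # JSON-encode, bz2-compress, then hex-encode, as serialize() does.
    return bz2.compress(json.dumps(params).encode()).hex()


def decompress_params(s: str) -> dict:
    # Reverse of compress_params, as _recursive_decompress does.
    return json.loads(bz2.decompress(bytes.fromhex(s)).decode())
```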
class ListTaskInstanceParameter(luigi.Parameter[list[TASK_ON_KART_TYPE]], Generic[TASK_ON_KART_TYPE]):
def __init__(
self,
expected_elements_type: type[TASK_ON_KART_TYPE] | None = None,
default: list[TASK_ON_KART_TYPE] | _NoValueType = _no_value,
**kwargs: Unpack[_ParameterKwargs],
):
if expected_elements_type is None:
self.expected_elements_type: type = gokart.TaskOnKart
elif isinstance(expected_elements_type, type):
self.expected_elements_type = expected_elements_type
else:
raise TypeError(f'expected_elements_type must be a type, not {type(expected_elements_type)}')
super().__init__(default=default, **kwargs)
def parse(self, s):
return [TaskInstanceParameter().parse(x) for x in list(json.loads(s))]
def serialize(self, x):
return json.dumps(x, cls=_TaskInstanceEncoder)
def _warn_on_wrong_param_type(self, param_name, param_value):
for v in param_value:
if not isinstance(v, self.expected_elements_type):
raise TypeError(f'{v} is not an instance of {self.expected_elements_type}')
class ExplicitBoolParameter(luigi.BoolParameter):
def __init__(self, *args, **kwargs):
luigi.Parameter.__init__(self, *args, **kwargs)
def _parser_kwargs(self, *args, **kwargs): # type: ignore
return luigi.Parameter._parser_kwargs(*args, **kwargs)
T = TypeVar('T')
class Serializable(Protocol):
def gokart_serialize(self) -> str:
"""Implement this method to serialize the object as an parameter
You can omit some fields from results of serialization if you want to ignore changes of them
"""
...
@classmethod
def gokart_deserialize(cls: type[T], s: str) -> T:
"""Implement this method to deserialize the object from a string"""
...
S = TypeVar('S', bound=Serializable)
class SerializableParameter(luigi.Parameter[S], Generic[S]):
def __init__(self, object_type: type[S], *args: Any, **kwargs: Any) -> None:
self._object_type = object_type
super().__init__(*args, **kwargs)
def parse(self, s: str) -> S:
return self._object_type.gokart_deserialize(s)
def serialize(self, x: S) -> str:
return x.gokart_serialize()
class ZonedDateSecondParameter(luigi.Parameter[datetime.datetime]):
"""
ZonedDateSecondParameter supports a datetime.datetime object with timezone information.
A ZonedDateSecondParameter is an `ISO 8601 <http://en.wikipedia.org/wiki/ISO_8601>`_ formatted
date and time, specified to the second, with timezone information. For example, ``2013-07-10T19:07:38+09:00`` specifies July 10, 2013 at
19:07:38 +09:00. The ``:`` separator in the offset can be omitted in Python 3.11 and later.
"""
def __init__(self, **kwargs):
super().__init__(**kwargs)
def parse(self, s):
# The trailing 'Z' is replaced with '+00:00' because only Python 3.11 and later
# accept 'Z' at the end of the string in datetime.fromisoformat.
if s.endswith('Z'):
s = s[:-1] + '+00:00'
dt = datetime.datetime.fromisoformat(s)
if dt.tzinfo is None:
warn('The input does not have timezone information. Please consider using luigi.DateSecondParameter instead.', stacklevel=1)
return dt
def serialize(self, dt):
return dt.isoformat()
def normalize(self, dt):
# Override _DatetimeParameterBase.normalize so that nothing is normalized except removing the microsecond.
# The microsecond is removed because its number of digits is not fixed.
# See also luigi's implementation https://github.com/spotify/luigi/blob/v3.6.0/luigi/parameter.py#L612
return dt.replace(microsecond=0)
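The parse/normalize behavior of `ZonedDateSecondParameter` can be sketched standalone: rewrite a trailing `Z` to `+00:00` for pre-3.11 compatibility, then drop the microsecond (`parse_zoned_second` is a hypothetical helper combining the two steps):

```python
import datetime


def parse_zoned_second(s: str) -> datetime.datetime:
    # 'Z' is only accepted by fromisoformat in Python 3.11+, so rewrite it.
    if s.endswith('Z'):
        s = s[:-1] + '+00:00'
    # Normalize by dropping the microsecond, as normalize() does.
    return datetime.datetime.fromisoformat(s).replace(microsecond=0)
```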
================================================
FILE: gokart/py.typed
================================================
================================================
FILE: gokart/required_task_output.py
================================================
from dataclasses import dataclass
@dataclass
class RequiredTaskOutput:
task_name: str
output_path: str
def serialize(self) -> dict[str, str]:
return {'__gokart_task_name': self.task_name, '__gokart_output_path': self.output_path}
================================================
FILE: gokart/run.py
================================================
from __future__ import annotations
import logging
import os
import sys
from logging import getLogger
from typing import Any
import luigi
import luigi.cmdline
import luigi.cmdline_parser
import luigi.execution_summary
import luigi.interface
import luigi.retcodes
import luigi.setup_logging
from luigi.cmdline_parser import CmdlineParser
import gokart
import gokart.slack
from gokart.build import WorkerSchedulerFactory
from gokart.object_storage import ObjectStorage
logger = getLogger(__name__)
def _run_tree_info(cmdline_args, details):
with CmdlineParser.global_instance(cmdline_args) as cp:
gokart.tree_info().output().dump(gokart.make_tree_info(cp.get_task_obj(), details=details))
def _try_tree_info(cmdline_args):
with CmdlineParser.global_instance(cmdline_args):
mode = gokart.tree_info().mode
output_path = gokart.tree_info().output().path()
# do nothing if `mode` is empty.
if mode == '':
return
# output tree info and exit.
if mode == 'simple':
_run_tree_info(cmdline_args, details=False)
elif mode == 'all':
_run_tree_info(cmdline_args, details=True)
else:
raise ValueError(f'--tree-info-mode must be "simple" or "all", but "{mode}" is passed.')
logger.info(f'output tree info: {output_path}')
sys.exit()
def _try_to_delete_unnecessary_output_file(cmdline_args: list[str]) -> None:
with CmdlineParser.global_instance(cmdline_args) as cp:
task: gokart.TaskOnKart[Any] = cp.get_task_obj()
if task.delete_unnecessary_output_files:
if ObjectStorage.if_object_storage_path(task.workspace_directory):
logger.info('delete-unnecessary-output-files does not support s3/gcs.')
else:
gokart.delete_local_unnecessary_outputs(task)
sys.exit()
def _try_get_slack_api(cmdline_args: list[str]) -> gokart.slack.SlackAPI | None:
with CmdlineParser.global_instance(cmdline_args):
config = gokart.slack.SlackConfig()
token = os.getenv(config.token_name, '')
channel = config.channel
to_user = config.to_user
if token and channel:
logger.info('Slack notification is activated.')
return gokart.slack.SlackAPI(token=token, channel=channel, to_user=to_user)
logger.info('Slack notification is not activated.')
return None
def _try_to_send_event_summary_to_slack(
slack_api: gokart.slack.SlackAPI | None, event_aggregator: gokart.slack.EventAggregator, cmdline_args: list[str]
) -> None:
if slack_api is None:
# do nothing
return
options = gokart.slack.SlackConfig()
with CmdlineParser.global_instance(cmdline_args) as cp:
task = cp.get_task_obj()
tree_info = gokart.make_tree_info(task, details=True) if options.send_tree_info else 'Please add SlackConfig.send_tree_info to include tree-info'
task_name = type(task).__name__
comment = f'Report of {task_name}' + os.linesep + event_aggregator.get_summary()
content = os.linesep.join(['==== Event List ====', event_aggregator.get_event_list(), os.linesep, '==== Tree Info ====', tree_info])
slack_api.send_snippet(comment=comment, title='event.txt', content=content)
def _run_with_retcodes(argv):
"""run_with_retcodes equivalent that uses gokart's WorkerSchedulerFactory."""
retcode_logger = logging.getLogger('luigi-interface')
with luigi.cmdline_parser.CmdlineParser.global_instance(argv):
retcodes = luigi.retcodes.retcode()
worker = None
try:
worker = luigi.interface._run(argv, worker_scheduler_factory=WorkerSchedulerFactory()).worker
except luigi.interface.PidLockAlreadyTakenExit:
sys.exit(retcodes.already_running)
except Exception:
env_params = luigi.interface.core()
luigi.setup_logging.InterfaceLogging.setup(env_params)
retcode_logger.exception('Uncaught exception in luigi')
sys.exit(retcodes.unhandled_exception)
with luigi.cmdline_parser.CmdlineParser.global_instance(argv):
task_sets = luigi.execution_summary._summary_dict(worker)
root_task = luigi.execution_summary._root_task(worker)
non_empty_categories = {k: v for k, v in task_sets.items() if v}.keys()
def has(status):
assert status in luigi.execution_summary._ORDERED_STATUSES
return status in non_empty_categories
codes_and_conds = (
(retcodes.missing_data, has('still_pending_ext')),
(retcodes.task_failed, has('failed')),
(retcodes.already_running, has('run_by_other_worker')),
(retcodes.scheduling_error, has('scheduling_error')),
(retcodes.not_run, has('not_run')),
)
expected_ret_code = max(code * (1 if cond else 0) for code, cond in codes_and_conds)
if expected_ret_code == 0 and root_task not in task_sets['completed'] and root_task not in task_sets['already_done']:
sys.exit(retcodes.not_run)
else:
sys.exit(expected_ret_code)
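The return-code selection above picks the highest code whose condition holds, with 0 meaning no configured condition fired. A minimal sketch of that rule (`select_retcode` is a hypothetical helper; the codes are illustrative):

```python
def select_retcode(codes_and_conds) -> int:
    # Each pair is (return code, condition); a false condition zeroes its
    # code, so max() yields the highest triggered code, or 0 if none fired.
    return max(code * (1 if cond else 0) for code, cond in codes_and_conds)
```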
def run(cmdline_args=None, set_retcode=True):
cmdline_args = cmdline_args or sys.argv[1:]
if set_retcode:
luigi.retcodes.retcode.already_running = 10 # type: ignore
luigi.retcodes.retcode.missing_data = 20 # type: ignore
luigi.retcodes.retcode.not_run = 30 # type: ignore
luigi.retcodes.retcode.task_failed = 40 # type: ignore
luigi.retcodes.retcode.scheduling_error = 50 # type: ignore
_try_tree_info(cmdline_args)
_try_to_delete_unnecessary_output_file(cmdline_args)
gokart.testing.try_to_run_test_for_empty_data_frame(cmdline_args)
slack_api = _try_get_slack_api(cmdline_args)
event_aggregator = gokart.slack.EventAggregator()
try:
event_aggregator.set_handlers()
_run_with_retcodes(cmdline_args)
except SystemExit as e:
_try_to_send_event_summary_to_slack(slack_api, event_aggregator, cmdline_args)
sys.exit(e.code)
================================================
FILE: gokart/s3_config.py
================================================
from __future__ import annotations
import os
import luigi
import luigi.contrib.s3
class S3Config(luigi.Config):
aws_access_key_id_name = luigi.Parameter(default='AWS_ACCESS_KEY_ID', description='AWS access key id environment variable.')
aws_secret_access_key_name = luigi.Parameter(default='AWS_SECRET_ACCESS_KEY', description='AWS secret access key environment variable.')
_client = None
def get_s3_client(self) -> luigi.contrib.s3.S3Client:
if self._client is None: # use cache as like singleton object
self._client = self._get_s3_client()
return self._client
def _get_s3_client(self) -> luigi.contrib.s3.S3Client:
return luigi.contrib.s3.S3Client(
aws_access_key_id=os.environ.get(self.aws_access_key_id_name), aws_secret_access_key=os.environ.get(self.aws_secret_access_key_name)
)
================================================
FILE: gokart/s3_zip_client.py
================================================
from __future__ import annotations
import os
import shutil
from typing import cast
from gokart.s3_config import S3Config
from gokart.zip_client import ZipClient, _unzip_file
class S3ZipClient(ZipClient):
def __init__(self, file_path: str, temporary_directory: str) -> None:
self._file_path = file_path
self._temporary_directory = temporary_directory
self._client = S3Config().get_s3_client()
def exists(self) -> bool:
return cast(bool, self._client.exists(self._file_path))
def make_archive(self) -> None:
extension = os.path.splitext(self._file_path)[1]
if not os.path.exists(self._temporary_directory):
# Check path existence since shutil.make_archive() of Python 3.10+ does not check it.
raise FileNotFoundError(f'Temporary directory {self._temporary_directory} is not found.')
shutil.make_archive(base_name=self._temporary_directory, format=extension[1:], root_dir=self._temporary_directory)
self._client.put(self._temporary_file_path(), self._file_path)
def unpack_archive(self) -> None:
os.makedirs(self._temporary_directory, exist_ok=True)
self._client.get(self._file_path, self._temporary_file_path())
_unzip_file(fp=self._temporary_file_path(), extract_dir=self._temporary_directory)
def remove(self) -> None:
self._client.remove(self._file_path)
@property
def path(self) -> str:
return self._file_path
def _temporary_file_path(self):
extension = os.path.splitext(self._file_path)[1]
base_name = self._temporary_directory
if base_name.endswith('/'):
base_name = base_name[:-1]
return base_name + extension
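Minus the S3 transfer, `S3ZipClient`'s archive handling is a `shutil` round trip: pack a directory into an archive, then unpack it elsewhere. A local-only sketch (`archive_roundtrip` is a hypothetical helper):

```python
import shutil


def archive_roundtrip(src_dir: str, dst_dir: str) -> str:
    # Strip a trailing slash, mirroring _temporary_file_path's handling.
    base = src_dir.rstrip('/')
    # make_archive writes base + '.zip' and returns its path.
    archive = shutil.make_archive(base_name=base, format='zip', root_dir=src_dir)
    # unpack_archive restores the contents, as unpack_archive() above does.
    shutil.unpack_archive(archive, extract_dir=dst_dir)
    return archive
```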
================================================
FILE: gokart/slack/__init__.py
================================================
from gokart.slack.event_aggregator import EventAggregator
from gokart.slack.slack_api import SlackAPI
from gokart.slack.slack_config import SlackConfig
from .slack_api import ChannelListNotLoadedError, ChannelNotFoundError, FileNotUploadedError
__all__ = [
'ChannelListNotLoadedError',
'ChannelNotFoundError',
'FileNotUploadedError',
'EventAggregator',
'SlackAPI',
'SlackConfig',
]
================================================
FILE: gokart/slack/event_aggregator.py
================================================
from __future__ import annotations
import os
from logging import getLogger
from typing import Any, TypedDict
import luigi
logger = getLogger(__name__)
class FailureEvent(TypedDict):
task: str
exception: str
class EventAggregator:
def __init__(self) -> None:
self._success_events: list[str] = []
self._failure_events: list[FailureEvent] = []
def set_handlers(self):
handlers = [(luigi.Event.SUCCESS, self._success), (luigi.Event.FAILURE, self._failure)]
for event, handler in handlers:
luigi.Task.event_handler(event)(handler)
def get_summary(self) -> str:
return f'Success: {len(self._success_events)}; Failure: {len(self._failure_events)}'
def get_event_list(self) -> str:
message = ''
if len(self._failure_events) != 0:
failure_message = os.linesep.join([f'Task: {failure["task"]}; Exception: {failure["exception"]}' for failure in self._failure_events])
message += '---- Failure Tasks ----' + os.linesep + failure_message
if len(self._success_events) != 0:
success_message = os.linesep.join(self._success_events)
message += '---- Success Tasks ----' + os.linesep + success_message
if message == '':
message = 'Tasks were not executed.'
return message
def _success(self, task):
self._success_events.append(self._task_to_str(task))
def _failure(self, task, exception):
failure: FailureEvent = {'task': self._task_to_str(task), 'exception': str(exception)}
self._failure_events.append(failure)
@staticmethod
def _task_to_str(task: Any) -> str:
return f'{type(task).__name__}:[{task.make_unique_id()}]'
================================================
FILE: gokart/slack/slack_api.py
================================================
from __future__ import annotations
from logging import getLogger
import slack_sdk
logger = getLogger(__name__)
class ChannelListNotLoadedError(RuntimeError):
pass
class ChannelNotFoundError(RuntimeError):
pass
class FileNotUploadedError(RuntimeError):
pass
class SlackAPI:
def __init__(self, token: str, channel: str, to_user: str) -> None:
self._client = slack_sdk.WebClient(token=token)
self._channel_id = self._get_channel_id(channel)
self._to_user = to_user if to_user == '' or to_user.startswith('@') else '@' + to_user
def _get_channel_id(self, channel_name):
params = {'exclude_archived': True, 'limit': 100}
try:
for channels in self._client.conversations_list(params=params):
if not channels:
raise ChannelListNotLoadedError('Channel list is empty.')
for channel in channels.get('channels', []):
if channel['name'] == channel_name:
return channel['id']
raise ChannelNotFoundError(f'Channel {channel_name} is not found in public channels.')
except Exception as e:
logger.warning(f'The job will start without slack notification: {e}')
def send_snippet(self, comment, title, content):
try:
request_body = dict(
channels=self._channel_id, initial_comment=f'<{self._to_user}> {comment}' if self._to_user else comment, content=content, title=title
)
response = self._client.api_call('files.upload', data=request_body)
if not response['ok']:
raise FileNotUploadedError(f'Error while uploading file. The error reason is "{response["error"]}".')
except Exception as e:
logger.warning(f'Failed to send slack notification: {e}')
==================
gitextract_mhb_xs2d/
├── .github/
│   ├── CODEOWNERS
│   └── workflows/
│       ├── format.yml
│       ├── publish.yml
│       └── test.yml
├── .gitignore
├── .readthedocs.yaml
├── LICENSE
├── README.md
├── docs/
│   ├── Makefile
│   ├── conf.py
│   ├── efficient_run_on_multi_workers.rst
│   ├── for_pandas.rst
│   ├── gokart.rst
│   ├── index.rst
│   ├── intro_to_gokart.rst
│   ├── logging.rst
│   ├── make.bat
│   ├── mypy_plugin.rst
│   ├── polars.rst
│   ├── requirements.txt
│   ├── setting_task_parameters.rst
│   ├── slack_notification.rst
│   ├── task_information.rst
│   ├── task_on_kart.rst
│   ├── task_parameters.rst
│   ├── task_settings.rst
│   ├── tutorial.rst
│   └── using_task_task_conflict_prevention_lock.rst
├── examples/
│   ├── gokart_notebook_example.ipynb
│   ├── logging.ini
│   └── param.ini
├── gokart/
│   ├── __init__.py
│   ├── build.py
│   ├── config_params.py
│   ├── conflict_prevention_lock/
│   │   ├── task_lock.py
│   │   └── task_lock_wrappers.py
│   ├── errors/
│   │   └── __init__.py
│   ├── file_processor/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── pandas.py
│   │   └── polars.py
│   ├── file_processor.py
│   ├── gcs_config.py
│   ├── gcs_obj_metadata_client.py
│   ├── gcs_zip_client.py
│   ├── in_memory/
│   │   ├── __init__.py
│   │   ├── data.py
│   │   ├── repository.py
│   │   └── target.py
│   ├── info.py
│   ├── mypy.py
│   ├── object_storage.py
│   ├── pandas_type_config.py
│   ├── parameter.py
│   ├── py.typed
│   ├── required_task_output.py
│   ├── run.py
│   ├── s3_config.py
│   ├── s3_zip_client.py
│   ├── slack/
│   │   ├── __init__.py
│   │   ├── event_aggregator.py
│   │   ├── slack_api.py
│   │   └── slack_config.py
│   ├── target.py
│   ├── task.py
│   ├── task_complete_check.py
│   ├── testing/
│   │   ├── __init__.py
│   │   ├── check_if_run_with_empty_data_frame.py
│   │   └── pandas_assert.py
│   ├── tree/
│   │   ├── task_info.py
│   │   └── task_info_formatter.py
│   ├── utils.py
│   ├── worker.py
│   ├── workspace_management.py
│   ├── zip_client.py
│   └── zip_client_util.py
├── luigi.cfg
├── pyproject.toml
├── test/
│   ├── __init__.py
│   ├── config/
│   │   ├── __init__.py
│   │   ├── pyproject.toml
│   │   ├── pyproject_disallow_missing_parameters.toml
│   │   └── test_config.ini
│   ├── conflict_prevention_lock/
│   │   ├── __init__.py
│   │   ├── test_task_lock.py
│   │   └── test_task_lock_wrappers.py
│   ├── file_processor/
│   │   ├── __init__.py
│   │   ├── test_base.py
│   │   ├── test_factory.py
│   │   ├── test_pandas.py
│   │   └── test_polars.py
│   ├── in_memory/
│   │   ├── test_in_memory_target.py
│   │   └── test_repository.py
│   ├── slack/
│   │   ├── __init__.py
│   │   └── test_slack_api.py
│   ├── test_build.py
│   ├── test_cache_unique_id.py
│   ├── test_config_params.py
│   ├── test_explicit_bool_parameter.py
│   ├── test_gcs_config.py
│   ├── test_gcs_obj_metadata_client.py
│   ├── test_info.py
│   ├── test_large_data_fram_processor.py
│   ├── test_list_task_instance_parameter.py
│   ├── test_mypy.py
│   ├── test_pandas_type_check_framework.py
│   ├── test_pandas_type_config.py
│   ├── test_restore_task_by_id.py
│   ├── test_run.py
│   ├── test_s3_config.py
│   ├── test_s3_zip_client.py
│   ├── test_serializable_parameter.py
│   ├── test_target.py
│   ├── test_task_instance_parameter.py
│   ├── test_task_on_kart.py
│   ├── test_utils.py
│   ├── test_worker.py
│   ├── test_zoned_date_second_parameter.py
│   ├── testing/
│   │   ├── __init__.py
│   │   └── test_pandas_assert.py
│   ├── tree/
│   │   ├── __init__.py
│   │   ├── test_task_info.py
│   │   └── test_task_info_formatter.py
│   └── util.py
└── tox.ini
SYMBOL INDEX (1009 symbols across 74 files)
FILE: gokart/build.py
class LoggerConfig (line 27) | class LoggerConfig:
method __init__ (line 28) | def __init__(self, level: int):
method __enter__ (line 33) | def __enter__(self):
method __exit__ (line 38) | def __exit__(self, exception_type, exception_value, traceback):
class GokartBuildError (line 43) | class GokartBuildError(Exception):
method __init__ (line 46) | def __init__(self, message: str, raised_exceptions: dict[str, list[Exc...
class HasLockedTaskException (line 51) | class HasLockedTaskException(Exception):
class TaskLockExceptionRaisedFlag (line 55) | class TaskLockExceptionRaisedFlag:
method __init__ (line 56) | def __init__(self):
class WorkerProtocol (line 60) | class WorkerProtocol(Protocol):
method add (line 65) | def add(self, task: TaskOnKart[Any]) -> bool: ...
method run (line 67) | def run(self) -> bool: ...
method __enter__ (line 69) | def __enter__(self) -> WorkerProtocol: ...
method __exit__ (line 71) | def __exit__(self, type: Any, value: Any, traceback: Any) -> Literal[F...
class WorkerSchedulerFactory (line 74) | class WorkerSchedulerFactory:
method create_local_scheduler (line 75) | def create_local_scheduler(self) -> scheduler.Scheduler:
method create_remote_scheduler (line 78) | def create_remote_scheduler(self, url: str) -> rpc.RemoteScheduler:
method create_worker (line 81) | def create_worker(self, scheduler: scheduler.Scheduler, worker_process...
function _get_output (line 85) | def _get_output(task: TaskOnKart[T]) -> T:
function _reset_register (line 97) | def _reset_register(keep={'gokart', 'luigi'}):
class TaskDumpMode (line 109) | class TaskDumpMode(enum.Enum):
class TaskDumpOutputType (line 115) | class TaskDumpOutputType(enum.Enum):
class TaskDumpConfig (line 122) | class TaskDumpConfig:
function process_task_info (line 127) | def process_task_info(task: TaskOnKart[Any], task_dump_config: TaskDumpC...
function build (line 151) | def build(
function build (line 163) | def build(
function build (line 174) | def build(
FILE: gokart/config_params.py
class inherits_config_params (line 10) | class inherits_config_params:
method __init__ (line 11) | def __init__(self, config_class: type[luigi.Config], parameter_alias: ...
method __call__ (line 23) | def __call__(self, task_class: type[gokart.TaskOnKart[Any]]) -> type[g...
FILE: gokart/conflict_prevention_lock/task_lock.py
class TaskLockParams (line 14) | class TaskLockParams(NamedTuple):
class TaskLockException (line 24) | class TaskLockException(Exception):
class RedisClient (line 29) | class RedisClient:
method __new__ (line 32) | def __new__(cls, *args, **kwargs):
method __init__ (line 40) | def __init__(self, host: str | None, port: int | None) -> None:
method get_redis_client (line 46) | def get_redis_client(self):
function _extend_lock (line 50) | def _extend_lock(task_lock: redis.lock.Lock, redis_timeout: int) -> None:
function set_task_lock (line 54) | def set_task_lock(task_lock_params: TaskLockParams) -> redis.lock.Lock:
function set_lock_scheduler (line 63) | def set_lock_scheduler(task_lock: redis.lock.Lock, task_lock_params: Tas...
function make_task_lock_key (line 78) | def make_task_lock_key(file_path: str, unique_id: str | None) -> str:
function make_task_lock_params (line 83) | def make_task_lock_params(
function make_task_lock_params_for_run (line 108) | def make_task_lock_params_for_run(task_self: Any, lock_extend_seconds: i...
FILE: gokart/conflict_prevention_lock/task_lock_wrappers.py
function wrap_dump_with_lock (line 17) | def wrap_dump_with_lock(func: Callable[P, R], task_lock_params: TaskLock...
function wrap_load_with_lock (line 43) | def wrap_load_with_lock(func: Callable[P, R], task_lock_params: TaskLock...
function wrap_remove_with_lock (line 66) | def wrap_remove_with_lock(func: Callable[P, R], task_lock_params: TaskLo...
function wrap_run_with_lock (line 94) | def wrap_run_with_lock(run_func: Callable[[], R], task_lock_params: Task...
FILE: gokart/file_processor/__init__.py
class CsvFileProcessor (line 35) | class CsvFileProcessor(FileProcessor):
method __init__ (line 38) | def __init__(self, sep: str = ',', encoding: str = 'utf-8', dataframe_...
method format (line 58) | def format(self):
method load (line 61) | def load(self, file):
method dump (line 64) | def dump(self, obj, file):
class JsonFileProcessor (line 68) | class JsonFileProcessor(FileProcessor):
method __init__ (line 71) | def __init__(self, orient: Literal['split', 'records', 'index', 'table...
method format (line 89) | def format(self):
method load (line 92) | def load(self, file):
method dump (line 95) | def dump(self, obj, file):
class ParquetFileProcessor (line 99) | class ParquetFileProcessor(FileProcessor):
method __init__ (line 102) | def __init__(self, engine: Any = 'pyarrow', compression: Any = None, d...
method format (line 122) | def format(self):
method load (line 125) | def load(self, file):
method dump (line 128) | def dump(self, obj, file):
class FeatherFileProcessor (line 133) | class FeatherFileProcessor(FileProcessor):
method __init__ (line 136) | def __init__(self, store_index_in_feather: bool, dataframe_type: DataF...
method format (line 154) | def format(self):
method load (line 157) | def load(self, file):
method dump (line 160) | def dump(self, obj, file):
function make_file_processor (line 165) | def make_file_processor(file_path: str, store_index_in_feather: bool = T...
FILE: gokart/file_processor/base.py
class FileProcessor (line 22) | class FileProcessor:
method format (line 24) | def format(self) -> Any: ...
method load (line 27) | def load(self, file: Any) -> Any: ...
method dump (line 30) | def dump(self, obj: Any, file: Any) -> None: ...
class BinaryFileProcessor (line 33) | class BinaryFileProcessor(FileProcessor):
method format (line 45) | def format(self):
method load (line 48) | def load(self, file):
method dump (line 51) | def dump(self, obj, file):
class _ChunkedLargeFileReader (line 55) | class _ChunkedLargeFileReader:
method __init__ (line 56) | def __init__(self, file: Any) -> None:
method __getattr__ (line 59) | def __getattr__(self, item):
method read (line 62) | def read(self, n: int) -> bytes:
method readline (line 76) | def readline(self) -> bytes:
method seek (line 79) | def seek(self, offset: int) -> None:
method seekable (line 82) | def seekable(self) -> bool:
class PickleFileProcessor (line 86) | class PickleFileProcessor(FileProcessor):
method format (line 87) | def format(self):
method load (line 90) | def load(self, file):
method dump (line 98) | def dump(self, obj, file):
method _write (line 102) | def _write(buffer, file):
class TextFileProcessor (line 114) | class TextFileProcessor(FileProcessor):
method format (line 115) | def format(self):
method load (line 118) | def load(self, file):
method dump (line 121) | def dump(self, obj, file):
class GzipFileProcessor (line 129) | class GzipFileProcessor(FileProcessor):
method format (line 130) | def format(self):
method load (line 133) | def load(self, file):
method dump (line 136) | def dump(self, obj, file):
class XmlFileProcessor (line 144) | class XmlFileProcessor(FileProcessor):
method format (line 145) | def format(self):
method load (line 148) | def load(self, file):
method dump (line 154) | def dump(self, obj, file):
class NpzFileProcessor (line 159) | class NpzFileProcessor(FileProcessor):
method format (line 160) | def format(self):
method load (line 163) | def load(self, file):
method dump (line 166) | def dump(self, obj, file):
FILE: gokart/file_processor/pandas.py
class CsvFileProcessorPandas (line 17) | class CsvFileProcessorPandas(FileProcessor):
method __init__ (line 20) | def __init__(self, sep: str = ',', encoding: str = 'utf-8') -> None:
method format (line 25) | def format(self):
method load (line 28) | def load(self, file):
method dump (line 34) | def dump(self, obj, file):
class JsonFileProcessorPandas (line 43) | class JsonFileProcessorPandas(FileProcessor):
method __init__ (line 46) | def __init__(self, orient: _JsonOrient | None = None):
method format (line 49) | def format(self):
method load (line 52) | def load(self, file):
method dump (line 58) | def dump(self, obj, file):
class ParquetFileProcessorPandas (line 66) | class ParquetFileProcessorPandas(FileProcessor):
method __init__ (line 69) | def __init__(self, engine: Literal['auto', 'pyarrow', 'fastparquet'] =...
method format (line 74) | def format(self):
method load (line 77) | def load(self, file):
method dump (line 87) | def dump(self, obj, file):
class FeatherFileProcessorPandas (line 94) | class FeatherFileProcessorPandas(FileProcessor):
method __init__ (line 97) | def __init__(self, store_index_in_feather: bool):
method format (line 102) | def format(self):
method load (line 105) | def load(self, file):
method dump (line 127) | def dump(self, obj, file):
FILE: gokart/file_processor/polars.py
class CsvFileProcessorPolars (line 29) | class CsvFileProcessorPolars(FileProcessor):
method __init__ (line 32) | def __init__(self, sep: str = ',', encoding: str = 'utf-8', lazy: bool...
method format (line 40) | def format(self):
method load (line 43) | def load(self, file):
method dump (line 57) | def dump(self, obj, file):
class JsonFileProcessorPolars (line 65) | class JsonFileProcessorPolars(FileProcessor):
method __init__ (line 68) | def __init__(self, orient: str | None = None, lazy: bool = False):
method format (line 74) | def format(self):
method load (line 77) | def load(self, file):
method dump (line 93) | def dump(self, obj, file):
class ParquetFileProcessorPolars (line 104) | class ParquetFileProcessorPolars(FileProcessor):
method __init__ (line 107) | def __init__(self, engine: str = 'pyarrow', compression: _ParquetCompr...
method format (line 115) | def format(self):
method load (line 118) | def load(self, file):
method dump (line 131) | def dump(self, obj, file):
class FeatherFileProcessorPolars (line 140) | class FeatherFileProcessorPolars(FileProcessor):
method __init__ (line 143) | def __init__(self, store_index_in_feather: bool, lazy: bool = False):
method format (line 150) | def format(self):
method load (line 153) | def load(self, file):
method dump (line 166) | def dump(self, obj, file):
FILE: gokart/gcs_config.py
class GCSConfig (line 12) | class GCSConfig(luigi.Config):
method get_gcs_client (line 16) | def get_gcs_client(self) -> luigi.contrib.gcs.GCSClient:
method _get_gcs_client (line 21) | def _get_gcs_client(self) -> luigi.contrib.gcs.GCSClient:
method _load_oauth_credentials (line 24) | def _load_oauth_credentials(self) -> Credentials | None:
FILE: gokart/gcs_obj_metadata_client.py
class GCSObjectMetadataClient (line 21) | class GCSObjectMetadataClient:
method _is_log_related_path (line 31) | def _is_log_related_path(path: str) -> bool:
method _path_to_bucket_and_key (line 36) | def _path_to_bucket_and_key(path: str) -> tuple[str, str]:
method add_task_state_labels (line 43) | def add_task_state_labels(
method _normalize_labels (line 89) | def _normalize_labels(labels: dict[str, Any] | None) -> dict[str, str]:
method _get_patched_obj_metadata (line 93) | def _get_patched_obj_metadata(
method _get_serialized_string (line 120) | def _get_serialized_string(required_task_outputs: FlattenableItems[Req...
method _merge_custom_labels_and_task_params_labels (line 134) | def _merge_custom_labels_and_task_params_labels(
method _adjust_gcs_metadata_limit_size (line 154) | def _adjust_gcs_metadata_limit_size(_labels: dict[str, str]) -> dict[s...
FILE: gokart/gcs_zip_client.py
class GCSZipClient (line 11) | class GCSZipClient(ZipClient):
method __init__ (line 12) | def __init__(self, file_path: str, temporary_directory: str) -> None:
method exists (line 17) | def exists(self) -> bool:
method make_archive (line 20) | def make_archive(self) -> None:
method unpack_archive (line 25) | def unpack_archive(self) -> None:
method remove (line 30) | def remove(self) -> None:
method path (line 34) | def path(self) -> str:
method _temporary_file_path (line 37) | def _temporary_file_path(self):
FILE: gokart/in_memory/data.py
class InMemoryData (line 9) | class InMemoryData:
method create_data (line 14) | def create_data(self, value: Any) -> InMemoryData:
FILE: gokart/in_memory/repository.py
class InMemoryCacheRepository (line 10) | class InMemoryCacheRepository:
method __init__ (line 13) | def __init__(self):
method get_value (line 16) | def get_value(self, key: str) -> Any:
method get_last_modification_time (line 19) | def get_last_modification_time(self, key: str) -> datetime:
method _get_data (line 22) | def _get_data(self, key: str) -> InMemoryData:
method set_value (line 25) | def set_value(self, key: str, obj: Any) -> None:
method has (line 29) | def has(self, key: str) -> bool:
method remove (line 32) | def remove(self, key: str) -> None:
method empty (line 36) | def empty(self) -> bool:
method clear (line 39) | def clear(self) -> None:
method get_gen (line 42) | def get_gen(self) -> Iterator[tuple[str, Any]]:
method size (line 47) | def size(self) -> int:
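The repository's interface (`set_value` / `get_value` / `has` / `remove` / `get_gen` / `size`) reads like a keyed in-process cache with per-entry modification times. A toy reimplementation inferred from the method names alone — not the actual gokart code, and `InMemoryData`'s internals are guessed:

```python
from datetime import datetime
from typing import Any, Iterator


class ToyInMemoryCacheRepository:
    """Guessed behavior of gokart's InMemoryCacheRepository, from its method names only."""

    def __init__(self) -> None:
        # key -> (value, time the value was last set)
        self._cache: dict[str, tuple[Any, datetime]] = {}

    def set_value(self, key: str, obj: Any) -> None:
        self._cache[key] = (obj, datetime.now())

    def get_value(self, key: str) -> Any:
        return self._cache[key][0]

    def get_last_modification_time(self, key: str) -> datetime:
        return self._cache[key][1]

    def has(self, key: str) -> bool:
        return key in self._cache

    def remove(self, key: str) -> None:
        del self._cache[key]

    @property
    def empty(self) -> bool:
        return not self._cache

    def clear(self) -> None:
        self._cache.clear()

    def get_gen(self) -> Iterator[tuple[str, Any]]:
        for key, (obj, _) in self._cache.items():
            yield key, obj

    @property
    def size(self) -> int:
        return len(self._cache)


repo = ToyInMemoryCacheRepository()
repo.set_value('a', 1)
print(repo.has('a'), repo.size)  # True 1
```

Whether `empty` and `size` are properties or plain methods in the real class is not visible from the outline; properties are used here for brevity.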
FILE: gokart/in_memory/target.py
class InMemoryTarget (line 14) | class InMemoryTarget(TargetOnKart):
method __init__ (line 15) | def __init__(self, data_key: str, task_lock_param: TaskLockParams):
method _exists (line 22) | def _exists(self) -> bool:
method _get_task_lock_params (line 25) | def _get_task_lock_params(self) -> TaskLockParams:
method _load (line 28) | def _load(self) -> Any:
method _dump (line 31) | def _dump(
method _remove (line 40) | def _remove(self) -> None:
method _last_modification_time (line 43) | def _last_modification_time(self) -> datetime:
method _path (line 49) | def _path(self) -> str:
function make_in_memory_target (line 54) | def make_in_memory_target(target_key: str, task_lock_params: TaskLockPar...
FILE: gokart/info.py
function make_tree_info (line 14) | def make_tree_info(
class tree_info (line 47) | class tree_info(TaskOnKart[Any]):
method output (line 51) | def output(self):
FILE: gokart/mypy.py
class PluginOptions (line 66) | class PluginOptions(Enum):
class TaskOnKartPluginOptions (line 71) | class TaskOnKartPluginOptions:
method _parse_toml (line 77) | def _parse_toml(cls, config_file: str) -> dict[str, Any]:
method parse_config_file (line 91) | def parse_config_file(cls, config_file: str) -> TaskOnKartPluginOptions:
class TaskOnKartPlugin (line 106) | class TaskOnKartPlugin(Plugin):
method __init__ (line 107) | def __init__(self, options: Options) -> None:
method get_base_class_hook (line 114) | def get_base_class_hook(self, fullname: str) -> Callable[[ClassDefCont...
method get_function_hook (line 128) | def get_function_hook(self, fullname: str) -> Callable[[FunctionContex...
method _task_on_kart_class_maker_callback (line 134) | def _task_on_kart_class_maker_callback(self, ctx: ClassDefContext) -> ...
method _task_on_kart_parameter_field_callback (line 138) | def _task_on_kart_parameter_field_callback(self, ctx: FunctionContext)...
class TaskOnKartAttribute (line 172) | class TaskOnKartAttribute:
method __init__ (line 173) | def __init__(
method to_argument (line 193) | def to_argument(self, current_info: TypeInfo, *, of: Literal['__init__...
method expand_type (line 204) | def expand_type(self, current_info: TypeInfo) -> Type | None:
method to_var (line 214) | def to_var(self, current_info: TypeInfo) -> Var:
method serialize (line 217) | def serialize(self) -> JsonDict:
method deserialize (line 228) | def deserialize(cls, info: TypeInfo, data: JsonDict, api: SemanticAnal...
method expand_typevar_from_subtype (line 233) | def expand_typevar_from_subtype(self, sub_type: TypeInfo) -> None:
method _get_arg_kind_by_options (line 240) | def _get_arg_kind_by_options(self) -> Literal[ArgKind.ARG_NAMED, ArgKi...
class TaskOnKartTransformer (line 257) | class TaskOnKartTransformer:
method __init__ (line 260) | def __init__(
method transform (line 272) | def transform(self) -> bool:
method _get_assignment_statements_from_if_statement (line 296) | def _get_assignment_statements_from_if_statement(self, stmt: IfStmt) -...
method _get_assignment_statements_from_block (line 303) | def _get_assignment_statements_from_block(self, block: Block) -> Itera...
method collect_attributes (line 310) | def collect_attributes(self) -> list[TaskOnKartAttribute] | None:
method _collect_parameter_args (line 418) | def _collect_parameter_args(self, expr: Expression) -> tuple[bool, dic...
method _infer_type_from_parameters (line 438) | def _infer_type_from_parameters(self, parameter: Expression) -> Type |...
method _get_type_from_args (line 500) | def _get_type_from_args(self, parameter: Expression, arg_key: str) -> ...
function is_parameter_call (line 522) | def is_parameter_call(expr: Expression) -> bool:
function _extract_parameter_name (line 530) | def _extract_parameter_name(expr: Expression) -> str | None:
function plugin (line 561) | def plugin(version: str) -> type[Plugin]:
FILE: gokart/object_storage.py
class ObjectStorage (line 20) | class ObjectStorage:
method if_object_storage_path (line 22) | def if_object_storage_path(path: str) -> bool:
method get_object_storage_target (line 29) | def get_object_storage_target(path: str, format: Format) -> luigi.targ...
method exists (line 38) | def exists(path: str) -> bool:
method get_timestamp (line 47) | def get_timestamp(path: str) -> datetime:
method get_zip_client (line 60) | def get_zip_client(file_path: str, temporary_directory: str) -> ZipCli...
method is_buffered_reader (line 69) | def is_buffered_reader(file: object) -> bool:
FILE: gokart/pandas_type_config.py
class PandasTypeError (line 15) | class PandasTypeError(Exception):
class PandasTypeConfig (line 19) | class PandasTypeConfig(luigi.Config):
method type_dict (line 22) | def type_dict(cls) -> dict[str, Any]:
method check (line 26) | def check(cls, df: pd.DataFrame) -> None:
method _check_column (line 31) | def _check_column(cls, df: pd.DataFrame, column_name: str, column_type...
class PandasTypeConfigMap (line 40) | class PandasTypeConfigMap(luigi.Config):
method __init__ (line 43) | def __init__(self, *args: Any, **kwargs: Any) -> None:
method check (line 51) | def check(self, obj: Any, task_namespace: str) -> None:
FILE: gokart/parameter.py
class TaskInstanceParameter (line 34) | class TaskInstanceParameter(luigi.Parameter[TASK_ON_KART_TYPE], Generic[...
method __init__ (line 35) | def __init__(
method _recursive (line 50) | def _recursive(param_dict):
method _recursive_decompress (line 59) | def _recursive_decompress(s):
method parse (line 65) | def parse(self, s):
method serialize (line 70) | def serialize(self, x):
method _warn_on_wrong_param_type (line 75) | def _warn_on_wrong_param_type(self, param_name, param_value):
class _TaskInstanceEncoder (line 80) | class _TaskInstanceEncoder(json.JSONEncoder):
method default (line 81) | def default(self, obj):
class ListTaskInstanceParameter (line 88) | class ListTaskInstanceParameter(luigi.Parameter[list[TASK_ON_KART_TYPE]]...
method __init__ (line 89) | def __init__(
method parse (line 103) | def parse(self, s):
method serialize (line 106) | def serialize(self, x):
method _warn_on_wrong_param_type (line 109) | def _warn_on_wrong_param_type(self, param_name, param_value):
class ExplicitBoolParameter (line 115) | class ExplicitBoolParameter(luigi.BoolParameter):
method __init__ (line 116) | def __init__(self, *args, **kwargs):
method _parser_kwargs (line 119) | def _parser_kwargs(self, *args, **kwargs): # type: ignore
class Serializable (line 126) | class Serializable(Protocol):
method gokart_serialize (line 127) | def gokart_serialize(self) -> str:
method gokart_deserialize (line 134) | def gokart_deserialize(cls: type[T], s: str) -> T:
class SerializableParameter (line 142) | class SerializableParameter(luigi.Parameter[S], Generic[S]):
method __init__ (line 143) | def __init__(self, object_type: type[S], *args: Any, **kwargs: Any) ->...
method parse (line 147) | def parse(self, s: str) -> S:
method serialize (line 150) | def serialize(self, x: S) -> str:
class ZonedDateSecondParameter (line 154) | class ZonedDateSecondParameter(luigi.Parameter[datetime.datetime]):
method __init__ (line 163) | def __init__(self, **kwargs):
method parse (line 166) | def parse(self, s):
method serialize (line 176) | def serialize(self, dt):
method normalize (line 179) | def normalize(self, dt):
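The `Serializable` protocol (`gokart_serialize` / `gokart_deserialize`) lets arbitrary objects act as parameter values via `SerializableParameter(object_type=...)`. A hypothetical implementor of that protocol — the class itself and its JSON encoding are illustrative choices, not part of gokart:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical parameter value implementing gokart's Serializable protocol."""

    layers: int
    dropout: float

    def gokart_serialize(self) -> str:
        # Any stable string encoding works; sorted-key JSON is one reasonable choice,
        # since the serialized form typically feeds into the task's unique id.
        return json.dumps({'layers': self.layers, 'dropout': self.dropout}, sort_keys=True)

    @classmethod
    def gokart_deserialize(cls, s: str) -> 'ModelConfig':
        return cls(**json.loads(s))


cfg = ModelConfig(layers=3, dropout=0.1)
print(ModelConfig.gokart_deserialize(cfg.gokart_serialize()) == cfg)  # True
```

Per the outline's signature, such a type would then be declared on a task as `gokart.SerializableParameter(object_type=ModelConfig)`.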
FILE: gokart/required_task_output.py
class RequiredTaskOutput (line 5) | class RequiredTaskOutput:
method serialize (line 9) | def serialize(self) -> dict[str, str]:
FILE: gokart/run.py
function _run_tree_info (line 26) | def _run_tree_info(cmdline_args, details):
function _try_tree_info (line 31) | def _try_tree_info(cmdline_args):
function _try_to_delete_unnecessary_output_file (line 51) | def _try_to_delete_unnecessary_output_file(cmdline_args: list[str]) -> N...
function _try_get_slack_api (line 62) | def _try_get_slack_api(cmdline_args: list[str]) -> gokart.slack.SlackAPI...
function _try_to_send_event_summary_to_slack (line 75) | def _try_to_send_event_summary_to_slack(
function _run_with_retcodes (line 92) | def _run_with_retcodes(argv):
function run (line 133) | def run(cmdline_args=None, set_retcode=True):
FILE: gokart/s3_config.py
class S3Config (line 9) | class S3Config(luigi.Config):
method get_s3_client (line 15) | def get_s3_client(self) -> luigi.contrib.s3.S3Client:
method _get_s3_client (line 20) | def _get_s3_client(self) -> luigi.contrib.s3.S3Client:
FILE: gokart/s3_zip_client.py
class S3ZipClient (line 11) | class S3ZipClient(ZipClient):
method __init__ (line 12) | def __init__(self, file_path: str, temporary_directory: str) -> None:
method exists (line 17) | def exists(self) -> bool:
method make_archive (line 20) | def make_archive(self) -> None:
method unpack_archive (line 28) | def unpack_archive(self) -> None:
method remove (line 33) | def remove(self) -> None:
method path (line 37) | def path(self) -> str:
method _temporary_file_path (line 40) | def _temporary_file_path(self):
FILE: gokart/slack/event_aggregator.py
class FailureEvent (line 12) | class FailureEvent(TypedDict):
class EventAggregator (line 17) | class EventAggregator:
method __init__ (line 18) | def __init__(self) -> None:
method set_handlers (line 22) | def set_handlers(self):
method get_summary (line 27) | def get_summary(self) -> str:
method get_event_list (line 30) | def get_event_list(self) -> str:
method _success (line 42) | def _success(self, task):
method _failure (line 45) | def _failure(self, task, exception):
method _task_to_str (line 50) | def _task_to_str(task: Any) -> str:
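`EventAggregator` appears to hook task success/failure events (presumably via luigi's event-handler mechanism) and render a summary plus a failure list for the Slack notification. A stripped-down stand-in showing that aggregation pattern — the wiring to luigi and the summary wording here are invented:

```python
from typing import Any


class ToyEventAggregator:
    """Collects task results and summarizes them, mimicking the outline's interface."""

    def __init__(self) -> None:
        self._success: list[str] = []
        self._failure: list[dict[str, Any]] = []

    # In gokart these would be registered as luigi event handlers by set_handlers().
    def _on_success(self, task: str) -> None:
        self._success.append(task)

    def _on_failure(self, task: str, exception: Exception) -> None:
        self._failure.append({'task': task, 'exception': repr(exception)})

    def get_summary(self) -> str:
        return f'success: {len(self._success)}, failure: {len(self._failure)}'

    def get_event_list(self) -> str:
        return '\n'.join(f'{e["task"]}: {e["exception"]}' for e in self._failure)


agg = ToyEventAggregator()
agg._on_success('TaskA')
agg._on_failure('TaskB', ValueError('bad input'))
print(agg.get_summary())  # success: 1, failure: 1
```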
FILE: gokart/slack/slack_api.py
class ChannelListNotLoadedError (line 10) | class ChannelListNotLoadedError(RuntimeError):
class ChannelNotFoundError (line 14) | class ChannelNotFoundError(RuntimeError):
class FileNotUploadedError (line 18) | class FileNotUploadedError(RuntimeError):
class SlackAPI (line 22) | class SlackAPI:
method __init__ (line 23) | def __init__(self, token: str, channel: str, to_user: str) -> None:
method _get_channel_id (line 28) | def _get_channel_id(self, channel_name):
method send_snippet (line 41) | def send_snippet(self, comment, title, content):
FILE: gokart/slack/slack_config.py
class SlackConfig (line 6) | class SlackConfig(luigi.Config):
FILE: gokart/target.py
class TargetOnKart (line 28) | class TargetOnKart(luigi.Target):
method exists (line 29) | def exists(self) -> bool:
method load (line 32) | def load(self) -> Any:
method dump (line 35) | def dump(
method remove (line 53) | def remove(self) -> None:
method last_modification_time (line 57) | def last_modification_time(self) -> datetime:
method path (line 60) | def path(self) -> str:
method _exists (line 64) | def _exists(self) -> bool:
method _get_task_lock_params (line 68) | def _get_task_lock_params(self) -> TaskLockParams:
method _load (line 72) | def _load(self) -> Any:
method _dump (line 76) | def _dump(
method _remove (line 86) | def _remove(self) -> None:
method _last_modification_time (line 90) | def _last_modification_time(self) -> datetime:
method _path (line 94) | def _path(self) -> str:
class SingleFileTarget (line 98) | class SingleFileTarget(TargetOnKart):
method __init__ (line 99) | def __init__(
method _exists (line 109) | def _exists(self) -> bool:
method _get_task_lock_params (line 112) | def _get_task_lock_params(self) -> TaskLockParams:
method _load (line 115) | def _load(self) -> Any:
method _dump (line 119) | def _dump(
method _remove (line 133) | def _remove(self) -> None:
method _last_modification_time (line 136) | def _last_modification_time(self) -> datetime:
method _path (line 139) | def _path(self) -> str:
class ModelTarget (line 143) | class ModelTarget(TargetOnKart):
method __init__ (line 144) | def __init__(
method _exists (line 158) | def _exists(self) -> bool:
method _get_task_lock_params (line 161) | def _get_task_lock_params(self) -> TaskLockParams:
method _load (line 164) | def _load(self) -> Any:
method _dump (line 171) | def _dump(
method _remove (line 186) | def _remove(self) -> None:
method _last_modification_time (line 189) | def _last_modification_time(self) -> datetime:
method _path (line 192) | def _path(self) -> str:
method _model_path (line 195) | def _model_path(self):
method _load_function_path (line 198) | def _load_function_path(self):
method _remove_temporary_directory (line 201) | def _remove_temporary_directory(self):
method _make_temporary_directory (line 204) | def _make_temporary_directory(self):
class LargeDataFrameProcessor (line 208) | class LargeDataFrameProcessor:
method __init__ (line 209) | def __init__(self, max_byte: int):
method save (line 212) | def save(self, df: pd.DataFrame, file_path: str) -> None:
method load (line 226) | def load(file_path: str) -> pd.DataFrame:
function _make_file_system_target (line 232) | def _make_file_system_target(file_path: str, processor: FileProcessor | ...
function _make_file_path (line 239) | def _make_file_path(original_path: str, unique_id: str | None = None) ->...
function _get_last_modification_time (line 246) | def _get_last_modification_time(path: str) -> datetime:
function make_target (line 254) | def make_target(
function make_model_target (line 268) | def make_model_target(
FILE: gokart/task.py
class EmptyDumpError (line 38) | class EmptyDumpError(AssertionError):
class TaskOnKart (line 42) | class TaskOnKart(luigi.Task, Generic[T]):
method priority (line 121) | def priority(self):
method __init__ (line 124) | def __init__(self, *args, **kwargs):
method input (line 148) | def input(self) -> FlattenableItems[TargetOnKart]:
method output (line 151) | def output(self) -> FlattenableItems[TargetOnKart]:
method requires (line 154) | def requires(self) -> FlattenableItems[TaskOnKart[Any]]:
method make_task_instance_dictionary (line 158) | def make_task_instance_dictionary(self) -> dict[str, TaskOnKart[Any]]:
method is_task_on_kart (line 162) | def is_task_on_kart(value):
method _add_configuration (line 166) | def _add_configuration(cls, kwargs, section):
method complete (line 176) | def complete(self) -> bool:
method _check_modification_time (line 195) | def _check_modification_time(self) -> bool:
method clone (line 209) | def clone(self, cls=None, **kwargs):
method make_target (line 223) | def make_target(self, relative_file_path: str | None = None, use_uniqu...
method _create_processor_for_dataframe_type (line 247) | def _create_processor_for_dataframe_type(self, file_path: str) -> File...
method make_large_data_frame_target (line 251) | def make_large_data_frame_target(self, relative_file_path: str | None ...
method make_model_target (line 275) | def make_model_target(
method load (line 308) | def load(self, target: None | str | TargetOnKart = None) -> Any: ...
method load (line 311) | def load(self, target: TaskOnKart[K]) -> K: ...
method load (line 314) | def load(self, target: list[TaskOnKart[K]]) -> list[K]: ...
method load (line 316) | def load(self, target: None | str | TargetOnKart | TaskOnKart[K] | lis...
method load_generator (line 327) | def load_generator(self, target: None | str | TargetOnKart = None) -> ...
method load_generator (line 330) | def load_generator(self, target: list[TaskOnKart[K]]) -> Generator[K, ...
method load_generator (line 332) | def load_generator(self, target: None | str | TargetOnKart | list[Task...
method dump (line 346) | def dump(self, obj: T, target: None = None, custom_labels: dict[Any, A...
method dump (line 349) | def dump(self, obj: Any, target: str | TargetOnKart, custom_labels: di...
method dump (line 351) | def dump(self, obj: Any, target: None | str | TargetOnKart = None, cus...
method get_code (line 371) | def get_code(target_class: Any) -> set[str]:
method get_own_code (line 377) | def get_own_code(self):
method make_unique_id (line 382) | def make_unique_id(self) -> str:
method _make_hash_id (line 388) | def _make_hash_id(self) -> str:
method _get_input_targets (line 406) | def _get_input_targets(self, target: None | str | TargetOnKart | TaskO...
method _get_output_target (line 422) | def _get_output_target(self, target: None | str | TargetOnKart) -> Tar...
method get_info (line 435) | def get_info(self, only_significant=False):
method _get_task_log_target (line 446) | def _get_task_log_target(self):
method get_task_log (line 449) | def get_task_log(self) -> dict[str, Any]:
method _dump_task_log (line 458) | def _dump_task_log(self):
method _get_task_params_target (line 463) | def _get_task_params_target(self):
method get_task_params (line 466) | def get_task_params(self) -> dict[str, Any]:
method _set_random_seed (line 473) | def _set_random_seed(self):
method _get_random_seeds_target (line 479) | def _get_random_seeds_target(self):
method try_set_seed (line 483) | def try_set_seed(methods: list[str], random_seed: int) -> list[str]:
method _get_random_seed (line 499) | def _get_random_seed(self):
method _dump_task_params (line 505) | def _dump_task_params(self):
method _get_processing_time_target (line 509) | def _get_processing_time_target(self):
method get_processing_time (line 512) | def get_processing_time(self) -> str:
method _dump_processing_time (line 519) | def _dump_processing_time(self, processing_time):
method restore (line 524) | def restore(cls, unique_id):
method _log_unique_id (line 529) | def _log_unique_id(self, exception):
method _dump_module_versions (line 533) | def _dump_module_versions(self):
method _get_module_versions_target (line 537) | def _get_module_versions_target(self):
method _get_module_versions (line 540) | def _get_module_versions(self) -> str:
method __repr__ (line 552) | def __repr__(self):
method __str__ (line 559) | def __str__(self):
method _get_task_string (line 567) | def _get_task_string(self, only_public=False):
method _make_representation (line 585) | def _make_representation(self, param_obj: luigi.Parameter, param_value...
FILE: gokart/task_complete_check.py
function task_complete_check_wrapper (line 11) | def task_complete_check_wrapper(run_func: Callable[..., Any], complete_c...
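`task_complete_check_wrapper` takes a run function and a completion check and returns a wrapped run — presumably so a task whose output has appeared in the meantime (e.g. produced by another worker) is skipped instead of recomputed. A sketch of that wrapping idea; the skip-and-log behavior is an assumption, not gokart's exact code:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def task_complete_check_wrapper(
    run_func: Callable[..., Any],
    complete_check_func: Callable[[], bool],
) -> Callable[..., Any]:
    """Skip run_func when the task is already complete (sketch, not gokart's code)."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if complete_check_func():
            # Output already exists, e.g. another worker finished first.
            logger.info('task is already completed; skipping run')
            return None
        return run_func(*args, **kwargs)

    return wrapped


calls: list[str] = []
skipped = task_complete_check_wrapper(lambda: calls.append('ran'), lambda: True)
skipped()
print(calls)  # [] — run skipped because the completion check returned True
```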
FILE: gokart/testing/check_if_run_with_empty_data_frame.py
class test_run (line 18) | class test_run(gokart.TaskOnKart[Any]):
class _TestStatus (line 25) | class _TestStatus:
method __init__ (line 26) | def __init__(self, task: gokart.TaskOnKart[Any]) -> None:
method format (line 33) | def format(self) -> str:
method fail (line 39) | def fail(self) -> bool:
function _get_all_tasks (line 43) | def _get_all_tasks(task: gokart.TaskOnKart[Any]) -> list[gokart.TaskOnKa...
function _run_with_test_status (line 50) | def _run_with_test_status(task: gokart.TaskOnKart[Any]) -> _TestStatus:
function _test_run_with_empty_data_frame (line 60) | def _test_run_with_empty_data_frame(cmdline_args: list[str], test_run_pa...
function try_to_run_test_for_empty_data_frame (line 82) | def try_to_run_test_for_empty_data_frame(cmdline_args: list[str]) -> None:
FILE: gokart/testing/pandas_assert.py
function assert_frame_contents_equal (line 8) | def assert_frame_contents_equal(actual: pd.DataFrame, expected: pd.DataF...
FILE: gokart/tree/task_info.py
function make_task_info_as_tree_str (line 13) | def make_task_info_as_tree_str(task: TaskOnKart[Any], details: bool = Fa...
function make_task_info_as_table (line 37) | def make_task_info_as_table(task: TaskOnKart[Any], ignore_task_names: li...
function dump_task_info_table (line 58) | def dump_task_info_table(task: TaskOnKart[Any], task_info_dump_path: str...
function dump_task_info_tree (line 83) | def dump_task_info_tree(task: TaskOnKart[Any], task_info_dump_path: str,...
FILE: gokart/tree/task_info_formatter.py
class TaskInfo (line 13) | class TaskInfo:
method get_task_id (line 24) | def get_task_id(self):
method get_task_title (line 27) | def get_task_title(self):
method get_task_detail (line 30) | def get_task_detail(self):
method task_info_dict (line 33) | def task_info_dict(self):
class RequiredTask (line 46) | class RequiredTask(NamedTuple):
function _make_requires_info (line 51) | def _make_requires_info(requires):
function make_task_info_tree (line 62) | def make_task_info_tree(task: TaskOnKart[Any], ignore_task_names: list[s...
function make_tree_info (line 104) | def make_tree_info(task_info: TaskInfo, indent: str, last: bool, details...
function make_tree_info_table_list (line 131) | def make_tree_info_table_list(task_info: TaskInfo, visited_tasks: set[st...
FILE: gokart/utils.py
class FileLike (line 13) | class FileLike(Protocol):
method read (line 14) | def read(self, n: int) -> bytes: ...
method readline (line 16) | def readline(self) -> bytes: ...
method seek (line 18) | def seek(self, offset: int) -> None: ...
method seekable (line 20) | def seekable(self) -> bool: ...
function add_config (line 23) | def add_config(file_path: str) -> None:
function flatten (line 33) | def flatten(targets: FlattenableItems[T]) -> list[T]:
function map_flattenable_items (line 72) | def map_flattenable_items(func: Callable[[T], K], items: FlattenableItem...
function load_dill_with_pandas_backward_compatibility (line 84) | def load_dill_with_pandas_backward_compatibility(file: FileLike | BytesI...
function get_dataframe_type_from_task (line 97) | def get_dataframe_type_from_task(task: Any) -> Literal['pandas', 'polars...
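`flatten` and `map_flattenable_items` operate on `FlattenableItems` — nested combinations of dicts, lists, and single values, as returned by `requires()` or `output()`. A stdlib approximation of what `flatten` does (illustrative; gokart's version may differ in edge cases):

```python
from typing import Any


def flatten(targets: Any) -> list[Any]:
    """Flatten nested dicts/lists/tuples of items into a flat list (illustrative)."""
    if targets is None:
        return []
    if isinstance(targets, dict):
        # Dict values may themselves be nested structures.
        return [item for value in targets.values() for item in flatten(value)]
    if isinstance(targets, (list, tuple)):
        return [item for value in targets for item in flatten(value)]
    # A bare item (e.g. a single target) becomes a one-element list.
    return [targets]


print(flatten({'a': [1, 2], 'b': 3}))  # [1, 2, 3]
```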
FILE: gokart/worker.py
function _is_external (line 88) | def _is_external(task: Task) -> bool:
function _get_retry_policy_dict (line 92) | def _get_retry_policy_dict(task: Task) -> dict[str, Any]:
class TaskProcess (line 109) | class TaskProcess(_ForkProcess): # type: ignore[valid-type, misc]
method __init__ (line 124) | def __init__(
method _run_task (line 153) | def _run_task(self) -> collections.abc.Generator[Any, Any, Any] | None:
method _run_get_new_deps (line 159) | def _run_get_new_deps(self) -> list[tuple[str, str, dict[str, str]]] |...
method run (line 187) | def run(self) -> None:
method _handle_run_exception (line 261) | def _handle_run_exception(self, ex: BaseException) -> str:
method _recursive_terminate (line 266) | def _recursive_terminate(self) -> None:
method terminate (line 286) | def terminate(self) -> None:
method _forward_attributes (line 295) | def _forward_attributes(self):
class ContextManagedTaskProcess (line 309) | class ContextManagedTaskProcess(TaskProcess):
method __init__ (line 310) | def __init__(self, context: Any, *args: Any, **kwargs: Any) -> None:
method run (line 314) | def run(self) -> None:
class gokart_worker (line 327) | class gokart_worker(luigi.Config):
class Worker (line 411) | class Worker:
method __init__ (line 421) | def __init__(
method _add_task (line 485) | def _add_task(self, *args, **kwargs):
method __enter__ (line 509) | def __enter__(self) -> Worker:
method __exit__ (line 518) | def __exit__(self, type: Any, value: Any, traceback: Any) -> Literal[F...
method _generate_worker_info (line 530) | def _generate_worker_info(self) -> list[tuple[str, Any]]:
method _generate_worker_id (line 554) | def _generate_worker_id(self, worker_info: list[Any]) -> str:
method _validate_task (line 558) | def _validate_task(self, task: Task) -> None:
method _log_complete_error (line 568) | def _log_complete_error(self, task: Task, tb: str) -> None:
method _log_dependency_error (line 572) | def _log_dependency_error(self, task: Task, tb: str) -> None:
method _log_unexpected_error (line 576) | def _log_unexpected_error(self, task: Task) -> None:
method _announce_scheduling_failure (line 579) | def _announce_scheduling_failure(self, task: Task, expl: Any) -> None:
method _email_complete_error (line 594) | def _email_complete_error(self, task: Task, formatted_traceback: str) ...
method _email_dependency_error (line 604) | def _email_dependency_error(self, task: Task, formatted_traceback: str...
method _email_unexpected_error (line 614) | def _email_unexpected_error(self, task: Task, formatted_traceback: str...
method _email_task_failure (line 625) | def _email_task_failure(self, task: Task, formatted_traceback: str) ->...
method _email_error (line 634) | def _email_error(self, task: Task, formatted_traceback: str, subject: ...
method _handle_task_load_error (line 641) | def _handle_task_load_error(self, exception: Exception, task_ids: list...
method add (line 656) | def add(self, task: Task, multiprocess: bool = False, processes: int =...
method _add_task_batcher (line 702) | def _add_task_batcher(self, task: Task) -> None:
method _add (line 716) | def _add(self, task: Task, is_complete: bool) -> Generator[Task, None,...
method _validate_dependency (line 801) | def _validate_dependency(self, dependency: Task) -> None:
method _check_complete_value (line 807) | def _check_complete_value(self, is_complete: bool | luigi.worker.Trace...
method _add_worker (line 813) | def _add_worker(self) -> None:
method _log_remote_tasks (line 817) | def _log_remote_tasks(self, get_work_response: GetWorkResponse) -> None:
method _get_work_task_id (line 830) | def _get_work_task_id(self, get_work_response: dict[str, Any]) -> str ...
method _get_work (line 858) | def _get_work(self) -> GetWorkResponse:
method _run_task (line 910) | def _run_task(self, task_id: str) -> None:
method _create_task_process (line 929) | def _create_task_process(self, task):
method _purge_children (line 947) | def _purge_children(self) -> None:
method _handle_next_task (line 967) | def _handle_next_task(self) -> None:
method _sleeper (line 1035) | def _sleeper(self) -> Generator[None, None, None]:
method _keep_alive (line 1044) | def _keep_alive(self, get_work_response: Any) -> bool:
method handle_interrupt (line 1074) | def handle_interrupt(self, signum: int, _: Any) -> None:
method _start_phasing_out (line 1081) | def _start_phasing_out(self) -> None:
method run (line 1089) | def run(self) -> bool:
method _handle_rpc_message (line 1134) | def _handle_rpc_message(self, message: dict[str, Any]) -> None:
method set_worker_processes (line 1154) | def set_worker_processes(self, n: int) -> None:
method dispatch_scheduler_message (line 1162) | def dispatch_scheduler_message(self, task_id: str, message_id: str, co...
FILE: gokart/workspace_management.py
function _get_all_output_file_paths (line 15) | def _get_all_output_file_paths(task: gokart.TaskOnKart[Any]) -> list[str]:
function delete_local_unnecessary_outputs (line 22) | def delete_local_unnecessary_outputs(task: gokart.TaskOnKart[Any]) -> None:
FILE: gokart/zip_client.py
function _unzip_file (line 10) | def _unzip_file(fp: str | IO[bytes] | os.PathLike[str], extract_dir: str...
class ZipClient (line 16) | class ZipClient:
method exists (line 18) | def exists(self) -> bool:
method make_archive (line 22) | def make_archive(self) -> None:
method unpack_archive (line 26) | def unpack_archive(self) -> None:
method remove (line 30) | def remove(self) -> None:
method path (line 35) | def path(self) -> str:
class LocalZipClient (line 39) | class LocalZipClient(ZipClient):
method __init__ (line 40) | def __init__(self, file_path: str, temporary_directory: str) -> None:
method exists (line 44) | def exists(self) -> bool:
method make_archive (line 47) | def make_archive(self) -> None:
method unpack_archive (line 51) | def unpack_archive(self) -> None:
method remove (line 54) | def remove(self) -> None:
method path (line 58) | def path(self) -> str:
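`ZipClient` abstracts archive handling across local, S3, and GCS backends (used by model targets, which bundle a model and its load function into one archive); `LocalZipClient` is the simplest case. A self-contained sketch of that role using `shutil` — close in spirit to the outline, though the real class's details may differ:

```python
import os
import shutil
import tempfile


class ToyLocalZipClient:
    """Archive a working directory to a .zip and back (sketch of LocalZipClient's role)."""

    def __init__(self, file_path: str, temporary_directory: str) -> None:
        self._file_path = file_path  # e.g. '.../model.zip'
        self._temporary_directory = temporary_directory

    def exists(self) -> bool:
        return os.path.exists(self._file_path)

    def make_archive(self) -> None:
        # shutil.make_archive appends the format suffix itself, so strip '.zip'.
        base, _ = os.path.splitext(self._file_path)
        shutil.make_archive(base, 'zip', root_dir=self._temporary_directory)

    def unpack_archive(self) -> None:
        shutil.unpack_archive(self._file_path, extract_dir=self._temporary_directory)

    def remove(self) -> None:
        os.remove(self._file_path)


# Round-trip demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    workdir = os.path.join(tmp, 'work')
    os.makedirs(workdir)
    with open(os.path.join(workdir, 'model.txt'), 'w') as f:
        f.write('weights')
    client = ToyLocalZipClient(os.path.join(tmp, 'model.zip'), workdir)
    client.make_archive()
    print(client.exists())  # True
```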
FILE: gokart/zip_client_util.py
function make_zip_client (line 7) | def make_zip_client(file_path: str, temporary_directory: str) -> ZipClient:
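`make_zip_client` is presumably a factory that picks a `ZipClient` implementation from the path's scheme. A hypothetical dispatch along those lines (function name and return values here are illustrative strings, not the real client objects):

```python
from urllib.parse import urlparse


def pick_zip_client(file_path: str) -> str:
    """Choose a ZipClient implementation name by URL scheme (illustrative dispatch)."""
    scheme = urlparse(file_path).scheme
    # Local paths have no scheme and fall through to the local client.
    return {'s3': 'S3ZipClient', 'gs': 'GCSZipClient'}.get(scheme, 'LocalZipClient')


print(pick_zip_client('s3://bucket/model.zip'))  # S3ZipClient
```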
FILE: test/conflict_prevention_lock/test_task_lock.py
class TestRedisClient (line 10) | class TestRedisClient(unittest.TestCase):
method _get_randint (line 12) | def _get_randint(host, port):
method test_redis_client_is_singleton (line 15) | def test_redis_client_is_singleton(self):
class TestMakeRedisKey (line 29) | class TestMakeRedisKey(unittest.TestCase):
method test_make_redis_key (line 30) | def test_make_redis_key(self):
class TestMakeRedisParams (line 35) | class TestMakeRedisParams(unittest.TestCase):
method test_make_task_lock_params_with_valid_host (line 36) | def test_make_task_lock_params_with_valid_host(self):
method test_make_task_lock_params_with_no_host (line 51) | def test_make_task_lock_params_with_no_host(self):
method test_assert_when_redis_timeout_is_too_short (line 66) | def test_assert_when_redis_timeout_is_too_short(self):
class TestMakeTaskLockParamsForRun (line 77) | class TestMakeTaskLockParamsForRun(unittest.TestCase):
method test_make_task_lock_params_for_run (line 78) | def test_make_task_lock_params_for_run(self):
FILE: test/conflict_prevention_lock/test_task_lock_wrappers.py
function _sample_func_with_error (line 11) | def _sample_func_with_error(a: int, b: str) -> None:
function _sample_long_func (line 15) | def _sample_long_func(a: int, b: str) -> dict[str, int | str]:
class TestWrapDumpWithLock (line 20) | class TestWrapDumpWithLock(unittest.TestCase):
method test_no_redis (line 21) | def test_no_redis(self):
method test_use_redis (line 36) | def test_use_redis(self):
method test_if_func_is_skipped_when_cache_already_exists (line 54) | def test_if_func_is_skipped_when_cache_already_exists(self):
method test_check_lock_extended (line 69) | def test_check_lock_extended(self):
method test_lock_is_removed_after_func_is_finished (line 83) | def test_lock_is_removed_after_func_is_finished(self):
method test_lock_is_removed_after_func_is_finished_with_error (line 107) | def test_lock_is_removed_after_func_is_finished_with_error(self):
class TestWrapLoadWithLock (line 127) | class TestWrapLoadWithLock(unittest.TestCase):
method test_no_redis (line 128) | def test_no_redis(self):
method test_use_redis (line 145) | def test_use_redis(self):
method test_check_lock_extended (line 165) | def test_check_lock_extended(self):
method test_lock_is_removed_after_func_is_finished (line 181) | def test_lock_is_removed_after_func_is_finished(self):
method test_lock_is_removed_after_func_is_finished_with_error (line 206) | def test_lock_is_removed_after_func_is_finished_with_error(self):
class TestWrapRemoveWithLock (line 226) | class TestWrapRemoveWithLock(unittest.TestCase):
method test_no_redis (line 227) | def test_no_redis(self):
method test_use_redis (line 243) | def test_use_redis(self):
method test_check_lock_extended (line 262) | def test_check_lock_extended(self):
method test_lock_is_removed_after_func_is_finished (line 278) | def test_lock_is_removed_after_func_is_finished(self):
method test_lock_is_removed_after_func_is_finished_with_error (line 302) | def test_lock_is_removed_after_func_is_finished_with_error(self):
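The lock-wrapper tests above repeatedly verify one invariant: the lock is released after the wrapped function finishes, whether it returns normally or raises. gokart's real wrappers (gokart/conflict_prevention_lock/task_lock_wrappers.py) use a Redis lock; the sketch below swaps that for a `threading.Lock` to show the same acquire/finally-release pattern with only the standard library.

```python
import threading
from functools import wraps
from typing import Any, Callable, TypeVar

T = TypeVar('T')


def wrap_with_lock(lock: threading.Lock, func: Callable[..., T]) -> Callable[..., T]:
    """Run func while holding lock, releasing it even if func raises.

    This mirrors the property the *_with_error tests check against Redis
    locks; the function name and signature here are illustrative only.
    """

    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> T:
        lock.acquire()
        try:
            return func(*args, **kwargs)
        finally:
            # The finally block guarantees release on both success and error.
            lock.release()

    return wrapper
```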
FILE: test/file_processor/test_base.py
class TestPickleFileProcessor (line 18) | class TestPickleFileProcessor(unittest.TestCase):
method test_dump_and_load_normal_obj (line 19) | def test_dump_and_load_normal_obj(self):
method test_dump_and_load_class (line 33) | def test_dump_and_load_class(self):
method test_dump_and_load_with_readables3file (line 64) | def test_dump_and_load_with_readables3file(self):
FILE: test/file_processor/test_factory.py
class TestMakeFileProcessor (line 19) | class TestMakeFileProcessor(unittest.TestCase):
method test_make_file_processor_with_txt_extension (line 20) | def test_make_file_processor_with_txt_extension(self):
method test_make_file_processor_with_csv_extension (line 24) | def test_make_file_processor_with_csv_extension(self):
method test_make_file_processor_with_gz_extension (line 28) | def test_make_file_processor_with_gz_extension(self):
method test_make_file_processor_with_json_extension (line 32) | def test_make_file_processor_with_json_extension(self):
method test_make_file_processor_with_ndjson_extension (line 36) | def test_make_file_processor_with_ndjson_extension(self):
method test_make_file_processor_with_npz_extension (line 40) | def test_make_file_processor_with_npz_extension(self):
method test_make_file_processor_with_parquet_extension (line 44) | def test_make_file_processor_with_parquet_extension(self):
method test_make_file_processor_with_feather_extension (line 48) | def test_make_file_processor_with_feather_extension(self):
method test_make_file_processor_with_unsupported_extension (line 52) | def test_make_file_processor_with_unsupported_extension(self):
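The factory tests above exercise one behavior per extension: `make_file_processor` picks a processor from the file extension and rejects unknown ones. A minimal sketch of that dispatch pattern, assuming a registry passed in explicitly (gokart keys its processors internally and handles compound extensions like `.gz` with more care than shown here):

```python
from typing import Any, Callable


class UnsupportedExtensionError(ValueError):
    """Raised when no processor is registered for a file extension."""


def make_file_processor(file_path: str, registry: dict[str, Callable[[], Any]]) -> Any:
    """Pick a processor factory by file extension; illustrative only.

    The registry parameter is a stand-in for gokart's internal mapping
    from extensions (txt, csv, gz, json, ndjson, npz, parquet, feather)
    to FileProcessor subclasses.
    """
    extension = file_path.rsplit('.', 1)[-1] if '.' in file_path else ''
    try:
        return registry[extension]()
    except KeyError:
        raise UnsupportedExtensionError(f'unsupported extension {extension!r} for {file_path}')
```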
FILE: test/file_processor/test_pandas.py
class TestCsvFileProcessor (line 15) | class TestCsvFileProcessor(unittest.TestCase):
method test_dump_csv_with_utf8 (line 16) | def test_dump_csv_with_utf8(self):
method test_dump_csv_with_cp932 (line 31) | def test_dump_csv_with_cp932(self):
method test_load_csv_with_utf8 (line 46) | def test_load_csv_with_utf8(self):
method test_load_csv_with_cp932 (line 60) | def test_load_csv_with_cp932(self):
class TestJsonFileProcessor (line 75) | class TestJsonFileProcessor:
method test_dump_and_load_json (line 97) | def test_dump_and_load_json(self, orient, input_data, expected_json):
class TestFeatherFileProcessor (line 116) | class TestFeatherFileProcessor(unittest.TestCase):
method test_feather_should_return_same_dataframe (line 117) | def test_feather_should_return_same_dataframe(self):
method test_feather_should_save_index_name (line 133) | def test_feather_should_save_index_name(self):
method test_feather_should_raise_error_index_name_is_None (line 149) | def test_feather_should_raise_error_index_name_is_None(self):
FILE: test/file_processor/test_polars.py
class TestCsvFileProcessorWithPolars (line 26) | class TestCsvFileProcessorWithPolars:
method test_dump_polars_dataframe (line 29) | def test_dump_polars_dataframe(self):
method test_load_polars_dataframe (line 45) | def test_load_polars_dataframe(self):
method test_dump_and_load_polars_roundtrip (line 61) | def test_dump_and_load_polars_roundtrip(self):
method test_dump_polars_with_pandas_load (line 79) | def test_dump_polars_with_pandas_load(self):
method test_polars_with_different_separator (line 101) | def test_polars_with_different_separator(self):
method test_error_when_polars_not_available_for_load (line 119) | def test_error_when_polars_not_available_for_load(self):
class TestJsonFileProcessorWithPolars (line 128) | class TestJsonFileProcessorWithPolars:
method test_dump_polars_dataframe (line 131) | def test_dump_polars_dataframe(self):
method test_load_polars_dataframe (line 147) | def test_load_polars_dataframe(self):
method test_dump_and_load_polars_roundtrip (line 163) | def test_dump_and_load_polars_roundtrip(self):
method test_dump_and_load_ndjson_with_polars (line 181) | def test_dump_and_load_ndjson_with_polars(self):
method test_dump_polars_with_pandas_load (line 199) | def test_dump_polars_with_pandas_load(self):
class TestParquetFileProcessorWithPolars (line 224) | class TestParquetFileProcessorWithPolars:
method test_dump_polars_dataframe (line 227) | def test_dump_polars_dataframe(self):
method test_load_polars_dataframe (line 243) | def test_load_polars_dataframe(self):
method test_dump_and_load_polars_roundtrip (line 259) | def test_dump_and_load_polars_roundtrip(self):
method test_dump_polars_with_pandas_load (line 277) | def test_dump_polars_with_pandas_load(self):
method test_parquet_with_compression (line 298) | def test_parquet_with_compression(self):
class TestFeatherFileProcessorWithPolars (line 318) | class TestFeatherFileProcessorWithPolars:
method test_dump_polars_dataframe (line 321) | def test_dump_polars_dataframe(self):
method test_load_polars_dataframe (line 337) | def test_load_polars_dataframe(self):
method test_dump_and_load_polars_roundtrip (line 353) | def test_dump_and_load_polars_roundtrip(self):
method test_dump_polars_with_pandas_load (line 371) | def test_dump_polars_with_pandas_load(self):
class TestLazyFrameSupport (line 395) | class TestLazyFrameSupport:
method test_csv_load_lazy (line 398) | def test_csv_load_lazy(self):
method test_csv_dump_lazyframe (line 414) | def test_csv_dump_lazyframe(self):
method test_parquet_load_lazy (line 430) | def test_parquet_load_lazy(self):
method test_parquet_dump_lazyframe (line 446) | def test_parquet_dump_lazyframe(self):
method test_feather_load_lazy (line 462) | def test_feather_load_lazy(self):
method test_feather_dump_lazyframe (line 478) | def test_feather_dump_lazyframe(self):
method test_json_load_lazy_ndjson (line 494) | def test_json_load_lazy_ndjson(self):
method test_json_dump_lazyframe_ndjson (line 510) | def test_json_dump_lazyframe_ndjson(self):
method test_json_load_lazy_standard (line 526) | def test_json_load_lazy_standard(self):
method test_json_dump_lazyframe_standard (line 542) | def test_json_dump_lazyframe_standard(self):
method test_polars_returns_dataframe (line 558) | def test_polars_returns_dataframe(self):
FILE: test/in_memory/test_in_memory_target.py
class TestInMemoryTarget (line 10) | class TestInMemoryTarget:
method task_lock_params (line 12) | def task_lock_params(self) -> TaskLockParams:
method target (line 24) | def target(self, task_lock_params: TaskLockParams) -> InMemoryTarget:
method clear_repo (line 28) | def clear_repo(self) -> None:
method test_dump_and_load_data (line 31) | def test_dump_and_load_data(self, target: InMemoryTarget) -> None:
method test_exist (line 37) | def test_exist(self, target: InMemoryTarget) -> None:
method test_last_modified_time (line 42) | def test_last_modified_time(self, target: InMemoryTarget) -> None:
FILE: test/in_memory/test_repository.py
class TestInMemoryCacheRepository (line 10) | class TestInMemoryCacheRepository:
method repo (line 12) | def repo(self) -> Repo:
method test_set (line 17) | def test_set(self, repo: Repo) -> None:
method test_get (line 26) | def test_get(self, repo: Repo) -> None:
method test_empty (line 37) | def test_empty(self, repo: Repo) -> None:
method test_has (line 42) | def test_has(self, repo: Repo) -> None:
method test_remove (line 48) | def test_remove(self, repo: Repo) -> None:
method test_last_modification_time (line 57) | def test_last_modification_time(self, repo: Repo) -> None:
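The repository tests above (set/get/has/remove/empty/last_modification_time) suggest a process-local key-value cache with per-entry timestamps. A stdlib-only sketch of that shape, with method names guessed from the test names rather than taken from gokart's actual `InMemoryCacheRepository` API:

```python
from datetime import datetime
from typing import Any


class InMemoryCacheRepository:
    """Process-local cache mapping keys to (value, modification time)."""

    def __init__(self) -> None:
        self._cache: dict[str, tuple[Any, datetime]] = {}

    def set_value(self, key: str, obj: Any) -> None:
        # Each write records a fresh modification timestamp.
        self._cache[key] = (obj, datetime.now())

    def get_value(self, key: str) -> Any:
        return self._cache[key][0]

    def has(self, key: str) -> bool:
        return key in self._cache

    def remove(self, key: str) -> None:
        self._cache.pop(key, None)

    def get_last_modification_time(self, key: str) -> datetime:
        return self._cache[key][1]

    def empty(self) -> bool:
        return not self._cache
```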
FILE: test/slack/test_slack_api.py
function _slack_response (line 15) | def _slack_response(token, data):
class TestSlackAPI (line 21) | class TestSlackAPI(unittest.TestCase):
method test_initialization_with_invalid_token (line 23) | def test_initialization_with_invalid_token(self, patch):
method test_invalid_channel (line 36) | def test_invalid_channel(self, patch):
method test_send_snippet_with_invalid_token (line 53) | def test_send_snippet_with_invalid_token(self, patch):
method test_send (line 76) | def test_send(self, patch):
FILE: test/test_build.py
class _DummyTask (line 26) | class _DummyTask(gokart.TaskOnKart[str]):
method output (line 30) | def output(self):
method run (line 33) | def run(self):
class _DummyTaskTwoOutputs (line 37) | class _DummyTaskTwoOutputs(gokart.TaskOnKart[dict[str, str]]):
method output (line 42) | def output(self):
method run (line 45) | def run(self):
class _DummyFailedTask (line 50) | class _DummyFailedTask(gokart.TaskOnKart[Any]):
method run (line 53) | def run(self):
class _ParallelRunner (line 57) | class _ParallelRunner(gokart.TaskOnKart[str]):
method requires (line 58) | def requires(self):
method run (line 61) | def run(self):
class _LoadRequires (line 65) | class _LoadRequires(gokart.TaskOnKart[str]):
method requires (line 68) | def requires(self):
method run (line 71) | def run(self):
class RunTest (line 76) | class RunTest(unittest.TestCase):
method setUp (line 77) | def setUp(self):
method tearDown (line 85) | def tearDown(self):
method test_build (line 91) | def test_build(self):
method test_build_parallel (line 96) | def test_build_parallel(self):
method test_read_config (line 100) | def test_read_config(self):
method test_build_dict_outputs (line 115) | def test_build_dict_outputs(self):
method test_failed_task (line 124) | def test_failed_task(self):
method test_load_requires (line 128) | def test_load_requires(self):
method test_build_with_child_task_error (line 133) | def test_build_with_child_task_error(self):
class LoggerConfigTest (line 149) | class LoggerConfigTest(unittest.TestCase):
method test_logger_config (line 150) | def test_logger_config(self):
class ProcessTaskInfoTest (line 162) | class ProcessTaskInfoTest(unittest.TestCase):
method test_process_task_info (line 163) | def test_process_task_info(self):
class _FailThreeTimesAndSuccessTask (line 183) | class _FailThreeTimesAndSuccessTask(gokart.TaskOnKart[Any]):
method __init__ (line 184) | def __init__(self, *args, **kwargs):
method run (line 188) | def run(self):
class TestBuildHasLockedTaskException (line 195) | class TestBuildHasLockedTaskException(unittest.TestCase):
method test_build_expo_backoff_when_luigi_failed_due_to_locked_task (line 196) | def test_build_expo_backoff_when_luigi_failed_due_to_locked_task(self):
class TestBuildFailedAndSchedulingFailed (line 200) | class TestBuildFailedAndSchedulingFailed(unittest.TestCase):
method test_build_raises_exception_on_failed_and_scheduling_failed (line 201) | def test_build_raises_exception_on_failed_and_scheduling_failed(self):
method test_build_not_raises_exception_when_success_with_retry (line 218) | def test_build_not_raises_exception_when_success_with_retry(self):
method test_build_not_raises_exception_on_scheduling_failed_only (line 238) | def test_build_not_raises_exception_on_scheduling_failed_only(self):
FILE: test/test_cache_unique_id.py
class _DummyTask (line 11) | class _DummyTask(gokart.TaskOnKart[Any]):
method requires (line 12) | def requires(self):
method run (line 15) | def run(self):
class _DummyTaskDep (line 19) | class _DummyTaskDep(gokart.TaskOnKart[str]):
method run (line 22) | def run(self):
class CacheUniqueIDTest (line 26) | class CacheUniqueIDTest(unittest.TestCase):
method setUp (line 27) | def setUp(self):
method _set_param (line 33) | def _set_param(cls, attr_name: str, param: luigi.Parameter) -> None: ...
method test_cache_unique_id_true (line 39) | def test_cache_unique_id_true(self):
method test_cache_unique_id_false (line 48) | def test_cache_unique_id_false(self):
FILE: test/test_config_params.py
function in_parse (line 11) | def in_parse(cmds, deferred_computation):
class ConfigClass (line 17) | class ConfigClass(luigi.Config):
class Inherited (line 24) | class Inherited(gokart.TaskOnKart[Any]):
class Inherited2 (line 30) | class Inherited2(gokart.TaskOnKart[Any]):
class ChildTask (line 35) | class ChildTask(Inherited):
class ChildTaskWithNewParam (line 39) | class ChildTaskWithNewParam(Inherited):
class ConfigClass2 (line 43) | class ConfigClass2(luigi.Config):
class ChildTaskWithNewConfig (line 48) | class ChildTaskWithNewConfig(Inherited):
class TestInheritsConfigParam (line 52) | class TestInheritsConfigParam(unittest.TestCase):
method test_inherited_params (line 53) | def test_inherited_params(self):
method test_child_task (line 71) | def test_child_task(self):
method test_child_override (line 78) | def test_child_override(self):
FILE: test/test_explicit_bool_parameter.py
function in_parse (line 11) | def in_parse(cmds, deferred_computation):
class WithDefaultTrue (line 16) | class WithDefaultTrue(gokart.TaskOnKart[Any]):
class WithDefaultFalse (line 20) | class WithDefaultFalse(gokart.TaskOnKart[Any]):
class ExplicitParsing (line 24) | class ExplicitParsing(gokart.TaskOnKart[Any]):
method run (line 27) | def run(self):
class TestExplicitBoolParameter (line 31) | class TestExplicitBoolParameter(unittest.TestCase):
method test_bool_default (line 32) | def test_bool_default(self):
method test_parse_param (line 36) | def test_parse_param(self):
method test_missing_parameter (line 42) | def test_missing_parameter(self):
method test_value_error (line 46) | def test_value_error(self):
method test_expected_one_argment_error (line 50) | def test_expected_one_argment_error(self):
FILE: test/test_gcs_config.py
class TestGCSConfig (line 8) | class TestGCSConfig(unittest.TestCase):
method test_get_gcs_client_without_gcs_credential_name (line 9) | def test_get_gcs_client_without_gcs_credential_name(self):
method test_get_gcs_client_with_file_path (line 16) | def test_get_gcs_client_with_file_path(self):
method test_get_gcs_client_with_json (line 26) | def test_get_gcs_client_with_json(self):
FILE: test/test_gcs_obj_metadata_client.py
class _DummyTaskOnKart (line 14) | class _DummyTaskOnKart(gokart.TaskOnKart[str]):
method run (line 17) | def run(self):
class TestGCSObjectMetadataClient (line 21) | class TestGCSObjectMetadataClient(unittest.TestCase):
method setUp (line 22) | def setUp(self):
method test_normalize_labels_not_empty (line 44) | def test_normalize_labels_not_empty(self):
method test_normalize_labels_has_value (line 48) | def test_normalize_labels_has_value(self):
method test_get_patched_obj_metadata_only_task_params (line 60) | def test_get_patched_obj_metadata_only_task_params(self):
method test_get_patched_obj_metadata_only_custom_labels (line 71) | def test_get_patched_obj_metadata_only_custom_labels(self):
method test_get_patched_obj_metadata_with_both_task_params_and_custom_labels (line 80) | def test_get_patched_obj_metadata_with_both_task_params_and_custom_lab...
method test_get_patched_obj_metadata_with_exceeded_size_metadata (line 95) | def test_get_patched_obj_metadata_with_exceeded_size_metadata(self):
method test_get_patched_obj_metadata_with_conflicts (line 106) | def test_get_patched_obj_metadata_with_conflicts(self):
method test_get_patched_obj_metadata_with_required_task_outputs (line 117) | def test_get_patched_obj_metadata_with_required_task_outputs(self):
method test_get_patched_obj_metadata_with_nested_required_task_outputs (line 129) | def test_get_patched_obj_metadata_with_nested_required_task_outputs(se...
method test_adjust_gcs_metadata_limit_size_runtime_error (line 143) | def test_adjust_gcs_metadata_limit_size_runtime_error(self):
class TestGokartTask (line 151) | class TestGokartTask(unittest.TestCase):
method test_mock_target_on_kart (line 153) | def test_mock_target_on_kart(self, mock_get_output_target):
FILE: test/test_info.py
class TestInfo (line 13) | class TestInfo(unittest.TestCase):
method setUp (line 14) | def setUp(self) -> None:
method tearDown (line 19) | def tearDown(self) -> None:
method test_make_tree_info_pending (line 24) | def test_make_tree_info_pending(self):
method test_make_tree_info_complete (line 35) | def test_make_tree_info_complete(self):
method test_make_tree_info_abbreviation (line 47) | def test_make_tree_info_abbreviation(self):
method test_make_tree_info_not_compress (line 65) | def test_make_tree_info_not_compress(self):
method test_make_tree_info_not_compress_ignore_task (line 83) | def test_make_tree_info_not_compress_ignore_task(self):
FILE: test/test_large_data_fram_processor.py
class LargeDataFrameProcessorTest (line 12) | class LargeDataFrameProcessorTest(unittest.TestCase):
method setUp (line 13) | def setUp(self):
method tearDown (line 16) | def tearDown(self):
method test_save_and_load (line 19) | def test_save_and_load(self):
method test_save_and_load_empty (line 28) | def test_save_and_load_empty(self):
FILE: test/test_list_task_instance_parameter.py
class _DummySubTask (line 10) | class _DummySubTask(TaskOnKart[Any]):
class _DummyTask (line 15) | class _DummyTask(TaskOnKart[Any]):
class ListTaskInstanceParameterTest (line 21) | class ListTaskInstanceParameterTest(unittest.TestCase):
method setUp (line 22) | def setUp(self):
method test_serialize_and_parse (line 25) | def test_serialize_and_parse(self):
FILE: test/test_mypy.py
class TestMyMypyPlugin (line 9) | class TestMyMypyPlugin(unittest.TestCase):
method test_plugin_no_issue (line 10) | def test_plugin_no_issue(self):
method test_plugin_invalid_arg (line 40) | def test_plugin_invalid_arg(self):
FILE: test/test_pandas_type_check_framework.py
class TestPandasTypeConfig (line 20) | class TestPandasTypeConfig(PandasTypeConfig):
method type_dict (line 24) | def type_dict(cls) -> dict[str, Any]:
class _DummyFailTask (line 28) | class _DummyFailTask(gokart.TaskOnKart[pd.DataFrame]):
method output (line 32) | def output(self):
method run (line 35) | def run(self):
class _DummyFailWithNoneTask (line 40) | class _DummyFailWithNoneTask(gokart.TaskOnKart[pd.DataFrame]):
method output (line 44) | def output(self):
method run (line 47) | def run(self):
class _DummySuccessTask (line 52) | class _DummySuccessTask(gokart.TaskOnKart[pd.DataFrame]):
method output (line 56) | def output(self):
method run (line 59) | def run(self):
class TestPandasTypeCheckFramework (line 64) | class TestPandasTypeCheckFramework(unittest.TestCase):
method setUp (line 65) | def setUp(self) -> None:
method tearDown (line 72) | def tearDown(self) -> None:
method test_fail_with_gokart_run (line 79) | def test_fail_with_gokart_run(self):
method test_fail (line 84) | def test_fail(self):
method test_fail_with_None (line 88) | def test_fail_with_None(self):
method test_success (line 92) | def test_success(self):
FILE: test/test_pandas_type_config.py
class _DummyPandasTypeConfig (line 14) | class _DummyPandasTypeConfig(PandasTypeConfig):
method type_dict (line 16) | def type_dict(cls) -> dict[str, Any]:
class TestPandasTypeConfig (line 20) | class TestPandasTypeConfig(TestCase):
method test_int_fail (line 21) | def test_int_fail(self):
method test_int_success (line 26) | def test_int_success(self):
method test_datetime_fail (line 30) | def test_datetime_fail(self):
method test_datetime_success (line 35) | def test_datetime_success(self):
method test_array_fail (line 39) | def test_array_fail(self):
method test_array_success (line 44) | def test_array_success(self):
FILE: test/test_restore_task_by_id.py
class _SubDummyTask (line 11) | class _SubDummyTask(gokart.TaskOnKart[str]):
method run (line 15) | def run(self):
class _DummyTask (line 19) | class _DummyTask(gokart.TaskOnKart[str]):
method output (line 23) | def output(self):
method run (line 26) | def run(self):
class RestoreTaskByIDTest (line 30) | class RestoreTaskByIDTest(unittest.TestCase):
method setUp (line 31) | def setUp(self) -> None:
method test (line 35) | def test(self):
FILE: test/test_run.py
class _DummyTask (line 13) | class _DummyTask(gokart.TaskOnKart[Any]):
class RunTest (line 18) | class RunTest(unittest.TestCase):
method setUp (line 19) | def setUp(self):
method test_run (line 25) | def test_run(self):
method test_run_with_undefined_environ (line 34) | def test_run_with_undefined_environ(self):
method test_run_tree_info (line 54) | def test_run_tree_info(self):
method test_try_to_send_event_summary_to_slack (line 64) | def test_try_to_send_event_summary_to_slack(self, make_tree_info_mock:...
FILE: test/test_s3_config.py
class TestS3Config (line 6) | class TestS3Config(unittest.TestCase):
method test_get_same_s3_client (line 7) | def test_get_same_s3_client(self):
FILE: test/test_s3_zip_client.py
class TestS3ZipClient (line 12) | class TestS3ZipClient(unittest.TestCase):
method setUp (line 13) | def setUp(self):
method tearDown (line 16) | def tearDown(self):
method test_make_archive (line 24) | def test_make_archive(self):
method test_unpack_archive (line 41) | def test_unpack_archive(self):
FILE: test/test_serializable_parameter.py
class Config (line 16) | class Config:
method gokart_serialize (line 20) | def gokart_serialize(self) -> str:
method gokart_deserialize (line 25) | def gokart_deserialize(cls, s: str) -> 'Config':
class SerializableParameterWithOutDefault (line 29) | class SerializableParameterWithOutDefault(TaskOnKart[Any]):
method run (line 33) | def run(self):
class SerializableParameterWithDefault (line 37) | class SerializableParameterWithDefault(TaskOnKart[Any]):
method run (line 41) | def run(self):
class TestSerializableParameter (line 45) | class TestSerializableParameter:
method test_default (line 46) | def test_default(self):
method test_parse_param (line 50) | def test_parse_param(self):
method test_missing_parameter (line 54) | def test_missing_parameter(self):
method test_value_error (line 59) | def test_value_error(self):
method test_expected_one_argument_error (line 64) | def test_expected_one_argument_error(self):
method test_mypy (line 69) | def test_mypy(self):
FILE: test/test_target.py
class LocalTargetTest (line 19) | class LocalTargetTest(unittest.TestCase):
method setUp (line 20) | def setUp(self):
method tearDown (line 23) | def tearDown(self):
method test_save_and_load_pickle_file (line 26) | def test_save_and_load_pickle_file(self):
method test_save_and_load_text_file (line 38) | def test_save_and_load_text_file(self):
method test_save_and_load_gzip (line 48) | def test_save_and_load_gzip(self):
method test_save_and_load_npz (line 58) | def test_save_and_load_npz(self):
method test_save_and_load_figure (line 67) | def test_save_and_load_figure(self):
method test_save_and_load_csv (line 79) | def test_save_and_load_csv(self):
method test_save_and_load_tsv (line 89) | def test_save_and_load_tsv(self):
method test_save_and_load_parquet (line 99) | def test_save_and_load_parquet(self):
method test_save_and_load_feather (line 109) | def test_save_and_load_feather(self):
method test_save_and_load_feather_without_store_index_in_feather (line 119) | def test_save_and_load_feather_without_store_index_in_feather(self):
method test_last_modified_time (line 129) | def test_last_modified_time(self):
method test_last_modified_time_without_file (line 138) | def test_last_modified_time_without_file(self):
method test_save_pandas_series (line 144) | def test_save_pandas_series(self):
method test_dump_with_lock (line 154) | def test_dump_with_lock(self):
method test_dump_without_lock (line 163) | def test_dump_without_lock(self):
class S3TargetTest (line 173) | class S3TargetTest(unittest.TestCase):
method test_save_on_s3 (line 175) | def test_save_on_s3(self):
method test_last_modified_time (line 189) | def test_last_modified_time(self):
method test_last_modified_time_without_file (line 202) | def test_last_modified_time_without_file(self):
method test_save_on_s3_feather (line 212) | def test_save_on_s3_feather(self):
method test_save_on_s3_parquet (line 226) | def test_save_on_s3_parquet(self):
class ModelTargetTest (line 240) | class ModelTargetTest(unittest.TestCase):
method setUp (line 241) | def setUp(self):
method tearDown (line 244) | def tearDown(self):
method _save_function (line 248) | def _save_function(obj, path):
method _load_function (line 252) | def _load_function(path):
method test_model_target_on_local (line 255) | def test_model_target_on_local(self):
method test_model_target_on_s3 (line 269) | def test_model_target_on_s3(self):
FILE: test/test_task_instance_parameter.py
class _DummySubTask (line 10) | class _DummySubTask(TaskOnKart[Any]):
class _DummyCorrectSubClassTask (line 15) | class _DummyCorrectSubClassTask(_DummySubTask):
class _DummyInvalidSubClassTask (line 20) | class _DummyInvalidSubClassTask(TaskOnKart[Any]):
class _DummyTask (line 25) | class _DummyTask(TaskOnKart[Any]):
class _DummyListTask (line 31) | class _DummyListTask(TaskOnKart[Any]):
class TaskInstanceParameterTest (line 37) | class TaskInstanceParameterTest(unittest.TestCase):
method setUp (line 38) | def setUp(self):
method test_serialize_and_parse (line 41) | def test_serialize_and_parse(self):
method test_serialize_and_parse_list_params (line 47) | def test_serialize_and_parse_list_params(self):
method test_invalid_class (line 53) | def test_invalid_class(self):
method test_params_with_correct_param_type (line 56) | def test_params_with_correct_param_type(self):
method test_params_with_invalid_param_type (line 64) | def test_params_with_invalid_param_type(self):
class ListTaskInstanceParameterTest (line 73) | class ListTaskInstanceParameterTest(unittest.TestCase):
method setUp (line 74) | def setUp(self):
method test_invalid_class (line 77) | def test_invalid_class(self):
method test_list_params_with_correct_param_types (line 80) | def test_list_params_with_correct_param_types(self):
method test_list_params_with_invalid_param_types (line 88) | def test_list_params_with_invalid_param_types(self):
FILE: test/test_task_on_kart.py
class _DummyTask (line 22) | class _DummyTask(gokart.TaskOnKart[Any]):
method output (line 28) | def output(self):
class _DummyTaskA (line 32) | class _DummyTaskA(gokart.TaskOnKart[Any]):
method output (line 35) | def output(self):
class _DummyTaskB (line 40) | class _DummyTaskB(gokart.TaskOnKart[Any]):
method output (line 43) | def output(self):
method requires (line 46) | def requires(self):
class _DummyTaskC (line 51) | class _DummyTaskC(gokart.TaskOnKart[Any]):
method output (line 54) | def output(self):
method requires (line 57) | def requires(self):
class _DummyTaskD (line 61) | class _DummyTaskD(gokart.TaskOnKart[Any]):
class _DummyTaskWithoutLock (line 65) | class _DummyTaskWithoutLock(gokart.TaskOnKart[Any]):
method run (line 68) | def run(self):
class _DummySubTaskWithPrivateParameter (line 72) | class _DummySubTaskWithPrivateParameter(gokart.TaskOnKart[Any]):
class _DummyTaskWithPrivateParameter (line 76) | class _DummyTaskWithPrivateParameter(gokart.TaskOnKart[Any]):
class TaskTest (line 84) | class TaskTest(unittest.TestCase):
method setUp (line 85) | def setUp(self):
method test_complete_without_dependency (line 91) | def test_complete_without_dependency(self):
method test_complete_with_rerun_flag (line 95) | def test_complete_with_rerun_flag(self):
method test_complete_with_uncompleted_input (line 100) | def test_complete_with_uncompleted_input(self):
method test_complete_with_modified_input (line 113) | def test_complete_with_modified_input(self):
method test_complete_when_modification_time_equals_output (line 132) | def test_complete_when_modification_time_equals_output(self):
method test_complete_when_input_and_output_equal (line 149) | def test_complete_when_input_and_output_equal(self):
method test_default_target (line 175) | def test_default_target(self):
method test_clone_with_special_params (line 181) | def test_clone_with_special_params(self):
method test_default_large_dataframe_target (line 193) | def test_default_large_dataframe_target(self):
method test_make_target (line 200) | def test_make_target(self):
method test_make_target_without_id (line 205) | def test_make_target_without_id(self):
method test_make_target_with_processor (line 209) | def test_make_target_with_processor(self):
method test_get_own_code (line 217) | def test_get_own_code(self):
method test_make_unique_id_with_own_code (line 222) | def test_make_unique_id_with_own_code(self):
method test_compare_targets_of_different_tasks (line 245) | def test_compare_targets_of_different_tasks(self):
method test_make_model_target (line 250) | def test_make_model_target(self):
method test_load_with_single_target (line 255) | def test_load_with_single_target(self):
method test_load_with_single_dict_target (line 265) | def test_load_with_single_dict_target(self):
method test_load_with_keyword (line 275) | def test_load_with_keyword(self):
method test_load_tuple (line 285) | def test_load_tuple(self):
method test_load_dictionary_at_once (line 299) | def test_load_dictionary_at_once(self):
method test_load_with_task_on_kart (line 313) | def test_load_with_task_on_kart(self):
method test_load_with_task_on_kart_should_fail_when_task_on_kart_is_not_in_requires (line 328) | def test_load_with_task_on_kart_should_fail_when_task_on_kart_is_not_i...
method test_load_with_task_on_kart_list (line 342) | def test_load_with_task_on_kart_list(self):
method test_load_generator_with_single_target (line 364) | def test_load_generator_with_single_target(self):
method test_load_generator_with_keyword (line 372) | def test_load_generator_with_keyword(self):
method test_load_generator_with_list_task_on_kart (line 380) | def test_load_generator_with_list_task_on_kart(self):
method test_dump (line 402) | def test_dump(self):
method test_fail_on_empty_dump (line 410) | def test_fail_on_empty_dump(self):
method test_add_configuration (line 423) | def test_add_configuration(self, mock_config: Mock) -> None:
method test_add_cofigureation_evaluation_order (line 432) | def test_add_cofigureation_evaluation_order(self, mock_cmdline: Mock) ...
method test_use_rerun_with_inherits (line 448) | def test_use_rerun_with_inherits(self):
method test_significant_flag (line 467) | def test_significant_flag(self) -> None:
method test_default_requires (line 489) | def test_default_requires(self):
method test_repr (line 503) | def test_repr(self):
method test_str (line 518) | def test_str(self):
method test_is_task_on_kart (line 533) | def test_is_task_on_kart(self):
method test_serialize_and_deserialize_default_values (line 539) | def test_serialize_and_deserialize_default_values(self):
method test_to_str_params_changes_on_values_and_flags (line 544) | def test_to_str_params_changes_on_values_and_flags(self):
method test_should_lock_run_when_set (line 555) | def test_should_lock_run_when_set(self):
method test_should_fail_lock_run_when_host_unset (line 563) | def test_should_fail_lock_run_when_host_unset(self):
method test_should_fail_lock_run_when_port_unset (line 567) | def test_should_fail_lock_run_when_port_unset(self):
class _DummyTaskWithNonCompleted (line 572) | class _DummyTaskWithNonCompleted(gokart.TaskOnKart[Any]):
method dump (line 573) | def dump(self, _obj: Any, _target: Any = None, _custom_labels: Any = N...
method run (line 577) | def run(self):
method complete (line 580) | def complete(self):
class _DummyTaskWithCompleted (line 584) | class _DummyTaskWithCompleted(gokart.TaskOnKart[Any]):
method dump (line 585) | def dump(self, obj: Any, _target: Any = None, custom_labels: Any = Non...
method run (line 589) | def run(self):
method complete (line 592) | def complete(self):
class TestCompleteCheckAtRun (line 596) | class TestCompleteCheckAtRun(unittest.TestCase):
method test_run_when_complete_check_at_run_is_false_and_task_is_not_completed (line 597) | def test_run_when_complete_check_at_run_is_false_and_task_is_not_compl...
method test_run_when_complete_check_at_run_is_false_and_task_is_completed (line 605) | def test_run_when_complete_check_at_run_is_false_and_task_is_completed...
method test_run_when_complete_check_at_run_is_true_and_task_is_not_completed (line 613) | def test_run_when_complete_check_at_run_is_true_and_task_is_not_comple...
method test_run_when_complete_check_at_run_is_true_and_task_is_completed (line 621) | def test_run_when_complete_check_at_run_is_true_and_task_is_completed(...
FILE: test/test_utils.py
class TestFlatten (line 21) | class TestFlatten(unittest.TestCase):
method test_flatten_dict (line 22) | def test_flatten_dict(self):
method test_flatten_list (line 25) | def test_flatten_list(self):
method test_flatten_str (line 28) | def test_flatten_str(self):
method test_flatten_int (line 31) | def test_flatten_int(self):
method test_flatten_none (line 34) | def test_flatten_none(self):
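The `TestFlatten` cases above (dict, list, str, int, None) exercise `gokart.utils.flatten`, which normalizes an arbitrarily nested `requires()` structure into a flat list of leaf values. A minimal self-contained sketch of the semantics those test names imply — a hypothetical reimplementation for illustration, not gokart's actual source:

```python
from typing import Any


def flatten(struct: Any) -> list:
    """Flatten nested dicts/lists into a flat list of leaf values.

    Sketch of the behavior suggested by TestFlatten: dict values are
    flattened recursively, strings are kept whole, None yields [].
    """
    if struct is None:
        return []
    if isinstance(struct, dict):
        flat: list = []
        for value in struct.values():
            flat += flatten(value)
        return flat
    if isinstance(struct, str):
        # Strings are iterable, but the tests expect them treated as atoms.
        return [struct]
    try:
        iterator = iter(struct)
    except TypeError:
        return [struct]  # non-iterable scalar, e.g. an int
    flat = []
    for item in iterator:
        flat += flatten(item)
    return flat


print(flatten({'a': [1, 2], 'b': 3}))  # → [1, 2, 3]
print(flatten(None))                   # → []
```

This shape mirrors how luigi-style frameworks accept a single task, a list, or a dict from `requires()` and treat them uniformly downstream.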
class TestMapFlatten (line 38) | class TestMapFlatten(unittest.TestCase):
method test_map_flattenable_items (line 39) | def test_map_flattenable_items(self):
class TestGetDataFrameTypeFromTask (line 54) | class TestGetDataFrameTypeFromTask(unittest.TestCase):
method test_pandas_dataframe_from_instance (line 57) | def test_pandas_dataframe_from_instance(self):
method test_pandas_dataframe_from_class (line 66) | def test_pandas_dataframe_from_class(self):
method test_polars_dataframe_from_instance (line 75) | def test_polars_dataframe_from_instance(self):
method test_polars_dataframe_from_class (line 85) | def test_polars_dataframe_from_class(self):
method test_no_type_parameter_defaults_to_pandas (line 93) | def test_no_type_parameter_defaults_to_pandas(self):
method test_non_taskonkart_class_defaults_to_pandas (line 103) | def test_non_taskonkart_class_defaults_to_pandas(self):
method test_taskonkart_with_non_dataframe_type (line 112) | def test_taskonkart_with_non_dataframe_type(self):
method test_nested_inheritance_pandas (line 122) | def test_nested_inheritance_pandas(self):
method test_nested_inheritance_polars (line 137) | def test_nested_inheritance_polars(self):
method test_polars_lazyframe_from_instance (line 151) | def test_polars_lazyframe_from_instance(self):
method test_polars_lazyframe_from_class (line 159) | def test_polars_lazyframe_from_class(self):
method test_nested_inheritance_polars_lazyframe (line 166) | def test_nested_inheritance_polars_lazyframe(self):
method test_nested_inheritance_polars_with_mixin (line 177) | def test_nested_inheritance_polars_with_mixin(self):
FILE: test/test_worker.py
class _DummyTask (line 13) | class _DummyTask(gokart.TaskOnKart[str]):
method _run (line 17) | def _run(self): ...
method run (line 19) | def run(self):
class TestWorkerRun (line 24) | class TestWorkerRun:
method test_run (line 25) | def test_run(self, monkeypatch: pytest.MonkeyPatch) -> None:
class _DummyTaskToCheckSkip (line 39) | class _DummyTaskToCheckSkip(gokart.TaskOnKart[None]):
method _run (line 42) | def _run(self): ...
method run (line 44) | def run(self):
method complete (line 48) | def complete(self) -> bool:
class TestWorkerSkipIfCompletedPreRun (line 52) | class TestWorkerSkipIfCompletedPreRun:
method test_skip_task (line 62) | def test_skip_task(self, monkeypatch: pytest.MonkeyPatch, task_complet...
class TestWorkerCheckCompleteValue (line 85) | class TestWorkerCheckCompleteValue:
method test_does_not_raise_for_boolean_values (line 86) | def test_does_not_raise_for_boolean_values(self) -> None:
method test_raises_async_completion_exception_for_traceback_wrapper (line 91) | def test_raises_async_completion_exception_for_traceback_wrapper(self)...
method test_raises_exception_for_non_boolean_value (line 99) | def test_raises_exception_for_non_boolean_value(self) -> None:
FILE: test/test_zoned_date_second_parameter.py
class ZonedDateSecondParameterTaskWithoutDefault (line 9) | class ZonedDateSecondParameterTaskWithoutDefault(TaskOnKart[datetime.dat...
method run (line 13) | def run(self):
class ZonedDateSecondParameterTaskWithDefault (line 17) | class ZonedDateSecondParameterTaskWithDefault(TaskOnKart[datetime.dateti...
method run (line 23) | def run(self):
class ZonedDateSecondParameterTest (line 27) | class ZonedDateSecondParameterTest(unittest.TestCase):
method setUp (line 28) | def setUp(self):
method test_default (line 32) | def test_default(self):
method test_parse_param_with_tz_suffix (line 36) | def test_parse_param_with_tz_suffix(self):
method test_parse_param_with_Z_suffix (line 40) | def test_parse_param_with_Z_suffix(self):
method test_parse_param_without_timezone_input (line 44) | def test_parse_param_without_timezone_input(self):
method test_parse_method (line 48) | def test_parse_method(self):
method test_serialize_task (line 53) | def test_serialize_task(self):
FILE: test/testing/test_pandas_assert.py
class TestPandasAssert (line 8) | class TestPandasAssert(unittest.TestCase):
method test_assert_frame_contents_equal (line 9) | def test_assert_frame_contents_equal(self):
method test_assert_frame_contents_equal_with_small_error (line 15) | def test_assert_frame_contents_equal_with_small_error(self):
method test_assert_frame_contents_equal_with_duplicated_columns (line 21) | def test_assert_frame_contents_equal_with_duplicated_columns(self):
method test_assert_frame_contents_equal_with_duplicated_indexes (line 30) | def test_assert_frame_contents_equal_with_duplicated_indexes(self):
FILE: test/tree/test_task_info.py
class _SubTask (line 15) | class _SubTask(gokart.TaskOnKart[str]):
method output (line 19) | def output(self):
method run (line 22) | def run(self):
class _Task (line 26) | class _Task(gokart.TaskOnKart[str]):
method requires (line 31) | def requires(self):
method output (line 34) | def output(self):
method run (line 37) | def run(self):
class _DoubleLoadSubTask (line 41) | class _DoubleLoadSubTask(gokart.TaskOnKart[str]):
method output (line 46) | def output(self):
method run (line 49) | def run(self):
class TestInfo (line 53) | class TestInfo(unittest.TestCase):
method setUp (line 54) | def setUp(self) -> None:
method tearDown (line 59) | def tearDown(self) -> None:
method test_make_tree_info_pending (line 64) | def test_make_tree_info_pending(self):
method test_make_tree_info_complete (line 75) | def test_make_tree_info_complete(self):
method test_make_tree_info_abbreviation (line 87) | def test_make_tree_info_abbreviation(self):
method test_make_tree_info_not_compress (line 105) | def test_make_tree_info_not_compress(self):
method test_make_tree_info_not_compress_ignore_task (line 123) | def test_make_tree_info_not_compress_ignore_task(self):
method test_make_tree_info_with_cache (line 137) | def test_make_tree_info_with_cache(self):
class _TaskInfoExampleTaskA (line 148) | class _TaskInfoExampleTaskA(gokart.TaskOnKart[Any]):
class _TaskInfoExampleTaskB (line 152) | class _TaskInfoExampleTaskB(gokart.TaskOnKart[Any]):
class _TaskInfoExampleTaskC (line 156) | class _TaskInfoExampleTaskC(gokart.TaskOnKart[str]):
method requires (line 159) | def requires(self):
method run (line 162) | def run(self):
class TestTaskInfoTable (line 166) | class TestTaskInfoTable(unittest.TestCase):
method test_dump_task_info_table (line 167) | def test_dump_task_info_table(self):
class TestTaskInfoTree (line 183) | class TestTaskInfoTree(unittest.TestCase):
method test_dump_task_info_tree (line 184) | def test_dump_task_info_tree(self):
method test_dump_task_info_tree_with_invalid_path_extention (line 201) | def test_dump_task_info_tree_with_invalid_path_extention(self):
FILE: test/tree/test_task_info_formatter.py
class _RequiredTaskExampleTaskA (line 8) | class _RequiredTaskExampleTaskA(gokart.TaskOnKart[Any]):
class TestMakeRequiresInfo (line 12) | class TestMakeRequiresInfo(unittest.TestCase):
method test_make_requires_info_with_task_on_kart (line 13) | def test_make_requires_info_with_task_on_kart(self):
method test_make_requires_info_with_list (line 19) | def test_make_requires_info_with_list(self):
method test_make_requires_info_with_generator (line 25) | def test_make_requires_info_with_generator(self):
method test_make_requires_info_with_dict (line 33) | def test_make_requires_info_with_dict(self):
method test_make_requires_info_with_invalid (line 39) | def test_make_requires_info_with_invalid(self):
FILE: test/util.py
function _get_temporary_directory (line 6) | def _get_temporary_directory():
Condensed preview — 125 files, each showing path, character count, and a content snippet.
[
{
"path": ".github/CODEOWNERS",
"chars": 85,
"preview": "* @Hi-king @yokomotod @hirosassa @mski-iksm @kitagry @ujiuji1259 @mamo3gr @hiro-o918\n"
},
{
"path": ".github/workflows/format.yml",
"chars": 530,
"preview": "name: Lint\n\non:\n push:\n branches: [ master ]\n pull_request:\n\n\njobs:\n formatting-check:\n\n name: Lint\n "
},
{
"path": ".github/workflows/publish.yml",
"chars": 393,
"preview": "name: Publish\n\non:\n push:\n tags: '*'\n\njobs:\n deploy:\n\n runs-on: ubuntu-latest\n\n steps:\n - uses: actions/ch"
},
{
"path": ".github/workflows/test.yml",
"chars": 790,
"preview": "name: Test\n\non:\n push:\n branches: [ master ]\n pull_request:\n\njobs:\n tests:\n runs-on: ${{ matrix.platform }}\n "
},
{
"path": ".gitignore",
"chars": 1305,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": ".readthedocs.yaml",
"chars": 590,
"preview": "# Read the Docs configuration file for Sphinx projects\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html f"
},
{
"path": "LICENSE",
"chars": 1065,
"preview": "MIT License\n\nCopyright (c) 2018 M3, Inc.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\no"
},
{
"path": "README.md",
"chars": 5314,
"preview": "# gokart\n\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/m3dev/gokart/master/docs/gokart_logo_side_iso"
},
{
"path": "docs/Makefile",
"chars": 580,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHI"
},
{
"path": "docs/conf.py",
"chars": 5345,
"preview": "# https://github.com/sphinx-doc/sphinx/issues/6211\nimport luigi\n\nimport gokart\n\nluigi.task.Task.requires.__doc__ = gokar"
},
{
"path": "docs/efficient_run_on_multi_workers.rst",
"chars": 1564,
"preview": "How to improve efficiency when running on multiple workers\n===========================================================\n\n"
},
{
"path": "docs/for_pandas.rst",
"chars": 2187,
"preview": "For Pandas\n==========\n\nGokart has several features for Pandas.\n\n\nPandas Type Config\n------------------\n\nPandas has a fea"
},
{
"path": "docs/gokart.rst",
"chars": 1430,
"preview": "gokart package\n==============\n\nSubmodules\n----------\n\ngokart.file\\_processor module\n-----------------------------\n\n.. au"
},
{
"path": "docs/index.rst",
"chars": 2425,
"preview": ".. gokart documentation master file, created by\n sphinx-quickstart on Fri Jan 11 07:59:25 2019.\n You can adapt this "
},
{
"path": "docs/intro_to_gokart.rst",
"chars": 4375,
"preview": "Intro To Gokart\n===============\n\n\nInstallation\n------------\n\nWithin the activated Python environment, use the following "
},
{
"path": "docs/logging.rst",
"chars": 1392,
"preview": "Logging\n=======\n\nHow to set up a common logger for gokart.\n\n\nCore settings\n-------------\n\nPlease write a configuration f"
},
{
"path": "docs/make.bat",
"chars": 787,
"preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
},
{
"path": "docs/mypy_plugin.rst",
"chars": 2036,
"preview": "[Experimental] Mypy plugin\n===========================\n\nMypy plugin provides type checking for gokart tasks using Mypy.\n"
},
{
"path": "docs/polars.rst",
"chars": 6039,
"preview": "Polars Support\n==============\n\nGokart supports Polars DataFrames alongside pandas DataFrames for DataFrame-based file pr"
},
{
"path": "docs/requirements.txt",
"chars": 31,
"preview": "Sphinx\ngokart\nsphinx-rtd-theme\n"
},
{
"path": "docs/setting_task_parameters.rst",
"chars": 2987,
"preview": "============================\nSetting Task Parameters\n============================\n\nThere are several ways to set task pa"
},
{
"path": "docs/slack_notification.rst",
"chars": 1027,
"preview": "Slack notification\n=========================\n\nPrerequisites\n-------------\n\nPrepare following environmental variables:\n\n."
},
{
"path": "docs/task_information.rst",
"chars": 9348,
"preview": "Task Information\n================\n\nThere are 6 ways to print the significant parameters and state of the task and its de"
},
{
"path": "docs/task_on_kart.rst",
"chars": 9228,
"preview": "TaskOnKart\n==========\n``TaskOnKart`` inherits ``luigi.Task``, and has functions to make it easy to define tasks.\nPlease "
},
{
"path": "docs/task_parameters.rst",
"chars": 3678,
"preview": "=================\nTask Parameters\n=================\n\nLuigi Parameter\n================\n\nWe can set parameters for tasks.\n"
},
{
"path": "docs/task_settings.rst",
"chars": 3869,
"preview": "Task Settings\n=============\n\nTask settings. Also please refer to :doc:`task_parameters` section.\n\n\nDirectory to Save Out"
},
{
"path": "docs/tutorial.rst",
"chars": 6728,
"preview": "Tutorial\n========\n\nAlso please refer to :doc:`intro_to_gokart` section.\n\n\n1, Make gokart project\n----------------------\n"
},
{
"path": "docs/using_task_task_conflict_prevention_lock.rst",
"chars": 1507,
"preview": "Task conflict prevention lock\n=========================\n\nIf there is a possibility of multiple worker nodes executing th"
},
{
"path": "examples/gokart_notebook_example.ipynb",
"chars": 4849,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [\n {\n \"name\":"
},
{
"path": "examples/logging.ini",
"chars": 698,
"preview": "[loggers]\nkeys=root,luigi,luigi-interface,gokart\n\n[handlers]\nkeys=stderrHandler\n\n[formatters]\nkeys=simpleFormatter\n\n[log"
},
{
"path": "examples/param.ini",
"chars": 124,
"preview": "[TaskOnKart]\nworkspace_directory=./resource\nlocal_temporary_directory=./resource/tmp\n\n[core]\nlogging_conf_file=logging.i"
},
{
"path": "gokart/__init__.py",
"chars": 1013,
"preview": "__all__ = [\n 'build',\n 'WorkerSchedulerFactory',\n 'make_tree_info',\n 'tree_info',\n 'PandasTypeConfig',\n "
},
{
"path": "gokart/build.py",
"chars": 8150,
"preview": "from __future__ import annotations\n\nimport enum\nimport io\nimport logging\nfrom dataclasses import dataclass\nfrom functool"
},
{
"path": "gokart/config_params.py",
"chars": 1587,
"preview": "from __future__ import annotations\n\nfrom typing import Any\n\nimport luigi\n\nimport gokart\n\n\nclass inherits_config_params:\n"
},
{
"path": "gokart/conflict_prevention_lock/task_lock.py",
"chars": 4549,
"preview": "from __future__ import annotations\n\nimport functools\nimport os\nfrom logging import getLogger\nfrom typing import Any, Nam"
},
{
"path": "gokart/conflict_prevention_lock/task_lock_wrappers.py",
"chars": 4316,
"preview": "from __future__ import annotations\n\nimport functools\nfrom collections.abc import Callable\nfrom logging import getLogger\n"
},
{
"path": "gokart/errors/__init__.py",
"chars": 273,
"preview": "from gokart.build import GokartBuildError, HasLockedTaskException\nfrom gokart.pandas_type_config import PandasTypeError\n"
},
{
"path": "gokart/file_processor/__init__.py",
"chars": 7891,
"preview": "\"\"\"File processor module with support for multiple DataFrame backends.\"\"\"\n\nfrom __future__ import annotations\n\nimport os"
},
{
"path": "gokart/file_processor/base.py",
"chars": 4652,
"preview": "from __future__ import annotations\n\nimport xml.etree.ElementTree as ET\nfrom abc import abstractmethod\nfrom io import Byt"
},
{
"path": "gokart/file_processor/pandas.py",
"chars": 5757,
"preview": "\"\"\"Pandas-specific file processor implementations.\"\"\"\n\nfrom __future__ import annotations\n\nfrom io import BytesIO\nfrom t"
},
{
"path": "gokart/file_processor/polars.py",
"chars": 6666,
"preview": "\"\"\"Polars-specific file processor implementations.\"\"\"\n\nfrom __future__ import annotations\n\nfrom io import BytesIO\nfrom t"
},
{
"path": "gokart/file_processor.py",
"chars": 0,
"preview": ""
},
{
"path": "gokart/gcs_config.py",
"chars": 1127,
"preview": "from __future__ import annotations\n\nimport json\nimport os\nfrom typing import cast\n\nimport luigi\nimport luigi.contrib.gcs"
},
{
"path": "gokart/gcs_obj_metadata_client.py",
"chars": 8600,
"preview": "from __future__ import annotations\n\nimport copy\nimport functools\nimport json\nimport re\nfrom collections.abc import Itera"
},
{
"path": "gokart/gcs_zip_client.py",
"chars": 1452,
"preview": "from __future__ import annotations\n\nimport os\nimport shutil\nfrom typing import cast\n\nfrom gokart.gcs_config import GCSCo"
},
{
"path": "gokart/in_memory/__init__.py",
"chars": 203,
"preview": "__all__ = [\n 'InMemoryCacheRepository',\n 'InMemoryTarget',\n 'make_in_memory_target',\n]\n\nfrom .repository import"
},
{
"path": "gokart/in_memory/data.py",
"chars": 361,
"preview": "from __future__ import annotations\n\nfrom dataclasses import dataclass\nfrom datetime import datetime\nfrom typing import A"
},
{
"path": "gokart/in_memory/repository.py",
"chars": 1215,
"preview": "from __future__ import annotations\n\nfrom collections.abc import Iterator\nfrom datetime import datetime\nfrom typing impor"
},
{
"path": "gokart/in_memory/target.py",
"chars": 1879,
"preview": "from __future__ import annotations\n\nfrom datetime import datetime\nfrom typing import Any\n\nfrom gokart.in_memory.reposito"
},
{
"path": "gokart/info.py",
"chars": 1596,
"preview": "from __future__ import annotations\n\nfrom logging import getLogger\nfrom typing import Any\n\nimport luigi\n\nfrom gokart.task"
},
{
"path": "gokart/mypy.py",
"chars": 24412,
"preview": "\"\"\"Plugin that provides support for gokart.TaskOnKart.\n\nThis Code reuses the code from mypy.plugins.dataclasses\nhttps://"
},
{
"path": "gokart/object_storage.py",
"chars": 2522,
"preview": "from __future__ import annotations\n\nfrom datetime import datetime\nfrom typing import cast\n\nimport luigi\nimport luigi.con"
},
{
"path": "gokart/pandas_type_config.py",
"chars": 1896,
"preview": "from __future__ import annotations\n\nfrom abc import abstractmethod\nfrom logging import getLogger\nfrom typing import Any\n"
},
{
"path": "gokart/parameter.py",
"chars": 6681,
"preview": "from __future__ import annotations\n\nimport bz2\nimport datetime\nimport json\nimport sys\nfrom logging import getLogger\nfrom"
},
{
"path": "gokart/py.typed",
"chars": 0,
"preview": ""
},
{
"path": "gokart/required_task_output.py",
"chars": 253,
"preview": "from dataclasses import dataclass\n\n\n@dataclass\nclass RequiredTaskOutput:\n task_name: str\n output_path: str\n\n de"
},
{
"path": "gokart/run.py",
"chars": 5944,
"preview": "from __future__ import annotations\n\nimport logging\nimport os\nimport sys\nfrom logging import getLogger\nfrom typing import"
},
{
"path": "gokart/s3_config.py",
"chars": 870,
"preview": "from __future__ import annotations\n\nimport os\n\nimport luigi\nimport luigi.contrib.s3\n\n\nclass S3Config(luigi.Config):\n "
},
{
"path": "gokart/s3_zip_client.py",
"chars": 1729,
"preview": "from __future__ import annotations\n\nimport os\nimport shutil\nfrom typing import cast\n\nfrom gokart.s3_config import S3Conf"
},
{
"path": "gokart/slack/__init__.py",
"chars": 408,
"preview": "from gokart.slack.event_aggregator import EventAggregator\nfrom gokart.slack.slack_api import SlackAPI\nfrom gokart.slack."
},
{
"path": "gokart/slack/event_aggregator.py",
"chars": 1739,
"preview": "from __future__ import annotations\n\nimport os\nfrom logging import getLogger\nfrom typing import Any, TypedDict\n\nimport lu"
},
{
"path": "gokart/slack/slack_api.py",
"chars": 1852,
"preview": "from __future__ import annotations\n\nfrom logging import getLogger\n\nimport slack_sdk\n\nlogger = getLogger(__name__)\n\n\nclas"
},
{
"path": "gokart/slack/slack_config.py",
"chars": 721,
"preview": "from __future__ import annotations\n\nimport luigi\n\n\nclass SlackConfig(luigi.Config):\n token_name = luigi.Parameter(def"
},
{
"path": "gokart/target.py",
"chars": 10028,
"preview": "from __future__ import annotations\n\nimport hashlib\nimport os\nimport shutil\nfrom abc import abstractmethod\nfrom datetime "
},
{
"path": "gokart/task.py",
"chars": 28232,
"preview": "from __future__ import annotations\n\nimport functools\nimport hashlib\nimport inspect\nimport os\nimport random\nimport types\n"
},
{
"path": "gokart/task_complete_check.py",
"chars": 581,
"preview": "from __future__ import annotations\n\nimport functools\nfrom collections.abc import Callable\nfrom logging import getLogger\n"
},
{
"path": "gokart/testing/__init__.py",
"chars": 288,
"preview": "__all__ = [\n 'test_run',\n 'try_to_run_test_for_empty_data_frame',\n 'assert_frame_contents_equal',\n]\n\nfrom gokar"
},
{
"path": "gokart/testing/check_if_run_with_empty_data_frame.py",
"chars": 2986,
"preview": "from __future__ import annotations\n\nimport logging\nimport sys\nfrom typing import Any\n\nimport luigi\nfrom luigi.cmdline_pa"
},
{
"path": "gokart/testing/pandas_assert.py",
"chars": 1362,
"preview": "from __future__ import annotations\n\nfrom typing import Any\n\nimport pandas as pd\n\n\ndef assert_frame_contents_equal(actual"
},
{
"path": "gokart/tree/task_info.py",
"chars": 3978,
"preview": "from __future__ import annotations\n\nimport os\nfrom typing import Any\n\nimport pandas as pd\n\nfrom gokart.target import mak"
},
{
"path": "gokart/tree/task_info_formatter.py",
"chars": 4825,
"preview": "from __future__ import annotations\n\nimport typing\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import A"
},
{
"path": "gokart/utils.py",
"chars": 4799,
"preview": "from __future__ import annotations\n\nimport os\nfrom collections.abc import Callable, Iterable\nfrom io import BytesIO\nfrom"
},
{
"path": "gokart/worker.py",
"chars": 50009,
"preview": "#\n# Copyright 2012-2015 Spotify AB\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use"
},
{
"path": "gokart/workspace_management.py",
"chars": 1264,
"preview": "from __future__ import annotations\n\nimport itertools\nimport os\nimport pathlib\nfrom logging import getLogger\nfrom typing "
},
{
"path": "gokart/zip_client.py",
"chars": 1430,
"preview": "from __future__ import annotations\n\nimport os\nimport shutil\nimport zipfile\nfrom abc import abstractmethod\nfrom typing im"
},
{
"path": "gokart/zip_client_util.py",
"chars": 468,
"preview": "from __future__ import annotations\n\nfrom gokart.object_storage import ObjectStorage\nfrom gokart.zip_client import LocalZ"
},
{
"path": "luigi.cfg",
"chars": 31,
"preview": "[core]\n autoload_range: false\n"
},
{
"path": "pyproject.toml",
"chars": 3214,
"preview": "[build-system]\nrequires = [\"hatchling\", \"uv-dynamic-versioning\"]\nbuild-backend = \"hatchling.build\"\n\n[project]\nname = \"go"
},
{
"path": "test/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "test/config/__init__.py",
"chars": 351,
"preview": "from pathlib import Path\nfrom typing import Final\n\nCONFIG_DIR: Final[Path] = Path(__file__).parent.resolve()\nPYPROJECT_T"
},
{
"path": "test/config/pyproject.toml",
"chars": 182,
"preview": "[tool.mypy]\nplugins = [\"gokart.mypy\"]\n\n[[tool.mypy.overrides]]\nignore_missing_imports = true\nmodule = [\"pandas.*\", \"apsc"
},
{
"path": "test/config/pyproject_disallow_missing_parameters.toml",
"chars": 237,
"preview": "[tool.mypy]\nplugins = [\"gokart.mypy\"]\n\n[[tool.mypy.overrides]]\nignore_missing_imports = true\nmodule = [\"pandas.*\", \"apsc"
},
{
"path": "test/config/test_config.ini",
"chars": 98,
"preview": "[test_read_config._DummyTask]\nparam = ${test_param}\n\n[test_build._DummyTask]\nparam = ${test_param}"
},
{
"path": "test/conflict_prevention_lock/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "test/conflict_prevention_lock/test_task_lock.py",
"chars": 3613,
"preview": "import random\nimport unittest\nfrom typing import Any\nfrom unittest.mock import patch\n\nimport gokart\nfrom gokart.conflict"
},
{
"path": "test/conflict_prevention_lock/test_task_lock_wrappers.py",
"chars": 13417,
"preview": "import time\nimport unittest\nfrom unittest.mock import MagicMock, patch\n\nimport fakeredis\n\nfrom gokart.conflict_preventio"
},
{
"path": "test/file_processor/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "test/file_processor/test_base.py",
"chars": 2435,
"preview": "\"\"\"Tests for base file processors (non-DataFrame processors).\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nimport t"
},
{
"path": "test/file_processor/test_factory.py",
"chars": 2219,
"preview": "\"\"\"Tests for file processor factory function.\"\"\"\n\nfrom __future__ import annotations\n\nimport unittest\n\nfrom gokart.file_"
},
{
"path": "test/file_processor/test_pandas.py",
"chars": 6462,
"preview": "\"\"\"Tests for pandas-specific file processors.\"\"\"\n\nfrom __future__ import annotations\n\nimport tempfile\nimport unittest\n\ni"
},
{
"path": "test/file_processor/test_polars.py",
"chars": 22976,
"preview": "\"\"\"Tests for polars-specific file processors.\"\"\"\n\nfrom __future__ import annotations\n\nimport tempfile\nfrom typing import"
},
{
"path": "test/in_memory/test_in_memory_target.py",
"chars": 1790,
"preview": "from datetime import datetime\nfrom time import sleep\n\nimport pytest\n\nfrom gokart.conflict_prevention_lock.task_lock impo"
},
{
"path": "test/in_memory/test_repository.py",
"chars": 1924,
"preview": "import time\n\nimport pytest\n\nfrom gokart.in_memory import InMemoryCacheRepository as Repo\n\ndummy_num = 100\n\n\nclass TestIn"
},
{
"path": "test/slack/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "test/slack/test_slack_api.py",
"chars": 4116,
"preview": "import unittest\nfrom logging import getLogger\nfrom unittest import mock\nfrom unittest.mock import MagicMock\n\nfrom slack_"
},
{
"path": "test/test_build.py",
"chars": 9450,
"preview": "from __future__ import annotations\n\nimport io\nimport logging\nimport os\nimport sys\nimport unittest\nfrom copy import copy\n"
},
{
"path": "test/test_cache_unique_id.py",
"chars": 1906,
"preview": "import os\nimport unittest\nfrom typing import Any\n\nimport luigi\nimport luigi.mock\n\nimport gokart\n\n\nclass _DummyTask(gokar"
},
{
"path": "test/test_config_params.py",
"chars": 2921,
"preview": "import unittest\nfrom typing import Any\n\nimport luigi\nfrom luigi.cmdline_parser import CmdlineParser\n\nimport gokart\nfrom "
},
{
"path": "test/test_explicit_bool_parameter.py",
"chars": 1814,
"preview": "import unittest\nfrom typing import Any\n\nimport luigi\nimport luigi.mock\nfrom luigi.cmdline_parser import CmdlineParser\n\ni"
},
{
"path": "test/test_gcs_config.py",
"chars": 1450,
"preview": "import os\nimport unittest\nfrom unittest.mock import MagicMock, patch\n\nfrom gokart.gcs_config import GCSConfig\n\n\nclass Te"
},
{
"path": "test/test_gcs_obj_metadata_client.py",
"chars": 6284,
"preview": "from __future__ import annotations\n\nimport datetime\nimport unittest\nfrom typing import Any\nfrom unittest.mock import Mag"
},
{
"path": "test/test_info.py",
"chars": 3545,
"preview": "import unittest\nfrom unittest.mock import patch\n\nimport luigi\nimport luigi.mock\nfrom luigi.mock import MockFileSystem, M"
},
{
"path": "test/test_large_data_fram_processor.py",
"chars": 1217,
"preview": "import os\nimport shutil\nimport unittest\n\nimport numpy as np\nimport pandas as pd\n\nfrom gokart.target import LargeDataFram"
},
{
"path": "test/test_list_task_instance_parameter.py",
"chars": 938,
"preview": "import unittest\nfrom typing import Any\n\nimport luigi\n\nimport gokart\nfrom gokart import TaskOnKart\n\n\nclass _DummySubTask("
},
{
"path": "test/test_mypy.py",
"chars": 2649,
"preview": "import tempfile\nimport unittest\n\nfrom mypy import api\n\nfrom test.config import PYPROJECT_TOML\n\n\nclass TestMyMypyPlugin(u"
},
{
"path": "test/test_pandas_type_check_framework.py",
"chars": 3161,
"preview": "from __future__ import annotations\n\nimport logging\nimport unittest\nfrom logging import getLogger\nfrom typing import Any\n"
},
{
"path": "test/test_pandas_type_config.py",
"chars": 1485,
"preview": "from __future__ import annotations\n\nfrom datetime import date, datetime\nfrom typing import Any\nfrom unittest import Test"
},
{
"path": "test/test_restore_task_by_id.py",
"chars": 1168,
"preview": "import unittest\nfrom typing import Any\nfrom unittest.mock import patch\n\nimport luigi\nimport luigi.mock\n\nimport gokart\n\n\n"
},
{
"path": "test/test_run.py",
"chars": 4120,
"preview": "import os\nimport unittest\nfrom typing import Any\nfrom unittest.mock import MagicMock, patch\n\nimport luigi\nimport luigi.m"
},
{
"path": "test/test_s3_config.py",
"chars": 273,
"preview": "import unittest\n\nfrom gokart.s3_config import S3Config\n\n\nclass TestS3Config(unittest.TestCase):\n def test_get_same_s3"
},
{
"path": "test/test_s3_zip_client.py",
"chars": 2089,
"preview": "import os\nimport shutil\nimport unittest\n\nimport boto3\nfrom moto import mock_aws\n\nfrom gokart.s3_zip_client import S3ZipC"
},
{
"path": "test/test_serializable_parameter.py",
"chars": 2922,
"preview": "import json\nimport tempfile\nfrom dataclasses import asdict, dataclass\nfrom typing import Any\n\nimport luigi\nimport pytest"
},
{
"path": "test/test_target.py",
"chars": 9768,
"preview": "import io\nimport os\nimport shutil\nimport unittest\nfrom datetime import datetime\nfrom unittest.mock import patch\n\nimport "
},
{
"path": "test/test_task_instance_parameter.py",
"chars": 3729,
"preview": "import unittest\nfrom typing import Any\n\nimport luigi\n\nimport gokart\nfrom gokart import ListTaskInstanceParameter, TaskIn"
},
{
"path": "test/test_task_on_kart.py",
"chars": 25838,
"preview": "from __future__ import annotations\n\nimport os\nimport pathlib\nimport unittest\nfrom datetime import datetime\nfrom typing i"
},
{
"path": "test/test_utils.py",
"chars": 6637,
"preview": "import unittest\nfrom typing import TYPE_CHECKING\n\nimport pandas as pd\nimport pytest\n\nfrom gokart.task import TaskOnKart\n"
},
{
"path": "test/test_worker.py",
"chars": 4022,
"preview": "import uuid\nfrom unittest.mock import Mock\n\nimport luigi\nimport luigi.worker\nimport pytest\nfrom luigi import scheduler\n\n"
},
{
"path": "test/test_zoned_date_second_parameter.py",
"chars": 2613,
"preview": "import datetime\nimport unittest\n\nfrom luigi.cmdline_parser import CmdlineParser\n\nfrom gokart import TaskOnKart, ZonedDat"
},
{
"path": "test/testing/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "test/testing/test_pandas_assert.py",
"chars": 1816,
"preview": "import unittest\n\nimport pandas as pd\n\nimport gokart\n\n\nclass TestPandasAssert(unittest.TestCase):\n def test_assert_fra"
},
{
"path": "test/tree/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "test/tree/test_task_info.py",
"chars": 7813,
"preview": "from __future__ import annotations\n\nimport unittest\nfrom typing import Any\nfrom unittest.mock import patch\n\nimport luigi"
},
{
"path": "test/tree/test_task_info_formatter.py",
"chars": 1835,
"preview": "import unittest\nfrom typing import Any\n\nimport gokart\nfrom gokart.tree.task_info_formatter import RequiredTask, _make_re"
},
{
"path": "test/util.py",
"chars": 248,
"preview": "import os\nimport uuid\n\n\n# TODO: use pytest.fixture to share this functionality with other tests\ndef _get_temporary_direc"
},
{
"path": "tox.ini",
"chars": 401,
"preview": "[tox]\nenvlist = py{310,311,312,313,314},ruff,mypy\nskipsdist = True\n\n[testenv]\nrunner = uv-venv-lock-runner\ndependency_gr"
}
]