Repository: alteryx/compose
Branch: main
Commit: 87953ceaab69
Files: 96
Total size: 740.1 KB
Directory structure:
gitextract_0qfvetgz/
├── .codecov.yml
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── blank_issue.md
│   │   ├── bug_report.md
│   │   ├── config.yml
│   │   ├── documentation_improvement.md
│   │   └── feature_request.md
│   ├── auto_assign.yml
│   └── workflows/
│       ├── auto_approve_dependency_PRs.yml
│       ├── build_docs.yml
│       ├── create_feedstock_pr.yaml
│       ├── install_test.yml
│       ├── latest_dependency_checker.yml
│       ├── lint_check.yml
│       ├── release.yml
│       ├── release_notes_updated.yml
│       └── unit_tests_with_latest_deps.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .readthedocs.yaml
├── LICENSE
├── Makefile
├── README.md
├── composeml/
│   ├── __init__.py
│   ├── conftest.py
│   ├── data_slice/
│   │   ├── __init__.py
│   │   ├── extension.py
│   │   ├── generator.py
│   │   └── offset.py
│   ├── demos/
│   │   ├── __init__.py
│   │   └── transactions.csv
│   ├── label_maker.py
│   ├── label_search.py
│   ├── label_times/
│   │   ├── __init__.py
│   │   ├── description.py
│   │   ├── deserialize.py
│   │   ├── object.py
│   │   └── plots.py
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── requirement_files/
│   │   │   ├── latest_core_dependencies.txt
│   │   │   ├── minimum_core_requirements.txt
│   │   │   └── minimum_test_requirements.txt
│   │   ├── test_data_slice/
│   │   │   ├── __init__.py
│   │   │   ├── test_extension.py
│   │   │   └── test_offset.py
│   │   ├── test_datasets.py
│   │   ├── test_featuretools.py
│   │   ├── test_label_maker.py
│   │   ├── test_label_plots.py
│   │   ├── test_label_serialization.py
│   │   ├── test_label_times.py
│   │   ├── test_label_transforms/
│   │   │   ├── __init__.py
│   │   │   ├── test_bin.py
│   │   │   ├── test_lead.py
│   │   │   ├── test_sample.py
│   │   │   └── test_threshold.py
│   │   ├── test_version.py
│   │   └── utils.py
│   ├── update_checker.py
│   └── version.py
├── contributing.md
├── docs/
│   ├── Makefile
│   ├── make.bat
│   └── source/
│       ├── _static/
│       │   └── style.css
│       ├── _templates/
│       │   ├── class.rst
│       │   └── layout.html
│       ├── api_reference.rst
│       ├── conf.py
│       ├── examples/
│       │   ├── demo/
│       │   │   ├── __init__.py
│       │   │   ├── chicago_bike/
│       │   │   │   ├── __init__.py
│       │   │   │   └── sample.csv
│       │   │   ├── next_purchase/
│       │   │   │   ├── __init__.py
│       │   │   │   └── sample.csv
│       │   │   ├── turbofan_degredation/
│       │   │   │   ├── __init__.py
│       │   │   │   └── sample.csv
│       │   │   └── utils.py
│       │   ├── predict_bike_trips.ipynb
│       │   ├── predict_next_purchase.ipynb
│       │   └── predict_turbofan_degredation.ipynb
│       ├── images/
│       │   ├── innovation_labs.xml
│       │   ├── label-maker.xml
│       │   ├── labeling-function.xml
│       │   └── workflow.xml
│       ├── index.rst
│       ├── install.md
│       ├── release_notes.rst
│       ├── resources/
│       │   ├── faq.ipynb
│       │   └── help.rst
│       ├── resources.rst
│       ├── start.ipynb
│       ├── tutorials.rst
│       ├── user_guide/
│       │   ├── controlling_cutoff_times.ipynb
│       │   ├── data_slice_generator.ipynb
│       │   └── using_label_transforms.ipynb
│       └── user_guide.rst
├── pyproject.toml
└── release.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .codecov.yml
================================================
codecov:
  notify:
    require_ci_to_pass: yes

comment:
  layout: "diff, files"

coverage:
  precision: 2
  round: down
  range: 90..100
  status:
    project:
      default:
        target: 100%
    patch:
      default:
        target: 100%
    changes: no

ignore:
  - "composeml/update_checker.py"
================================================
FILE: .github/ISSUE_TEMPLATE/blank_issue.md
================================================
---
name: Blank Issue
about: Create a blank issue
title: ''
labels: ''
assignees: ''
---
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug Report
about: Create a bug report to help us improve Compose
title: ''
labels: 'bug'
assignees: ''
---
[A clear and concise description of what the bug is.]
#### Code Sample, a copy-pastable example to reproduce your bug.
```python
# Your code here
```
================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: true
contact_links:
  - name: General Technical Question
    about: "If you have a question like *How should I create my label times?* you can ask on StackOverflow using the #compose-ml tag."
    url: https://stackoverflow.com/questions/tagged/compose-ml
  - name: Real-time chat
    url: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA
    about: "If you want to meet others in the community and chat about all things Alteryx OSS then check out our Slack."
================================================
FILE: .github/ISSUE_TEMPLATE/documentation_improvement.md
================================================
---
name: Documentation Improvement
about: Suggest an idea for improving the documentation
title: ''
labels: 'documentation'
assignees: ''
---
[a description of what documentation you believe needs to be fixed/improved]
================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature Request
about: Suggest an idea for this project
title: ''
labels: 'new feature'
assignees: ''
---
- As a [user/developer], I wish I could use Compose to ...
#### Code Example
```python
# Your code here, if applicable
```
================================================
FILE: .github/auto_assign.yml
================================================
# Set to author to set pr creator as assignee
addAssignees: author
================================================
FILE: .github/workflows/auto_approve_dependency_PRs.yml
================================================
name: Auto Approve Dependency PRs
on:
  schedule:
    - cron: '*/30 * * * *'
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Find dependency PRs
        id: find_prs
        run: |
          gh auth status
          gh pr list --repo "${{ github.repository }}" --assignee "machineFL" --base main --state open --search "status:success review:required" --limit 1 --json number > dep_PRs_waiting_approval.json
          dep_pull_request=$(cat dep_PRs_waiting_approval.json | grep -Eo "[0-9]*")
          echo ::set-output name=dep_pull_request::${dep_pull_request}
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}
      - name: Approve dependency PRs and enable auto-merge
        if: ${{ steps.find_prs.outputs.dep_pull_request > 1 }}
        run: |
          gh pr review --repo "${{ github.repository }}" --comment --body "auto approve" ${{ steps.find_prs.outputs.dep_pull_request }}
          gh pr review --repo "${{ github.repository }}" --approve ${{ steps.find_prs.outputs.dep_pull_request }}
          gh pr merge --repo "${{ github.repository }}" --auto --squash --delete-branch ${{ steps.find_prs.outputs.dep_pull_request }}
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}
================================================
FILE: .github/workflows/build_docs.yml
================================================
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
name: Build Docs
jobs:
  doc_tests:
    name: Doc Tests / Python 3.8
    runs-on: ubuntu-latest
    steps:
      - name: Set up Python 3.8
        uses: actions/setup-python@v4
        with:
          python-version: 3.8
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Build source distribution
        run: make package
      - name: Install package with doc requirements
        run: |
          python -m pip config --site set global.progress_bar off
          python -m pip install unpacked_sdist/
          python -m pip install unpacked_sdist/[dev]
          python -m pip install unpacked_sdist/[docs]
          python -m pip check
          sudo apt install -q -y pandoc
          sudo apt install -q -y graphviz
      - name: Run doc tests
        run: make -C docs/ -e "SPHINXOPTS=-W" clean html
================================================
FILE: .github/workflows/create_feedstock_pr.yaml
================================================
name: Create Feedstock PR
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'released PyPI version to use (ex - v1.11.1)'
        required: true
jobs:
  create_feedstock_pr:
    name: Create Feedstock PR
    runs-on: ubuntu-latest
    steps:
      - name: Checkout inputted version
        uses: actions/checkout@v3
        with:
          repository: ${{ github.event.pull_request.head.repo.full_name }}
          ref: ${{ github.event.inputs.version }}
          path: "./compose"
      - name: Pull latest from upstream for user forked feedstock
        run: |
          gh auth status
          gh repo sync alteryx/composeml-feedstock --branch main --source conda-forge/composeml-feedstock --force
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}
      - uses: actions/checkout@v3
        with:
          repository: alteryx/composeml-feedstock
          ref: main
          path: "./composeml-feedstock"
          fetch-depth: '0'
      - name: Run Create Feedstock meta YAML
        id: create-feedstock-meta
        uses: alteryx/create-feedstock-meta-yaml@v4
        with:
          project: "composeml"
          pypi_version: ${{ github.event.inputs.version }}
          project_metadata_filepath: "compose/pyproject.toml"
          meta_yaml_filepath: "composeml-feedstock/recipe/meta.yaml"
      - name: View updated meta yaml
        run: cat composeml-feedstock/recipe/meta.yaml
      - name: Push updated yaml
        run: |
          cd composeml-feedstock
          git config --unset-all http.https://github.com/.extraheader
          git config --global user.email "machineOSS@alteryx.com"
          git config --global user.name "machineAYX Bot"
          git remote set-url origin https://${{ secrets.AUTO_APPROVE_TOKEN }}@github.com/alteryx/composeml-feedstock
          git checkout -b ${{ github.event.inputs.version }}
          git add recipe/meta.yaml
          git commit -m "${{ github.event.inputs.version }}"
          git push origin ${{ github.event.inputs.version }}
      - name: Adding URL to job output
        run: |
          echo "Conda Feedstock Pull Request: https://github.com/alteryx/composeml-feedstock/pull/new/${{ github.event.inputs.version }}" >> $GITHUB_STEP_SUMMARY
================================================
FILE: .github/workflows/install_test.yml
================================================
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
name: Install Test
jobs:
  install_cm_complete:
    name: ${{ matrix.os }} - ${{ matrix.python_version }} install compose
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest]
        python_version: ["3.8", "3.9", "3.10", "3.11"]
    runs-on: ${{ matrix.os }}
    steps:
      - name: Set up python ${{ matrix.python_version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Build compose package
        run: make package
      - name: Install compose complete from sdist
        run: |
          pip config --site set global.progress_bar off
          python -m pip install "unpacked_sdist/[complete]"
      - name: Test by importing packages
        run: |
          python -c "import alteryx_open_src_update_checker"
        env:
          ALTERYX_OPEN_SRC_UPDATE_CHECKER: False
      - name: Check package conflicts
        run: |
          python -m pip check
================================================
FILE: .github/workflows/latest_dependency_checker.yml
================================================
# This workflow will install dependencies, and if any critical dependencies have changed,
# a pull request will be created which will trigger a CI run with the new dependencies.
name: Latest Dependency Checker
on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.8
        uses: actions/setup-python@v4
        with:
          python-version: '3.8.x'
      - name: Install pip and virtualenv
        run: |
          python -m pip install --upgrade pip
          python -m pip install virtualenv
      - name: Update latest core dependencies
        run: |
          python -m virtualenv venv_core
          source venv_core/bin/activate
          python -m pip install --upgrade pip
          python -m pip install .[test]
          make checkdeps OUTPUT_FILEPATH=composeml/tests/requirement_files/latest_core_dependencies.txt
      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v3
        with:
          token: ${{ secrets.REPO_SCOPED_TOKEN }}
          commit-message: Update latest dependencies
          title: Automated Latest Dependency Updates
          author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
          body: "This is an auto-generated PR with **latest** dependency updates.
            Please do not delete the `latest-dep-update` branch because it's needed by the auto-dependency bot."
          branch: latest-dep-update
          branch-suffix: short-commit-hash
          base: main
          assignees: machineFL
          reviewers: machineAYX
================================================
FILE: .github/workflows/lint_check.yml
================================================
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
name: Lint Check
jobs:
  lint_test:
    name: ${{ matrix.python_version }} lint check
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.11"]
    steps:
      - name: Set up python ${{ matrix.python_version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Build compose package
        run: make package
      - name: Install compose with dev and test requirements
        run: |
          pip config --site set global.progress_bar off
          python -m pip install --upgrade pip
          python -m pip install .[dev]
      - name: Run lint test
        run: make lint
================================================
FILE: .github/workflows/release.yml
================================================
on:
  release:
    types: [published]
name: Release
jobs:
  pypi:
    name: Release to PyPI
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Remove docs and tests before release
        run: |
          rm -rf docs/
      - name: Upload to PyPI
        uses: FeatureLabs/gh-action-pypi-upload@v2
        env:
          PYPI_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
          PYPI_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TEST_PYPI_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }}
          TEST_PYPI_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }}
================================================
FILE: .github/workflows/release_notes_updated.yml
================================================
name: Release Notes Updated
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  release_notes_updated:
    name: release notes updated
    runs-on: ubuntu-latest
    steps:
      - name: Check for development branch
        id: branch
        shell: python
        run: |
          from re import compile
          main = '^main$'
          release = '^release_v\d+\.\d+\.\d+$'
          dep_update = '^latest-dep-update-[a-f0-9]{7}$'
          min_dep_update = '^min-dep-update-[a-f0-9]{7}$'
          regex = main, release, dep_update, min_dep_update
          patterns = list(map(compile, regex))
          ref = "${{ github.event.pull_request.head.ref }}"
          is_dev = not any(pattern.match(ref) for pattern in patterns)
          print('::set-output name=is_dev::' + str(is_dev))
      - name: Checkout repository
        if: ${{ steps.branch.outputs.is_dev == 'True' }}
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Check if release notes were updated
        if: ${{ steps.branch.outputs.is_dev == 'True' }}
        run: cat docs/source/release_notes.rst | grep ":pr:\`${{ github.event.number }}\`"
================================================
FILE: .github/workflows/unit_tests_with_latest_deps.yml
================================================
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
name: Unit Tests - Latest Dependencies
jobs:
  unit_tests:
    name: Unit Tests / Python ${{ matrix.python-version }}
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11"]
    steps:
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Build source distribution
        run: make package
      - name: Install package with test requirements
        run: |
          python -m pip config --site set global.progress_bar off
          python -m pip install --upgrade pip
          python -m pip install unpacked_sdist/[test]
      - if: ${{ matrix.python-version == 3.8 }}
        name: Run unit tests with code coverage
        run: |
          coverage erase
          cd unpacked_sdist/
          pytest composeml/ -n auto --cov=composeml --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
      - if: ${{ matrix.python-version != 3.8 }}
        name: Run unit tests with no code coverage
        run: |
          cd unpacked_sdist/
          pytest composeml/ -n auto
      - if: ${{ matrix.python-version == 3.8 }}
        name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          fail_ci_if_error: true
          files: ${{ github.workspace }}/coverage.xml
          verbose: true
================================================
FILE: .gitignore
================================================
cb_model.json
.DS_Store
# IDE
.vscode
docs/source/examples/demo/*/download
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
docs/source/generated
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
================================================
FILE: .pre-commit-config.yaml
================================================
exclude: |
  (?x)
  .html$|.csv$|.svg$|.md$|.txt$|.json$|.xml$|.pickle$|^.github/|
  (LICENSE.*|README.*)
default_stages: [commit]
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: 'v4.3.0'
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/kynan/nbstripout
    rev: 0.5.0
    hooks:
      - id: nbstripout
        entry: nbstripout
        language: python
        types: [jupyter]
  - repo: https://github.com/MarcoGorelli/absolufy-imports
    rev: 'v0.3.1'
    hooks:
      - id: absolufy-imports
        files: ^composeml/
  - repo: https://github.com/asottile/add-trailing-comma
    rev: v2.2.3
    hooks:
      - id: add-trailing-comma
        name: Add trailing comma
  - repo: https://github.com/python/black
    rev: 22.12.0
    hooks:
      - id: black
        args:
          - --config=./pyproject.toml
        additional_dependencies: [".[jupyter]"]
        types_or: [python, jupyter]
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: 'v0.0.231'
    hooks:
      - id: ruff
        args:
          - --config=./pyproject.toml
          - --fix
================================================
FILE: .readthedocs.yaml
================================================
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/source/conf.py
  fail_on_warning: true

# Optionally build your docs in additional formats such as PDF and ePub
formats: []

# Optionally set the version of Python and requirements required to build your docs
python:
  version: "3.8"
  install:
    - method: pip
      path: .
      extra_requirements:
        - dev
        - docs
================================================
FILE: LICENSE
================================================
BSD 3-Clause License
Copyright (c) 2017, Feature Labs, Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: Makefile
================================================
.PHONY: clean
clean:
	find . -name '*.pyo' -delete
	find . -name '*.pyc' -delete
	find . -name __pycache__ -delete
	find . -name '*~' -delete
	find . -name '.coverage.*' -delete

.PHONY: lint
lint:
	black . --check --config=./pyproject.toml
	ruff . --config=./pyproject.toml

.PHONY: lint-fix
lint-fix:
	black . --config=./pyproject.toml
	ruff . --fix --config=./pyproject.toml

.PHONY: test
test:
	python -m pytest composeml/ -n auto

.PHONY: testcoverage
testcoverage:
	python -m pytest composeml/ --cov=composeml -n auto

.PHONY: installdeps
installdeps: upgradepip
	pip install -e ".[dev]"

.PHONY: checkdeps
checkdeps:
	$(eval allow_list='matplotlib|pandas|seaborn|woodwork|featuretools|evalml|tqdm')
	pip freeze | grep -v "alteryx/compose.git" | grep -E $(allow_list) > $(OUTPUT_FILEPATH)

.PHONY: upgradepip
upgradepip:
	python -m pip install --upgrade pip

.PHONY: upgradebuild
upgradebuild:
	python -m pip install --upgrade build

.PHONY: upgradesetuptools
upgradesetuptools:
	python -m pip install --upgrade setuptools

.PHONY: package
package: upgradepip upgradebuild upgradesetuptools
	python -m build
	$(eval PACKAGE=$(shell python -c 'import setuptools; setuptools.setup()' --version))
	tar -zxvf "dist/composeml-${PACKAGE}.tar.gz"
	mv "composeml-${PACKAGE}" unpacked_sdist
================================================
FILE: README.md
================================================
<p align="center"><img width=50% src="https://raw.githubusercontent.com/alteryx/compose/main/docs/source/images/compose.png" alt="Compose" /></p>
<p align="center"><i>"Build better training examples in a fraction of the time."</i></p>
<p align="center">
<a href="https://github.com/alteryx/compose/actions?query=workflow%3ATests" target="_blank">
<img src="https://github.com/alteryx/compose/workflows/Tests/badge.svg" alt="Tests" />
</a>
<a href="https://codecov.io/gh/alteryx/compose">
<img src="https://codecov.io/gh/alteryx/compose/branch/main/graph/badge.svg?token=mDz4ueTUEO"/>
</a>
<a href="https://compose.alteryx.com/en/stable/?badge=stable" target="_blank">
<img src="https://readthedocs.com/projects/feature-labs-inc-compose/badge/?version=stable&token=5c3ace685cdb6e10eb67828a4dc74d09b20bb842980c8ee9eb4e9ed168d05b00"
alt="ReadTheDocs" />
</a>
<a href="https://badge.fury.io/py/composeml" target="_blank">
<img src="https://badge.fury.io/py/composeml.svg?maxAge=2592000" alt="PyPI Version" />
</a>
<a href="https://stackoverflow.com/questions/tagged/compose-ml" target="_blank">
<img src="https://img.shields.io/badge/questions-on_stackoverflow-blue.svg?" alt="StackOverflow" />
</a>
<a href="https://pepy.tech/project/composeml" target="_blank">
<img src="https://pepy.tech/badge/composeml/month" alt="PyPI Downloads" />
</a>
</p>
<hr>
[Compose](https://compose.alteryx.com) is a machine learning tool for automated prediction engineering. It allows you to structure prediction problems and generate labels for supervised learning. An end user defines an outcome of interest by writing a *labeling function*, then runs a search to automatically extract training examples from historical data. The resulting labels are then passed to [Featuretools](https://docs.featuretools.com/) for automated feature engineering and subsequently to [EvalML](https://evalml.alteryx.com/) for automated machine learning. The workflow of an applied machine learning engineer then becomes:
<br><p align="center"><img width=90% src="https://raw.githubusercontent.com/alteryx/compose/main/docs/source/images/workflow.png" alt="Compose" /></p><br>
By automating the early stage of the machine learning pipeline, our end user can easily define a task and solve it. See the [documentation](https://compose.alteryx.com) for more information.
## Installation
Install with pip
```
python -m pip install composeml
```
or from the Conda-forge channel on [conda](https://anaconda.org/conda-forge/composeml):
```
conda install -c conda-forge composeml
```
### Add-ons
**Update checker** - Receive automatic notifications of new Compose releases
```
python -m pip install "composeml[update_checker]"
```
## Example
> Will a customer spend more than 300 in the next hour of transactions?
In this example, we automatically generate new training examples from a historical dataset of transactions.
```python
import composeml as cp
df = cp.demos.load_transactions()
df = df[df.columns[:7]]
df.head()
```
<table border="0" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>transaction_id</th>
<th>session_id</th>
<th>transaction_time</th>
<th>product_id</th>
<th>amount</th>
<th>customer_id</th>
<th>device</th>
</tr>
</thead>
<tbody>
<tr>
<td>298</td>
<td>1</td>
<td>2014-01-01 00:00:00</td>
<td>5</td>
<td>127.64</td>
<td>2</td>
<td>desktop</td>
</tr>
<tr>
<td>10</td>
<td>1</td>
<td>2014-01-01 00:09:45</td>
<td>5</td>
<td>57.39</td>
<td>2</td>
<td>desktop</td>
</tr>
<tr>
<td>495</td>
<td>1</td>
<td>2014-01-01 00:14:05</td>
<td>5</td>
<td>69.45</td>
<td>2</td>
<td>desktop</td>
</tr>
<tr>
<td>460</td>
<td>10</td>
<td>2014-01-01 02:33:50</td>
<td>5</td>
<td>123.19</td>
<td>2</td>
<td>tablet</td>
</tr>
<tr>
<td>302</td>
<td>10</td>
<td>2014-01-01 02:37:05</td>
<td>5</td>
<td>64.47</td>
<td>2</td>
<td>tablet</td>
</tr>
</tbody>
</table>
First, we represent the prediction problem with a labeling function and a label maker.
```python
def total_spent(ds):
return ds['amount'].sum()
label_maker = cp.LabelMaker(
target_dataframe_index="customer_id",
time_index="transaction_time",
labeling_function=total_spent,
window_size="1h",
)
```
Then, we run a search to automatically generate the training examples.
```python
label_times = label_maker.search(
df.sort_values('transaction_time'),
num_examples_per_instance=2,
minimum_data='2014-01-01',
drop_empty=False,
verbose=False,
)
label_times = label_times.threshold(300)
label_times.head()
```
<table border="0" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>customer_id</th>
<th>time</th>
<th>total_spent</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2014-01-01 00:00:00</td>
<td>True</td>
</tr>
<tr>
<td>1</td>
<td>2014-01-01 01:00:00</td>
<td>True</td>
</tr>
<tr>
<td>2</td>
<td>2014-01-01 00:00:00</td>
<td>False</td>
</tr>
<tr>
<td>2</td>
<td>2014-01-01 01:00:00</td>
<td>False</td>
</tr>
<tr>
<td>3</td>
<td>2014-01-01 00:00:00</td>
<td>False</td>
</tr>
</tbody>
</table>
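The `threshold` transform applied in the search above is what turns the continuous `total_spent` values into the boolean labels shown in this table. A minimal pandas sketch of that behavior, with made-up label values (a simplification for illustration, not Compose's internal code):

```python
import pandas as pd

# Hypothetical continuous label values, as a search might produce them.
total_spent = pd.Series([450.0, 120.0, 310.0], name="total_spent")

# threshold(300) keeps True where the label value exceeds the threshold.
labels = total_spent.gt(300)
print(labels.tolist())  # [True, False, True]
```

Values equal to the threshold are treated as `False` in this sketch; see the `LabelTimes.threshold` API reference for the exact comparison Compose performs.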
We now have labels that are ready to use in [Featuretools](https://docs.featuretools.com/) to generate features.
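Featuretools uses each row of `label_times` as a cutoff time: feature values for an instance are computed only from data recorded before that instance's label time, which is what prevents label leakage. A minimal plain-pandas sketch of the cutoff-time idea, using made-up transactions and amounts (not the demo dataset):

```python
import pandas as pd

# Hypothetical transactions, made up for illustration.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2],
        "transaction_time": pd.to_datetime(
            [
                "2014-01-01 00:30:00",
                "2014-01-01 01:30:00",
                "2014-01-01 00:15:00",
                "2014-01-01 02:00:00",
            ]
        ),
        "amount": [100.0, 250.0, 80.0, 40.0],
    }
)

# One cutoff time per customer, mirroring the label times table above.
label_times = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "time": pd.to_datetime(["2014-01-01 01:00:00", "2014-01-01 01:00:00"]),
    }
)

# Only rows recorded before each customer's cutoff may contribute to features.
merged = transactions.merge(label_times, on="customer_id")
before_cutoff = merged[merged["transaction_time"] < merged["time"]]
spent_before_cutoff = before_cutoff.groupby("customer_id")["amount"].sum()
print(spent_before_cutoff.to_dict())  # {1: 100.0, 2: 80.0}
```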
## Support
The Innovation Labs open source community is happy to provide support to users of Compose. Project support can be found in the following places depending on the type of question:
1. For usage questions, use [Stack Overflow](https://stackoverflow.com/questions/tagged/compose-ml) with the `composeml` tag.
2. For bugs, issues, or feature requests start a Github [issue](https://github.com/alteryx/compose/issues/new).
3. For discussion regarding development on the core library, use [Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA).
4. For everything else, the core developers can be reached by email at open_source_support@alteryx.com
## Citing Compose
Compose is built upon a newly defined part of the machine learning process — prediction engineering. If you use Compose, please consider citing this paper:
James Max Kanter, Owen Gillespie, and Kalyan Veeramachaneni. [Label, Segment, Featurize: a cross domain framework for prediction engineering.](https://dai.lids.mit.edu/wp-content/uploads/2017/10/Pred_eng1.pdf) IEEE DSAA 2016.
BibTeX entry:
```bibtex
@inproceedings{kanter2016label,
title={Label, segment, featurize: a cross domain framework for prediction engineering},
author={Kanter, James Max and Gillespie, Owen and Veeramachaneni, Kalyan},
booktitle={2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
pages={430--439},
year={2016},
organization={IEEE}
}
```
## Acknowledgements
The open source development has been supported in part by DARPA's Data-Driven Discovery of Models (D3M) program.
## Alteryx
**Compose** is an open source project maintained by [Alteryx](https://www.alteryx.com). We developed Compose to enable flexible definition of the machine learning task. To see the other open source projects we’re working on visit [Alteryx Open Source](https://www.alteryx.com/open-source). If building impactful data science pipelines is important to you or your business, please get in touch.
<p align="center">
<a href="https://www.alteryx.com/open-source">
<img src="https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_Logo-01.png" alt="Alteryx Open Source" width="800"/>
</a>
</p>
================================================
FILE: composeml/__init__.py
================================================
# flake8:noqa
from composeml.version import __version__
from composeml import demos, update_checker
from composeml.label_maker import LabelMaker
from composeml.label_times import LabelTimes, read_label_times
================================================
FILE: composeml/conftest.py
================================================
import pandas as pd
import pytest

from composeml import LabelTimes
from composeml.tests.utils import read_csv


@pytest.fixture(scope="session")
def transactions():
    df = read_csv(
        data=[
            "time,amount,customer_id",
            "2019-01-01 08:00:00,1,0",
            "2019-01-01 08:30:00,1,0",
            "2019-01-01 09:00:00,1,1",
            "2019-01-01 09:30:00,1,1",
            "2019-01-01 10:00:00,1,1",
            "2019-01-01 10:30:00,1,2",
            "2019-01-01 11:00:00,1,2",
            "2019-01-01 11:30:00,1,2",
            "2019-01-01 12:00:00,1,2",
            "2019-01-01 12:30:00,1,3",
        ],
    )
    return df


@pytest.fixture(scope="session")
def total_spent_fn():
    def total_spent(df):
        value = df.amount.sum()
        return value

    return total_spent


@pytest.fixture(scope="session")
def unique_amounts_fn():
    def unique_amounts(df):
        return df.amount.nunique()

    return unique_amounts


@pytest.fixture
def total_spent():
    data = [
        "customer_id,time,total_spent",
        "0,2019-01-01 08:00:00,9",
        "0,2019-01-01 08:30:00,8",
        "1,2019-01-01 09:00:00,7",
        "1,2019-01-01 09:30:00,6",
        "1,2019-01-01 10:00:00,5",
        "2,2019-01-01 10:30:00,4",
        "2,2019-01-01 11:00:00,3",
        "2,2019-01-01 11:30:00,2",
        "2,2019-01-01 12:00:00,1",
        "3,2019-01-01 12:30:00,0",
    ]
    data = read_csv(data, parse_dates=["time"])
    kwargs = {
        "data": data,
        "target_columns": ["total_spent"],
        "target_dataframe_index": "customer_id",
        "search_settings": {
            "num_examples_per_instance": -1,
        },
    }
    label_times = LabelTimes(**kwargs)
    return label_times


@pytest.fixture
def labels():
    records = [
        {
            "label_id": 0,
            "customer_id": 1,
            "time": "2014-01-01 00:45:00",
            "my_labeling_function": 226.92999999999998,
        },
        {
            "label_id": 1,
            "customer_id": 1,
            "time": "2014-01-01 00:48:00",
            "my_labeling_function": 47.95,
        },
        {
            "label_id": 2,
            "customer_id": 2,
            "time": "2014-01-01 00:01:00",
            "my_labeling_function": 283.46000000000004,
        },
        {
            "label_id": 3,
            "customer_id": 2,
            "time": "2014-01-01 00:04:00",
            "my_labeling_function": 31.54,
        },
    ]
    dtype = {"time": "datetime64[ns]"}
    values = pd.DataFrame(records).astype(dtype).set_index("label_id")
    values = values[["customer_id", "time", "my_labeling_function"]]
    values = LabelTimes(
        values,
        target_columns=["my_labeling_function"],
        target_dataframe_index="customer_id",
    )
    return values


@pytest.fixture(autouse=True)
def add_labels(doctest_namespace, labels):
    doctest_namespace["labels"] = labels
FILE: composeml/data_slice/__init__.py
================================================
# flake8:noqa
from composeml.data_slice.generator import DataSliceGenerator
================================================
FILE: composeml/data_slice/extension.py
================================================
import pandas as pd
from composeml.data_slice.offset import DataSliceOffset, DataSliceStep
class DataSliceContext:
"""Tracks contextual attributes about a data slice."""
def __init__(
self,
slice_number=0,
slice_start=None,
slice_stop=None,
next_start=None,
):
"""Creates the data slice context.
Args:
slice_number (int): The latest count of data slices.
slice_start (int or Timestamp): When the data slice starts.
slice_stop (int or Timestamp): When the data slice stops.
next_start (int or Timestamp): When the next data slice starts.
"""
self.next_start = next_start
self.slice_stop = slice_stop
self.slice_start = slice_start
self.slice_number = slice_number
def __repr__(self):
"""Represents the data slice context as a string."""
return self._series.fillna("").to_string()
@property
def _series(self):
"""Represents the data slice context as a pandas series."""
keys = reversed(list(vars(self)))
attrs = {key: getattr(self, key) for key in keys}
context = pd.Series(attrs, name="context")
return context
@property
def count(self):
"""Alias for the data slice number."""
return self.slice_number
@property
def start(self):
"""Alias for the start point of a data slice."""
return self.slice_start
@property
def stop(self):
"""Alias for the stopping point of a data slice."""
return self.slice_stop
class DataSliceFrame(pd.DataFrame):
"""Subclasses pandas data frame for data slice."""
_metadata = ["context"]
@property
def _constructor(self):
return DataSliceFrame
@property
def ctx(self):
"""Alias for the data slice context."""
return self.context
@pd.api.extensions.register_dataframe_accessor("slice")
class DataSliceExtension:
def __init__(self, df):
self._df = df
def __call__(self, size=None, start=None, stop=None, step=None, drop_empty=True):
"""Returns a data slice generator based on the data frame.
Args:
size (int or str): The size of each data slice. A string represents a timedelta or frequency.
An integer represents the number of rows. The default value is the length of the data frame.
start (int or str): Where to start the first data slice.
stop (int or str): Where to stop generating data slices.
step (int or str): The step size between data slices. The default value is the data slice size.
drop_empty (bool): Whether to drop empty data slices. The default value is True.
Returns:
ds (generator): Returns a generator of data slices.
"""
self._check_index()
offsets = self._check_offsets(size, start, stop, step)
generator = self._apply(*offsets, drop_empty=drop_empty)
return generator
def __getitem__(self, offset):
"""Generates data slices from a slice object."""
if not isinstance(offset, slice):
raise TypeError("must be a slice object")
return self(size=offset.step, start=offset.start, stop=offset.stop)
def _apply(self, size, start, stop, step, drop_empty=True):
"""Generates data slices based on the data frame."""
df = self._apply_start(self._df, start, step)
if df.empty and drop_empty:
return df
df, slice_number = DataSliceFrame(df), 1
while start.value and start.value <= stop.value:
if df.empty and drop_empty:
break
ds = self._apply_size(df, start, size)
df = self._apply_step(df, start, step)
if ds.empty and drop_empty:
continue
ds.context.next_start = start.value
ds.context.slice_number = slice_number
slice_number += 1
yield ds
def _apply_size(self, df, start, size):
"""Returns a data slice calculated by the offsets."""
if size._is_offset_position:
index = self._get_index(df, size.value)
stop = index or self._last_index
ds = df.iloc[: size.value]
else:
stop = start.value + size.value
ds = df[:stop]
# Pandas includes both endpoints when slicing by time.
# This results in the right endpoint overlapping in consecutive data slices.
# Resolved by making the right endpoint exclusive.
# https://pandas.pydata.org/pandas-docs/version/0.19/gotchas.html#endpoints-are-inclusive
if not ds.empty:
overlap = ds.index == stop
if overlap.any():
ds = ds[~overlap]
ds.context = DataSliceContext(slice_start=start.value, slice_stop=stop)
return ds
def _apply_start(self, df, start, step):
"""Removes data before the index calculated by the offset."""
inplace = start.value == self._first_index
if start._is_offset_position and not inplace:
df = df.iloc[start.value :]
first_index = df.first_valid_index()
start.value = self._first_index = first_index
if start._is_offset_timestamp and not inplace:
df = df[df.index >= start.value]
if step._is_offset_position:
first_index = df.first_valid_index()
start.value = self._first_index = first_index
return df
def _apply_step(self, df, start, step):
"""Strides the first index by the offset."""
if step._is_offset_position:
df = df.iloc[step.value :]
first_index = df.first_valid_index()
start.value = first_index
else:
start.value += step.value
df = df[start.value :]
return df
def _check_index(self):
"""Checks if index values are null or unsorted."""
null = self._df.index.isnull().any()
assert not null, "index contains null values"
assert self._is_sorted, "data frame must be sorted chronologically"
self._first_index = self._df.first_valid_index()
self._last_index = self._df.last_valid_index()
def _check_offsets(self, size, start, stop, step):
"""Checks for valid data slice offsets."""
size = self._check_size(size or len(self._df))
start = self._check_start(start or self._first_index)
stop = self._check_stop(stop or self._last_index)
step = self._check_step(step or size)
offsets = size, start, stop, step
if any(offset._is_offset_frequency for offset in offsets):
info = "offset by frequency requires a time index"
assert self._is_time_index, info
return offsets
def _check_size(self, size):
"""Checks for valid offset size."""
if not isinstance(size, DataSliceStep):
size = DataSliceStep(size)
assert size._is_positive, "offset must be positive"
return size
def _check_start(self, start):
"""Checks for valid offset start."""
if not isinstance(start, DataSliceOffset):
start = DataSliceOffset(start)
if start._is_offset_frequency:
start.value += self._first_index
return start
def _check_step(self, step):
"""Checks for valid offset step."""
if not isinstance(step, DataSliceStep):
step = DataSliceStep(step)
assert step._is_positive, "offset must be positive"
return step
def _check_stop(self, stop):
"""Checks for valid offset stop."""
if not isinstance(stop, DataSliceOffset):
stop = DataSliceOffset(stop)
if stop._is_offset_frequency:
base = "first" if stop._is_positive else "last"
value = getattr(self, f"_{base}_index")
stop.value += value
inplace = stop.value == self._last_index
if stop._is_offset_position and not inplace:
index = self._get_index(self._df, stop.value)
stop.value = index or self._last_index
return stop
def _get_index(self, df, i):
"""Helper function for getting index values."""
if i < df.index.size and df.index.size > 0:
return df.index[i]
@property
def _is_sorted(self):
"""Whether index values are sorted."""
return self._df.index.is_monotonic_increasing
@property
def _is_time_index(self):
"""Whether the data frame has a time index type."""
return pd.api.types.is_datetime64_any_dtype(self._df.index)
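The endpoint workaround in `_apply_size` can be seen with plain pandas. This is a minimal standalone sketch (not part of the package) of why the right endpoint must be made exclusive when slicing by time:

```python
import pandas as pd

# When label-slicing a DatetimeIndex, pandas includes BOTH endpoints,
# so consecutive time windows would share their boundary row.
idx = pd.to_datetime(
    ["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00", "2019-01-01 03:00"]
)
df = pd.DataFrame({"amount": [1, 1, 1, 1]}, index=idx)

stop = pd.Timestamp("2019-01-01 02:00")
ds = df[:stop]  # includes the row at 02:00 as well -> 3 rows

# Make the right endpoint exclusive, as _apply_size does:
ds = ds[~(ds.index == stop)]  # -> 2 rows; 02:00 belongs to the next slice
```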
================================================
FILE: composeml/data_slice/generator.py
================================================
from composeml.data_slice.extension import DataSliceContext, DataSliceFrame
class DataSliceGenerator:
"""Generates data slices for the lable maker."""
def __init__(
self,
window_size,
gap=None,
min_data=None,
max_data=None,
drop_empty=True,
):
self.window_size = window_size
self.gap = gap
self.min_data = min_data
self.max_data = max_data
self.drop_empty = drop_empty
def __call__(self, df):
"""Applies the data slice generator to the data frame."""
is_column = self.window_size in df
method = "column" if is_column else "time"
attr = f"_slice_by_{method}"
return getattr(self, attr)(df)
def _slice_by_column(self, df):
"""Slices the data frame by an existing column."""
slices = df.groupby(self.window_size, sort=False)
slice_number = 1
for group, ds in slices:
ds = DataSliceFrame(ds)
ds.context = DataSliceContext(
slice_number=slice_number,
slice_start=ds.first_valid_index(),
slice_stop=ds.last_valid_index(),
)
setattr(ds.context, self.window_size, group)
del ds.context.next_start
slice_number += 1
yield ds
def _slice_by_time(self, df):
"""Slices the data frame along the time index."""
data_slices = df.slice(
size=self.window_size,
start=self.min_data,
stop=self.max_data,
step=self.gap,
drop_empty=self.drop_empty,
)
for ds in data_slices:
yield ds
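For intuition, column-based slicing reduces to a pandas `groupby` over the window column, one slice per group. A minimal plain-pandas sketch (illustrative names, not the package API):

```python
import pandas as pd

# Each group of the window column becomes one data slice, numbered in
# order of appearance (sort=False), mirroring _slice_by_column.
df = pd.DataFrame(
    {
        "session_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

slices = []
for slice_number, (group, ds) in enumerate(
    df.groupby("session_id", sort=False), start=1
):
    # A labeling function would run on each slice; here we total the amount.
    slices.append((slice_number, group, ds["amount"].sum()))

# slices == [(1, 1, 30.0), (2, 2, 120.0)]
```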
================================================
FILE: composeml/data_slice/offset.py
================================================
import re
import pandas as pd
class DataSliceOffset:
"""Offsets for calculating data slice indices."""
def __init__(self, value):
self.value = value
self._check()
def _check(self):
"""Checks if the value is a valid offset."""
if isinstance(self.value, str):
self._parse_value()
assert self._is_valid_offset, self._invalid_offset_error
@property
def _is_offset_base(self):
"""Whether offset is a base type."""
return issubclass(type(self.value), pd.tseries.offsets.BaseOffset)
@property
def _is_offset_position(self):
"""Whether offset is integer-location based."""
return pd.api.types.is_integer(self.value)
@property
def _is_offset_timedelta(self):
"""Whether offset is a timedelta."""
return isinstance(self.value, pd.Timedelta)
@property
def _is_offset_timestamp(self):
"""Whether offset is a timestamp."""
return isinstance(self.value, pd.Timestamp)
@property
def _is_offset_frequency(self):
"""Whether offset is a base type or timedelta."""
value = self._is_offset_base
value |= self._is_offset_timedelta
return value
def __int__(self):
"""Typecasts offset value to an integer."""
if self._is_offset_position:
return self.value
elif self._is_offset_base:
return self.value.n
elif self._is_offset_timedelta:
return self.value.value
else:
raise TypeError("offset must be position or frequency based")
def __float__(self):
"""Typecasts offset value to a float."""
if self._is_offset_timestamp:
return self.value.timestamp()
else:
raise TypeError("offset must be a timestamp")
@property
def _is_positive(self):
"""Whether the offset value is positive."""
timestamp = self._is_offset_timestamp
numeric = float if timestamp else int
return numeric(self) > 0
@property
def _is_valid_offset(self):
"""Whether offset is a valid type."""
value = self._is_offset_position
value |= self._is_offset_frequency
value |= self._is_offset_timestamp
return value
@property
def _invalid_offset_error(self):
"""Returns message for invalid offset."""
info = "offset must be position or time based\n\n"
info += "\tFor information about offset aliases, visit the link below.\n"
info += (
"\thttps://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases"
)
return info
def _parse_offset_alias(self, alias):
"""Parses an alias to an offset."""
value = self._parse_offset_alias_phrase(alias)
value = value or pd.tseries.frequencies.to_offset(alias)
return value
def _parse_offset_alias_phrase(self, value):
"""Parses an alias phrase to an offset."""
pattern = re.compile("until start of next (?P<unit>[a-z]+)")
match = pattern.search(value.lower())
if match:
match = match.groupdict()
unit = match["unit"]
if unit == "month":
return pd.offsets.MonthBegin()
if unit == "year":
return pd.offsets.YearBegin()
def _parse_value(self):
"""Parses the value to an offset."""
for parser in self._parsers:
try:
value = parser(self.value)
if value is not None:
self.value = value
break
except Exception:
continue
@property
def _parsers(self):
"""Returns the value parsers."""
return pd.Timestamp, self._parse_offset_alias, pd.Timedelta
class DataSliceStep(DataSliceOffset):
@property
def _is_valid_offset(self):
"""Whether offset is a valid type."""
value = self._is_offset_position
value |= self._is_offset_frequency
return value
@property
def _parsers(self):
"""Returns the value parsers."""
return self._parse_offset_alias, pd.Timedelta
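The phrase parser above can be exercised on its own. A standalone sketch (the function name here is illustrative) of how `"until start of next month"` maps to an anchored pandas offset:

```python
import re

import pandas as pd

def parse_offset_alias_phrase(value):
    # Mirrors DataSliceOffset._parse_offset_alias_phrase: map a phrase
    # like "until start of next month" to an anchored pandas offset.
    match = re.search(r"until start of next (?P<unit>[a-z]+)", value.lower())
    if match:
        unit = match.group("unit")
        if unit == "month":
            return pd.offsets.MonthBegin()
        if unit == "year":
            return pd.offsets.YearBegin()

offset = parse_offset_alias_phrase("Until Start of Next Month")
# Anchored offsets roll timestamps forward to the next boundary:
# pd.Timestamp("2019-01-15") + offset == pd.Timestamp("2019-02-01")
```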
================================================
FILE: composeml/demos/__init__.py
================================================
import os
import pandas as pd
DATA = os.path.dirname(__file__)
def load_transactions():
path = os.path.join(DATA, "transactions.csv")
df = pd.read_csv(path, parse_dates=["transaction_time"])
return df
================================================
FILE: composeml/demos/transactions.csv
================================================
transaction_id,session_id,transaction_time,product_id,amount,customer_id,device,session_start,zip_code,join_date,date_of_birth,brand
298,1,2014-01-01 00:00:00,5,127.64,2,desktop,2014-01-01 00:00:00,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
10,1,2014-01-01 00:09:45,5,57.39,2,desktop,2014-01-01 00:00:00,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
495,1,2014-01-01 00:14:05,5,69.45,2,desktop,2014-01-01 00:00:00,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
460,10,2014-01-01 02:33:50,5,123.19,2,tablet,2014-01-01 02:31:40,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
302,10,2014-01-01 02:37:05,5,64.47,2,tablet,2014-01-01 02:31:40,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
212,10,2014-01-01 02:41:25,5,52.28,2,tablet,2014-01-01 02:31:40,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
440,10,2014-01-01 02:44:40,5,50.45,2,tablet,2014-01-01 02:31:40,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
405,15,2014-01-01 03:42:05,5,47.39,2,desktop,2014-01-01 03:41:00,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
180,15,2014-01-01 03:48:35,5,146.81,2,desktop,2014-01-01 03:41:00,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
220,16,2014-01-01 03:55:05,5,135.48,2,desktop,2014-01-01 03:49:40,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
253,17,2014-01-01 04:00:30,5,41.95,2,tablet,2014-01-01 04:00:30,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
340,17,2014-01-01 04:08:05,5,100.99,2,tablet,2014-01-01 04:00:30,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
301,31,2014-01-01 07:49:05,5,66.86,2,mobile,2014-01-01 07:42:35,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
346,31,2014-01-01 07:51:15,5,18.81,2,mobile,2014-01-01 07:42:35,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
161,31,2014-01-01 07:55:35,5,75.96,2,mobile,2014-01-01 07:42:35,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
420,31,2014-01-01 07:59:55,5,66.1,2,mobile,2014-01-01 07:42:35,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
468,33,2014-01-01 08:11:50,5,46.99,2,mobile,2014-01-01 08:10:45,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
281,33,2014-01-01 08:15:05,5,86.81,2,mobile,2014-01-01 08:10:45,13244,2012-04-15 23:31:04,1986-08-18 00:00:00,A
270,2,2014-01-01 00:18:25,5,123.53,5,mobile,2014-01-01 00:17:20,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
453,2,2014-01-01 00:19:30,5,9.32,5,mobile,2014-01-01 00:17:20,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
74,2,2014-01-01 00:23:50,5,90.69,5,mobile,2014-01-01 00:17:20,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
207,2,2014-01-01 00:24:55,5,48.27,5,mobile,2014-01-01 00:17:20,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
122,2,2014-01-01 00:27:05,5,13.81,5,mobile,2014-01-01 00:17:20,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
40,20,2014-01-01 04:46:00,5,53.22,5,desktop,2014-01-01 04:46:00,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
377,20,2014-01-01 05:01:10,5,83.33,5,desktop,2014-01-01 04:46:00,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
206,24,2014-01-01 05:48:50,5,61.3,5,tablet,2014-01-01 05:44:30,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
94,24,2014-01-01 05:55:20,5,100.42,5,tablet,2014-01-01 05:44:30,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
84,24,2014-01-01 05:57:30,5,75.75,5,tablet,2014-01-01 05:44:30,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
256,28,2014-01-01 06:51:40,5,101.39,5,mobile,2014-01-01 06:50:35,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
292,28,2014-01-01 06:53:50,5,138.17,5,mobile,2014-01-01 06:50:35,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
490,28,2014-01-01 07:07:55,5,149.02,5,mobile,2014-01-01 06:50:35,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
154,28,2014-01-01 07:09:00,5,44.11,5,mobile,2014-01-01 06:50:35,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
73,30,2014-01-01 07:29:35,5,42.94,5,desktop,2014-01-01 07:27:25,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
240,30,2014-01-01 07:40:25,5,59.71,5,desktop,2014-01-01 07:27:25,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
297,32,2014-01-01 08:04:15,5,20.65,5,mobile,2014-01-01 08:02:05,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
391,32,2014-01-01 08:07:30,5,57.88,5,mobile,2014-01-01 08:02:05,60091,2010-07-17 05:27:50,1984-07-28 00:00:00,A
461,3,2014-01-01 00:39:00,5,102.76,4,mobile,2014-01-01 00:28:10,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
44,3,2014-01-01 00:43:20,5,147.73,4,mobile,2014-01-01 00:28:10,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
327,5,2014-01-01 01:12:35,5,20.06,4,mobile,2014-01-01 01:11:30,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
48,5,2014-01-01 01:14:45,5,131.29,4,mobile,2014-01-01 01:11:30,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
442,5,2014-01-01 01:16:55,5,97.18,4,mobile,2014-01-01 01:11:30,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
285,8,2014-01-01 01:55:55,5,51.69,4,tablet,2014-01-01 01:55:55,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
225,8,2014-01-01 02:06:45,5,124.13,4,tablet,2014-01-01 01:55:55,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
254,11,2014-01-01 02:52:15,5,118.51,4,mobile,2014-01-01 02:47:55,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
465,11,2014-01-01 02:56:35,5,66.95,4,mobile,2014-01-01 02:47:55,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
487,11,2014-01-01 03:03:05,5,27.02,4,mobile,2014-01-01 02:47:55,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
117,12,2014-01-01 03:04:10,5,101.84,4,desktop,2014-01-01 03:04:10,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
196,12,2014-01-01 03:05:15,5,29.37,4,desktop,2014-01-01 03:04:10,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
226,21,2014-01-01 05:07:40,5,77.78,4,desktop,2014-01-01 05:02:15,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
288,21,2014-01-01 05:09:50,5,55.74,4,desktop,2014-01-01 05:02:15,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
494,21,2014-01-01 05:10:55,5,109.3,4,desktop,2014-01-01 05:02:15,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
380,21,2014-01-01 05:14:10,5,57.09,4,desktop,2014-01-01 05:02:15,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
236,21,2014-01-01 05:17:25,5,69.62,4,desktop,2014-01-01 05:02:15,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
87,21,2014-01-01 05:18:30,5,7.93,4,desktop,2014-01-01 05:02:15,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
109,22,2014-01-01 05:30:25,5,82.69,4,desktop,2014-01-01 05:21:45,60091,2011-04-08 20:08:14,2006-08-15 00:00:00,A
275,4,2014-01-01 00:45:30,5,108.11,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
101,4,2014-01-01 00:46:35,5,112.53,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
80,4,2014-01-01 00:47:40,5,6.29,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
163,4,2014-01-01 00:52:00,5,31.37,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
293,4,2014-01-01 00:53:05,5,82.88,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
103,4,2014-01-01 00:57:25,5,20.79,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
488,4,2014-01-01 01:03:55,5,129.0,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
413,4,2014-01-01 01:05:00,5,119.98,1,mobile,2014-01-01 00:44:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
191,6,2014-01-01 01:31:00,5,139.23,1,tablet,2014-01-01 01:23:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
372,6,2014-01-01 01:37:30,5,114.84,1,tablet,2014-01-01 01:23:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
387,6,2014-01-01 01:38:35,5,49.71,1,tablet,2014-01-01 01:23:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
287,9,2014-01-01 02:28:25,5,50.94,1,desktop,2014-01-01 02:15:25,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
190,14,2014-01-01 03:29:05,5,110.52,1,tablet,2014-01-01 03:28:00,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
7,14,2014-01-01 03:39:55,5,107.42,1,tablet,2014-01-01 03:28:00,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
19,18,2014-01-01 04:14:35,5,133.49,1,desktop,2014-01-01 04:14:35,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
392,18,2014-01-01 04:17:50,5,72.67,1,desktop,2014-01-01 04:14:35,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
398,26,2014-01-01 06:18:05,5,27.95,1,tablet,2014-01-01 06:17:00,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
152,26,2014-01-01 06:26:45,5,42.81,1,tablet,2014-01-01 06:17:00,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
221,26,2014-01-01 06:31:05,5,7.08,1,tablet,2014-01-01 06:17:00,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
403,27,2014-01-01 06:35:25,5,28.26,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
368,27,2014-01-01 06:36:30,5,139.43,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
334,27,2014-01-01 06:38:40,5,54.26,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
333,27,2014-01-01 06:44:05,5,103.2,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
339,27,2014-01-01 06:45:10,5,26.56,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
43,27,2014-01-01 06:47:20,5,55.26,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
199,27,2014-01-01 06:48:25,5,5.91,1,mobile,2014-01-01 06:34:20,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
355,29,2014-01-01 07:11:10,5,110.68,1,mobile,2014-01-01 07:10:05,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
352,29,2014-01-01 07:13:20,5,92.43,1,mobile,2014-01-01 07:10:05,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
182,29,2014-01-01 07:16:35,5,125.73,1,mobile,2014-01-01 07:10:05,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
177,29,2014-01-01 07:19:50,5,55.11,1,mobile,2014-01-01 07:10:05,60091,2011-04-17 10:48:33,1994-07-18 00:00:00,A
259,7,2014-01-01 01:45:05,5,32.85,3,tablet,2014-01-01 01:39:40,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
274,7,2014-01-01 01:46:10,5,14.45,3,tablet,2014-01-01 01:39:40,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
214,7,2014-01-01 01:51:35,5,101.58,3,tablet,2014-01-01 01:39:40,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
441,19,2014-01-01 04:30:50,5,9.34,3,desktop,2014-01-01 04:27:35,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
146,19,2014-01-01 04:38:25,5,126.74,3,desktop,2014-01-01 04:27:35,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
483,19,2014-01-01 04:43:50,5,60.17,3,desktop,2014-01-01 04:27:35,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
159,23,2014-01-01 05:32:35,5,43.69,3,desktop,2014-01-01 05:32:35,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
186,23,2014-01-01 05:40:10,5,128.26,3,desktop,2014-01-01 05:32:35,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
378,25,2014-01-01 06:15:55,5,131.83,3,desktop,2014-01-01 05:59:40,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
110,34,2014-01-01 08:24:50,5,145.74,3,desktop,2014-01-01 08:24:50,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
497,34,2014-01-01 08:29:10,5,148.86,3,desktop,2014-01-01 08:24:50,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
467,34,2014-01-01 08:32:25,5,145.19,3,desktop,2014-01-01 08:24:50,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
267,34,2014-01-01 08:38:55,5,58.47,3,desktop,2014-01-01 08:24:50,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
493,35,2014-01-01 08:48:40,5,132.94,3,mobile,2014-01-01 08:44:20,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
338,35,2014-01-01 08:51:55,5,93.71,3,mobile,2014-01-01 08:44:20,13244,2011-08-13 15:42:34,2003-11-21 00:00:00,A
================================================
FILE: composeml/label_maker.py
================================================
from sys import stdout
from pandas import Series
from pandas.api.types import is_categorical_dtype
from tqdm import tqdm
from composeml.data_slice import DataSliceGenerator
from composeml.label_search import ExampleSearch, LabelSearch
from composeml.label_times import LabelTimes
class LabelMaker:
"""Automatically makes labels for prediction problems."""
def __init__(
self,
target_dataframe_index,
time_index,
labeling_function=None,
window_size=None,
):
"""Creates an instance of label maker.
Args:
target_dataframe_index (str): The index of the target dataframe, from which labels will be created.
time_index (str): Name of time column in the data frame.
labeling_function (function or list(function) or dict(str=function)): Function, list of functions, or dictionary of functions that transform a data slice.
When set as a dictionary, the key is used as the name of the labeling function.
window_size (str or int): Size of the data slices. As a string, the value can be a timedelta or a column in the data frame to group by.
As an integer, the value can be the number of rows. Default value is all future data.
"""
self.labeling_function = labeling_function or {}
self.target_dataframe_index = target_dataframe_index
self.time_index = time_index
self.window_size = window_size
def _name_labeling_function(self, function):
"""Gets the names of the labeling functions."""
has_name = hasattr(function, "__name__")
return function.__name__ if has_name else type(function).__name__
def _check_labeling_function(self, function, name=None):
"""Checks whether the labeling function is callable."""
assert callable(function), "labeling function must be callable"
return function
@property
def labeling_function(self):
"""Gets the labeling function(s)."""
return self._labeling_function
@labeling_function.setter
def labeling_function(self, value):
"""Sets and formats the intial labeling function(s).
Args:
value (function or list(function) or dict(str=function)): Function that transforms a data slice to a label.
"""
if isinstance(value, dict):
for name, function in value.items():
self._check_labeling_function(function)
assert isinstance(name, str), "labeling function name must be string"
if callable(value):
value = [value]
if isinstance(value, (tuple, list)):
value = {
self._name_labeling_function(function): self._check_labeling_function(
function,
)
for function in value
}
assert isinstance(value, dict), "value type for labeling function not supported"
self._labeling_function = value
def _check_cutoff_time(self, value):
if isinstance(value, Series):
if value.index.is_unique:
return value.to_dict()
else:
raise ValueError("more than one cutoff time exists for a target group")
else:
return value
def slice(
self,
df,
num_examples_per_instance,
minimum_data=None,
maximum_data=None,
gap=None,
drop_empty=True,
):
"""Generates data slices of target dataframe.
Args:
df (DataFrame): Data frame to create slices on.
num_examples_per_instance (int): Number of examples per unique instance of target dataframe.
minimum_data (int or str or Series): The amount of data needed before starting the search. Defaults to the first value in the time index.
The value can be a datetime string to directly set the first cutoff time or a timedelta string to denote the amount of data needed before
the first cutoff time. The value can also be an integer to denote the number of rows needed before the first cutoff time.
If a Series, minimum_data should be datetime string, timedelta string, or integer values with a unique set of target groups as the corresponding index.
maximum_data (str): Maximum data before stopping the search. Defaults to the last value in the time index.
gap (str or int): Time between examples. Default value is window size.
If an integer, search will start on the first event after the minimum data.
drop_empty (bool): Whether to drop empty slices. Default value is True.
Returns:
ds (generator): Returns a generator of data slices.
"""
self._check_example_count(num_examples_per_instance, gap)
df = self.set_index(df)
target_groups = df.groupby(self.target_dataframe_index)
num_examples_per_instance = ExampleSearch._check_number(
num_examples_per_instance,
)
minimum_data = self._check_cutoff_time(minimum_data)
minimum_data_varies = isinstance(minimum_data, dict)
for group_key, df in target_groups:
if minimum_data_varies:
if group_key not in minimum_data:
continue
min_data_for_group = minimum_data[group_key]
else:
min_data_for_group = minimum_data
generator = DataSliceGenerator(
window_size=self.window_size,
min_data=min_data_for_group,
max_data=maximum_data,
drop_empty=drop_empty,
gap=gap,
)
for ds in generator(df):
setattr(ds.context, self.target_dataframe_index, group_key)
yield ds
if ds.context.slice_number >= num_examples_per_instance:
break
@property
def _bar_format(self):
"""Template to format the progress bar during a label search."""
value = "Elapsed: {elapsed} | "
value += "Remaining: {remaining} | "
value += "Progress: {l_bar}{bar}| "
value += self.target_dataframe_index + ": {n}/{total} "
return value
def _check_example_count(self, num_examples_per_instance, gap):
"""Checks whether example count corresponds to data slices."""
if self.window_size is None and gap is None:
more_than_one = num_examples_per_instance > 1
assert (
not more_than_one
), "must specify gap if num_examples > 1 and window size = none"
def search(
self,
df,
num_examples_per_instance,
minimum_data=None,
maximum_data=None,
gap=None,
drop_empty=True,
verbose=True,
*args,
**kwargs,
):
"""Searches the data to calculates labels.
Args:
df (DataFrame): Data frame to search and extract labels.
num_examples_per_instance (int or dict): The expected number of examples to return from each dataframe group.
A dictionary can be used to further specify the expected number of examples to return from each label.
minimum_data (int or str or Series): The amount of data needed before starting the search. Defaults to the first value in the time index.
The value can be a datetime string to directly set the first cutoff time or a timedelta string to denote the amount of data needed before
the first cutoff time. The value can also be an integer to denote the number of rows needed before the first cutoff time.
If a Series, minimum_data should be datetime string, timedelta string, or integer values with a unique set of target groups as the corresponding index.
maximum_data (str): Maximum data before stopping the search. Defaults to the last value in the time index.
gap (str or int): Time between examples. Default value is window size.
If an integer, search will start on the first event after the minimum data.
drop_empty (bool): Whether to drop empty slices. Default value is True.
verbose (bool): Whether to render progress bar. Default value is True.
*args: Positional arguments for labeling function.
**kwargs: Keyword arguments for labeling function.
Returns:
lt (LabelTimes): Calculated labels with cutoff times.
"""
assert self.labeling_function, "missing labeling function(s)"
self._check_example_count(num_examples_per_instance, gap)
is_label_search = isinstance(num_examples_per_instance, dict)
search = (LabelSearch if is_label_search else ExampleSearch)(
num_examples_per_instance,
)
# check minimum data cutoff time
minimum_data = self._check_cutoff_time(minimum_data)
minimum_data_varies = isinstance(minimum_data, dict)
df = self.set_index(df)
total = search.expected_count if search.is_finite else 1
# If the target is categorical, make sure there are no unused categories
if is_categorical_dtype(df[self.target_dataframe_index]):
df[self.target_dataframe_index] = df[
self.target_dataframe_index
].cat.remove_unused_categories()
target_groups = df.groupby(self.target_dataframe_index)
total *= target_groups.ngroups
progress_bar = tqdm(
total=total,
file=stdout,
disable=not verbose,
bar_format=self._bar_format,
)
records = []
for group_count, (group_key, df) in enumerate(target_groups, start=1):
if minimum_data_varies:
if group_key not in minimum_data:
continue
min_data_for_group = minimum_data[group_key]
else:
min_data_for_group = minimum_data
generator = DataSliceGenerator(
window_size=self.window_size,
min_data=min_data_for_group,
max_data=maximum_data,
drop_empty=drop_empty,
gap=gap,
)
for ds in generator(df):
setattr(ds.context, self.target_dataframe_index, group_key)
items = self.labeling_function.items()
labels = {name: lf(ds, *args, **kwargs) for name, lf in items}
valid_labels = search.is_valid_labels(labels)
if not valid_labels:
continue
records.append(
{
self.target_dataframe_index: group_key,
"time": ds.context.slice_start,
**labels,
},
)
search.update_count(labels)
# if finite search, update progress bar for the example found
if search.is_finite:
progress_bar.update(n=1)
if search.is_complete:
break
# if finite search, update progress bar for missing examples
if search.is_finite:
progress_bar.update(
n=group_count * search.expected_count - progress_bar.n,
)
else:
progress_bar.update(
n=1,
) # otherwise, update progress bar once for each group
search.reset_count()
total -= progress_bar.n
progress_bar.update(n=total)
progress_bar.close()
lt = LabelTimes(
data=records,
target_columns=list(self.labeling_function),
target_dataframe_index=self.target_dataframe_index,
search_settings={
"num_examples_per_instance": num_examples_per_instance,
"minimum_data": minimum_data,
"maximum_data": str(maximum_data),
"window_size": str(self.window_size),
"gap": str(gap),
},
)
return lt
def set_index(self, df):
"""Sets the time index in a data frame (if not already set).
Args:
df (DataFrame): Data frame to set time index in.
Returns:
df (DataFrame): Data frame with time index set.
"""
if df.index.name != self.time_index:
df = df.set_index(self.time_index)
if "time" not in str(df.index.dtype):
df.index = df.index.astype("datetime64[ns]")
return df
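The slicing behavior that `search` builds on — a window that advances by a gap, starting after a minimum amount of data — can be sketched in plain Python for the row-based (integer) case. `make_slices` below is a hypothetical illustration, not part of the Compose API:

```python
def make_slices(n_rows, window_size, gap=None, min_data=0):
    """Yield (start, stop) row offsets for consecutive data slices.

    Mirrors the integer (row-based) case: the first slice begins after
    ``min_data`` rows, each slice spans ``window_size`` rows, and the next
    slice starts ``gap`` rows after the current one (``gap`` defaults to
    the window size, producing non-overlapping slices).
    """
    gap = window_size if gap is None else gap
    start = min_data
    while start < n_rows:
        yield start, min(start + window_size, n_rows)
        start += gap

# Non-overlapping slices over 10 rows, skipping the first 2 rows.
slices = list(make_slices(10, window_size=4, min_data=2))
# → [(2, 6), (6, 10)]
```

Setting `gap` smaller than `window_size` would instead yield overlapping slices, which is why `_check_example_count` requires a gap whenever more than one example per instance is requested without a window size.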
================================================
FILE: composeml/label_search.py
================================================
from collections import Counter
from pandas import isnull
class ExampleSearch:
"""A label search based on the number of examples.
Args:
expected_count (int): The expected number of examples to find.
"""
def __init__(self, expected_count):
self.expected_count = self._check_number(expected_count)
self.reset_count()
@staticmethod
def _check_number(n):
"""Checks and formats the expected number of examples."""
if n == -1 or n == "inf":
return float("inf")
else:
info = "expected count must be numeric"
assert isinstance(n, (int, float)), info
return n
@staticmethod
def _is_finite_number(n):
"""Checks if a number if finite."""
return n > 0 and abs(n) != float("inf")
@property
def is_complete(self):
"""Whether the search has found the expected number of examples."""
return self.actual_count >= self.expected_count
@property
def is_finite(self):
"""Whether the expected number of examples is a finite number."""
return self._is_finite_number(self.expected_count)
def is_valid_labels(self, labels):
"""Whether the label values are not null."""
return not any(map(isnull, labels.values()))
def reset_count(self):
"""Reset the internal count of actual labels."""
self.actual_count = 0
def update_count(self, labels):
"""Update the internal count of actual labels."""
self.actual_count += 1
class LabelSearch(ExampleSearch):
"""A label search based on the number of examples for each label.
Args:
expected_label_counts (dict): The expected number of examples to find for each label.
The dictionary should map a label to the number of examples to find for the label.
"""
def __init__(self, expected_label_counts):
items = expected_label_counts.items()
self.expected_label_counts = Counter(
{label: self._check_number(count) for label, count in items},
)
self.expected_count = sum(self.expected_label_counts.values())
self.actual_label_counts = Counter()
@property
def is_complete(self):
"""Whether the search has found the expected number of examples for each label."""
return len(self.expected_label_counts - self.actual_label_counts) == 0
def is_complete_label(self, label):
"""Whether the search has found the expected number of examples for a label."""
return (
self.actual_label_counts.get(label, 0) >= self.expected_label_counts[label]
)
def is_valid_labels(self, labels):
"""Whether label values meet the search criteria.
The search criteria are label values that are not null, expected by the user, and have not yet reached the expected count.
When any label value meets these conditions, the labels are returned to the user, including the other label values that share the same cutoff time.
Args:
labels (dict): The actual label values found during a search.
Returns:
value (bool): The value is True when valid, otherwise False.
"""
label_values = labels.values()
not_null = super().is_valid_labels(labels)
is_expected = not_null and any(
label in self.expected_label_counts for label in label_values
)
value = is_expected and any(
not self.is_complete_label(label) for label in label_values
)
return value
def reset_count(self):
"""Reset the internal count of actual labels."""
self.actual_label_counts.clear()
def update_count(self, labels):
"""Update the internal count of the actual labels.
Args:
labels (dict): The actual label values found during a search.
"""
self.actual_label_counts.update(labels.values())
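The completeness check in `LabelSearch` leans on `Counter` subtraction: subtracting the actual counts from the expected counts keeps only positive remainders, so an empty difference means every label has reached its quota. A standalone sketch of that idiom:

```python
from collections import Counter

def is_complete(expected, actual):
    """True once every label has reached its expected count.

    Counter subtraction drops non-positive entries, so the difference
    is empty exactly when no label is still short of its quota.
    """
    return len(Counter(expected) - Counter(actual)) == 0

assert not is_complete({"yes": 2, "no": 1}, {"yes": 2})       # "no" still short
assert is_complete({"yes": 2, "no": 1}, {"yes": 3, "no": 1})  # quotas all met
```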
================================================
FILE: composeml/label_times/__init__.py
================================================
# flake8:noqa
from composeml.label_times.deserialize import read_label_times
from composeml.label_times.object import LabelTimes
================================================
FILE: composeml/label_times/description.py
================================================
import pandas as pd
def describe_label_times(label_times):
"""Prints out label info with transform settings that reproduce labels."""
target_column = label_times.target_columns[0]
is_discrete = label_times.is_discrete[target_column]
if is_discrete:
distribution = label_times[target_column].value_counts()
distribution.sort_index(inplace=True)
distribution.index = distribution.index.astype("str")
distribution["Total:"] = distribution.sum()
else:
distribution = label_times[target_column].describe()
print("Label Distribution\n" + "-" * 18, end="\n")
print(distribution.to_string(), end="\n\n\n")
metadata = label_times.settings
target_column = metadata["label_times"]["target_columns"][0]
target_type = metadata["label_times"]["target_types"][target_column]
target_dataframe_index = metadata["label_times"]["target_dataframe_index"]
settings = {
"target_column": target_column,
"target_dataframe_index": target_dataframe_index,
"target_type": target_type,
}
settings.update(metadata["label_times"]["search_settings"])
settings = pd.Series(settings)
print("Settings\n" + "-" * 8, end="\n")
settings.sort_index(inplace=True)
print(settings.to_string(), end="\n\n\n")
print("Transforms\n" + "-" * 10, end="\n")
transforms = metadata["label_times"]["transforms"]
for step, transform in enumerate(transforms):
transform = pd.Series(transform)
transform.sort_index(inplace=True)
name = transform.pop("transform")
transform = transform.add_prefix(" - ")
transform = transform.add_suffix(":")
transform = transform.to_string()
header = "{}. {}\n".format(step + 1, name)
print(header + transform, end="\n\n")
if len(transforms) == 0:
print("No transforms applied", end="\n\n")
================================================
FILE: composeml/label_times/deserialize.py
================================================
import json
import os
import pandas as pd
from composeml.label_times.object import LabelTimes
def read_config(path):
"""Reads config file from disk."""
file = os.path.join(path, "settings.json")
assert os.path.exists(file), "settings not found: '%s'" % file
with open(file, "r") as file:
settings = json.load(file)
return settings
def read_data(path):
"""Reads data file from disk."""
file = ""
for file in os.listdir(path):
if file.startswith("data"):
break
assert file.startswith("data"), "data not found"
extension = os.path.splitext(file)[1].lstrip(".")
info = "file extension must be csv, parquet, or pickle"
assert extension in ["csv", "parquet", "pickle"], info
read = getattr(pd, "read_%s" % extension)
data = read(os.path.join(path, file))
return data
def read_label_times(path, load_settings=True):
"""Reads label times from disk.
Args:
path (str): Directory where label times is stored.
load_settings (bool): Whether to load the settings used to make the label times. Default value is True.
Returns:
lt (LabelTimes): Deserialized label times.
"""
kwargs = {}
data = read_data(path)
if load_settings:
config = read_config(path)
data = data.astype(config["dtypes"])
kwargs.update(config["label_times"])
lt = LabelTimes(data=data, **kwargs)
return lt
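`read_data` dispatches on the file extension by looking up the matching `pandas` reader with `getattr(pd, "read_<ext>")`. The validation-and-dispatch half of that logic can be sketched without pandas; `reader_name` here is a hypothetical helper for illustration:

```python
import os

def reader_name(filename, allowed=("csv", "parquet", "pickle")):
    """Map a data file name to the pandas reader it would dispatch to."""
    extension = os.path.splitext(filename)[1].lstrip(".")
    assert extension in allowed, "file extension must be csv, parquet, or pickle"
    return "read_%s" % extension

assert reader_name("data.csv") == "read_csv"
assert reader_name("data.parquet") == "read_parquet"
```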
================================================
FILE: composeml/label_times/object.py
================================================
import json
import os
import pandas as pd
from composeml.label_times.description import describe_label_times
from composeml.label_times.plots import LabelPlots
from composeml.version import __version__
SCHEMA_VERSION = "0.1.0"
class LabelTimes(pd.DataFrame):
"""The data frame that contains labels and cutoff times for the target dataframe."""
def __init__(
self,
data=None,
target_dataframe_index=None,
target_types=None,
target_columns=None,
search_settings=None,
transforms=None,
*args,
**kwargs,
):
super().__init__(data=data, *args, **kwargs)
self.target_dataframe_index = target_dataframe_index
self.target_columns = target_columns or []
self.target_types = target_types or {}
self.search_settings = search_settings or {}
self.transforms = transforms or []
self.plot = LabelPlots(self)
if not self.empty:
self._check_label_times()
def _assert_single_target(self):
"""Asserts that the label times object contains a single target."""
info = "must first select an individual target"
assert self._is_single_target, info
def _check_target_columns(self):
"""Validates the target columns."""
if not self.target_columns:
self.target_columns = self._infer_target_columns()
else:
for target in self.target_columns:
info = 'target "%s" not found in data frame'
assert target in self.columns, info % target
def _check_target_types(self):
"""Validates the target types."""
if isinstance(self.target_types, dict):
self.target_types = pd.Series(self.target_types, dtype="object")
if self.target_types.empty:
self.target_types = self._infer_target_types()
else:
target_names = self.target_types.index.tolist()
match = target_names == self.target_columns
assert match, "target names in types must match target columns"
def _check_label_times(self):
"""Validates the lables times object."""
self._check_target_columns()
self._check_target_types()
def _infer_target_columns(self):
"""Infers the names of the targets in the data frame.
Returns:
value (list): A list of the target names.
"""
not_targets = [self.target_dataframe_index, "time"]
target_columns = self.columns.difference(not_targets)
assert not target_columns.empty, "target columns not found"
value = target_columns.tolist()
return value
@property
def _is_single_target(self):
return len(self.target_columns) == 1
def _get_target_type(self, dtype):
is_discrete = pd.api.types.is_bool_dtype(dtype)
is_discrete |= pd.api.types.is_categorical_dtype(dtype)
is_discrete |= pd.api.types.is_object_dtype(dtype)
value = "discrete" if is_discrete else "continuous"
return value
def _infer_target_types(self):
"""Infers the target type from the data type.
Returns:
types (Series): Inferred label type. Either "continuous" or "discrete".
"""
dtypes = self.dtypes[self.target_columns]
types = dtypes.apply(self._get_target_type)
return types
def select(self, target):
"""Selects one of the target variables.
Args:
target (str): The name of the target column.
Returns:
lt (LabelTimes): A label times object that contains a single target.
Examples:
Create a label times object that contains multiple target variables.
>>> entity = [0, 0, 1, 1]
>>> labels = [True, False, True, False]
>>> time = ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']
>>> data = {'entity': entity, 'time': time, 'A': labels, 'B': labels}
>>> lt = LabelTimes(data=data, target_dataframe_index='entity', target_columns=['A', 'B'])
>>> lt
entity time A B
0 0 2020-01-01 True True
1 0 2020-01-02 False False
2 1 2020-01-03 True True
3 1 2020-01-04 False False
Select a single target from the label times.
>>> lt.select('B')
entity time B
0 0 2020-01-01 True
1 0 2020-01-02 False
2 1 2020-01-03 True
3 1 2020-01-04 False
"""
assert not self._is_single_target, "only one target exists"
if not isinstance(target, str):
raise TypeError("target name must be string")
assert target in self.target_columns, 'target "%s" not found' % target
lt = self.copy()
lt.target_columns = [target]
lt.target_types = lt.target_types[[target]]
lt = lt[[self.target_dataframe_index, "time", target]]
return lt
@property
def settings(self):
"""Returns metadata about the label times."""
return {
"compose_version": __version__,
"schema_version": SCHEMA_VERSION,
"label_times": {
"target_dataframe_index": self.target_dataframe_index,
"target_columns": self.target_columns,
"target_types": self.target_types.to_dict(),
"search_settings": self.search_settings,
"transforms": self.transforms,
},
}
@property
def is_discrete(self):
"""Whether labels are discrete."""
return self.target_types.eq("discrete")
@property
def distribution(self):
"""Returns label distribution if labels are discrete."""
self._assert_single_target()
target_column = self.target_columns[0]
if self.is_discrete[target_column]:
labels = self.assign(count=1)
labels = labels.groupby(target_column)
distribution = labels["count"].count()
return distribution
else:
return self[target_column].describe()
@property
def count(self):
"""Returns label count per instance."""
self._assert_single_target()
count = self.groupby(self.target_dataframe_index)
count = count[self.target_columns[0]].count()
count = count.to_frame("count")
return count
@property
def count_by_time(self):
"""Returns label count across cutoff times."""
self._assert_single_target()
target_column = self.target_columns[0]
if self.is_discrete[target_column]:
keys = ["time", target_column]
value = self.groupby(keys).time.count()
value = value.unstack(target_column).fillna(0)
else:
value = self.groupby("time")
value = value[target_column].count()
value = (
value.cumsum()
) # In Python 3.5, these values automatically convert to float.
value = value.astype("int")
return value
def describe(self):
"""Prints out the settings used to make the label times."""
if not self.empty:
self._assert_single_target()
describe_label_times(self)
def copy(self, deep=True):
"""Make a copy of this object's indices and data.
Args:
deep (bool): Make a deep copy, including a copy of the data and the indices.
With ``deep=False`` neither the indices nor the data are copied. Default is True.
Returns:
lt (LabelTimes): A copy of the label times object.
"""
lt = super().copy(deep=deep)
lt.target_dataframe_index = self.target_dataframe_index
lt.target_columns = self.target_columns
lt.target_types = self.target_types.copy()
lt.search_settings = self.search_settings.copy()
lt.transforms = self.transforms.copy()
return lt
def threshold(self, value, inplace=False):
"""Creates binary labels by testing if labels are above threshold.
Args:
value (float) : Value of threshold.
inplace (bool) : Modify labels in place.
Returns:
labels (LabelTimes) : Instance of labels.
"""
self._assert_single_target()
target_column = self.target_columns[0]
labels = self if inplace else self.copy()
labels[target_column] = labels[target_column].gt(value)
labels.target_types[target_column] = "discrete"
transform = {"transform": "threshold", "value": value}
labels.transforms.append(transform)
if not inplace:
return labels
def apply_lead(self, value, inplace=False):
"""Shifts the label times earlier for predicting in advance.
Args:
value (str) : Time to shift earlier.
inplace (bool) : Modify labels in place.
Returns:
labels (LabelTimes) : Instance of labels.
"""
labels = self if inplace else self.copy()
labels["time"] = labels["time"].sub(pd.Timedelta(value))
transform = {"transform": "apply_lead", "value": value}
labels.transforms.append(transform)
if not inplace:
return labels
def bin(self, bins, quantiles=False, labels=None, right=True, precision=3):
"""Bin labels into discrete intervals.
Args:
bins (int or array): The criteria to bin by.
As an integer, the value can be the number of equal-width or quantile-based bins.
If :code:`quantiles` is False, the value is defined as the number of equal-width bins.
The range is extended by .1% on each side to include the minimum and maximum values.
If :code:`quantiles` is True, the value is defined as the number of quantiles (e.g. 10 for deciles, 4 for quartiles, etc.)
As an array, the value can be custom or quantile-based edges.
If :code:`quantiles` is False, the value is defined as bin edges allowing for non-uniform width. No extension is done.
If :code:`quantiles` is True, the value is defined as bin edges using an array of quantiles (e.g. [0, .25, .5, .75, 1.] for quartiles)
quantiles (bool): Determines whether to use a quantile-based discretization function.
labels (array): Specifies the labels for the returned bins. Must be the same length as the resulting bins.
right (bool) : Indicates whether bins include the rightmost edge. Does not apply to quantile-based bins.
precision (int): The precision at which to store and display the bins labels. Default value is 3.
Returns:
LabelTimes : Instance of labels.
Examples:
These are the target values for the examples.
>>> data = [226.93, 47.95, 283.46, 31.54]
>>> lt = LabelTimes({'target': data})
>>> lt
target
0 226.93
1 47.95
2 283.46
3 31.54
Bin values using equal-widths.
>>> lt.bin(2)
target
0 (157.5, 283.46]
1 (31.288, 157.5]
2 (157.5, 283.46]
3 (31.288, 157.5]
Bin values using custom-widths.
>>> lt.bin([0, 200, 400])
target
0 (200, 400]
1 (0, 200]
2 (200, 400]
3 (0, 200]
Bin values using infinite edges.
>>> lt.bin(['-inf', 100, 'inf'])
target
0 (100.0, inf]
1 (-inf, 100.0]
2 (100.0, inf]
3 (-inf, 100.0]
Bin values using quartiles.
>>> lt.bin(4, quantiles=True)
target
0 (137.44, 241.062]
1 (43.848, 137.44]
2 (241.062, 283.46]
3 (31.538999999999998, 43.848]
Bin values using custom quantiles with precision.
>>> lt.bin([0, .5, 1], quantiles=True, precision=1)
target
0 (137.4, 283.5]
1 (31.4, 137.4]
2 (137.4, 283.5]
3 (31.4, 137.4]
Assign labels to bins.
>>> lt.bin(2, labels=['low', 'high'])
target
0 high
1 low
2 high
3 low
""" # noqa
self._assert_single_target()
target_column = self.target_columns[0]
values = self[target_column].values
if quantiles:
values = pd.qcut(values, q=bins, labels=labels, precision=precision)
else:
if isinstance(bins, list):
for i, edge in enumerate(bins):
if edge in ["-inf", "inf"]:
bins[i] = float(edge)
values = pd.cut(
values,
bins=bins,
labels=labels,
right=right,
precision=precision,
)
transform = {
"transform": "bin",
"bins": bins,
"quantiles": quantiles,
"labels": labels,
"right": right,
"precision": precision,
}
lt = self.copy()
lt[target_column] = values
lt.transforms.append(transform)
lt.target_types[target_column] = "discrete"
return lt
def _sample(self, key, value, settings, random_state=None, replace=False):
"""Returns a random sample of labels.
Args:
key (str) : Determines the sampling method. Can either be 'n' or 'frac'.
value (int or float) : Quantity to sample.
settings (dict) : Transform settings used for sampling.
random_state (int) : Seed for the random number generator.
replace (bool) : Sample with or without replacement. Default value is False.
Returns:
LabelTimes : Random sample of labels.
"""
sample = super().sample(
random_state=random_state, replace=replace, **{key: value}
)
return sample
def _sample_per_label(self, key, value, settings, random_state=None, replace=False):
"""Returns a random sample per label.
Args:
key (str) : Determines the sampling method. Can either be 'n' or 'frac'.
value (dict) : Quantity to sample per label.
settings (dict) : Transform settings used for sampling.
random_state (int) : Seed for the random number generator.
replace (bool) : Sample with or without replacement. Default value is False.
Returns:
LabelTimes : Random sample per label.
"""
sample_per_label = []
target_column = self.target_columns[0]
for label, quantity in value.items():
subset = self[self[target_column] == label]
sample = subset._sample(
key,
quantity,
settings,
random_state=random_state,
replace=replace,
)
sample_per_label.append(sample)
sample = pd.concat(sample_per_label, axis=0, sort=False)
return sample
def sample(
self,
n=None,
frac=None,
random_state=None,
replace=False,
per_instance=False,
):
"""Return a random sample of labels.
Args:
n (int or dict) : Number of labels to sample. A dictionary maps
each label to the number of samples. Cannot be used with frac.
frac (float or dict) : Fraction of labels to sample. A dictionary maps
each label to the fraction to sample. Cannot be used with n.
random_state (int) : Seed for the random number generator.
replace (bool) : Sample with or without replacement. Default value is False.
per_instance (bool): Whether to apply sampling to each group. Default is False.
Returns:
LabelTimes : Random sample of labels.
Examples:
Create a label times object.
>>> entity = [0, 0, 1, 1]
>>> labels = [True, False, True, False]
>>> data = {'entity': entity, 'labels': labels}
>>> lt = LabelTimes(data=data, target_dataframe_index='entity', target_columns=['labels'])
>>> lt
entity labels
0 0 True
1 0 False
2 1 True
3 1 False
Sample a number of the examples.
>>> lt.sample(n=3, random_state=0)
entity labels
1 0 False
2 1 True
3 1 False
Sample a fraction of the examples.
>>> lt.sample(frac=.25, random_state=0)
entity labels
2 1 True
Sample a number of the examples for specific labels.
>>> n = {True: 1, False: 1}
>>> lt.sample(n=n, random_state=0)
entity labels
2 1 True
3 1 False
Sample a fraction of the examples for specific labels.
>>> frac = {True: .5, False: .5}
>>> lt.sample(frac=frac, random_state=0)
entity labels
2 1 True
3 1 False
Sample a number of the examples from each entity group.
>>> lt.sample(n={True: 1}, per_instance=True, random_state=0)
entity labels
0 0 True
2 1 True
Sample a fraction of the examples from each entity group.
>>> lt.sample(frac=.5, per_instance=True, random_state=0)
entity labels
1 0 False
3 1 False
""" # noqa
self._assert_single_target()
settings = {
"transform": "sample",
"n": n,
"frac": frac,
"random_state": random_state,
"replace": replace,
"per_instance": per_instance,
}
key, value = ("n", n) if n else ("frac", frac)
assert value, "must set value for 'n' or 'frac'"
per_label = isinstance(value, dict)
method = "_sample_per_label" if per_label else "_sample"
def transform(lt):
sample = getattr(lt, method)(
key=key,
value=value,
settings=settings,
random_state=random_state,
replace=replace,
)
return sample
if per_instance:
groupby = self.groupby(self.target_dataframe_index, group_keys=False)
sample = groupby.apply(transform)
else:
sample = transform(self)
sample = sample.copy()
sample.sort_index(inplace=True)
sample.transforms.append(settings)
return sample
def equals(self, other, **kwargs):
"""Determines if two label time objects are the same.
Args:
other (LabelTimes) : Other label time object for comparison.
**kwargs: Keyword arguments to pass to underlying pandas.DataFrame.equals method
Returns:
bool : Whether label time objects are the same.
"""
is_equal = super().equals(other, **kwargs)
is_equal &= self.settings == other.settings
return is_equal
def _save_settings(self, path):
"""Write the settings in json format to disk.
Args:
path (str) : Directory on disk to write to.
"""
settings = self.settings
dtypes = self.dtypes.astype("str")
settings["dtypes"] = dtypes.to_dict()
file = os.path.join(path, "settings.json")
with open(file, "w") as file:
json.dump(settings, file)
def to_csv(self, path, save_settings=True, **kwargs):
"""Write label times in csv format to disk.
Args:
path (str) : Location on disk to write to (will be created as a directory).
save_settings (bool) : Whether to save the settings used to make the label times.
**kwargs: Keyword arguments to pass to underlying pandas.DataFrame.to_csv method
"""
os.makedirs(path, exist_ok=True)
file = os.path.join(path, "data.csv")
super().to_csv(file, index=False, **kwargs)
if save_settings:
self._save_settings(path)
def to_parquet(self, path, save_settings=True, **kwargs):
"""Write label times in parquet format to disk.
Args:
path (str) : Location on disk to write to (will be created as a directory).
save_settings (bool) : Whether to save the settings used to make the label times.
**kwargs: Keyword arguments to pass to underlying pandas.DataFrame.to_parquet method
"""
os.makedirs(path, exist_ok=True)
file = os.path.join(path, "data.parquet")
super().to_parquet(file, compression=None, engine="auto", **kwargs)
if save_settings:
self._save_settings(path)
def to_pickle(self, path, save_settings=True, **kwargs):
"""Write label times in pickle format to disk.
Args:
path (str) : Location on disk to write to (will be created as a directory).
save_settings (bool) : Whether to save the settings used to make the label times.
**kwargs: Keyword arguments to pass to underlying pandas.DataFrame.to_pickle method
"""
os.makedirs(path, exist_ok=True)
file = os.path.join(path, "data.pickle")
super().to_pickle(file, **kwargs)
if save_settings:
self._save_settings(path)
# ----------------------------------------
# Subclassing Pandas Data Frame
# ----------------------------------------
_metadata = [
"search_settings",
"target_columns",
"target_dataframe_index",
"target_types",
"transforms",
]
def __finalize__(self, other, method=None, **kwargs):
"""Propagate metadata from other label times data frames.
Args:
other (LabelTimes) : The label times from which to get the attributes from.
method (str) : A passed method name for optionally taking different types of propagation actions based on this value.
"""
if method == "concat":
other = other.objs[0]
for key in self._metadata:
value = getattr(other, key, None)
setattr(self, key, value)
return self
return super().__finalize__(other=other, method=method, **kwargs)
@property
def _constructor(self):
return LabelTimes
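Subclassing `pandas.DataFrame` as `LabelTimes` does requires three hooks working together: `_metadata` registers the custom attributes, `_constructor` tells pandas which class derived frames should have, and `__finalize__` copies the registered attributes onto those derived frames. A minimal pandas-free sketch of that propagation contract, using a hypothetical `Frame` class:

```python
class Frame:
    """Toy frame illustrating the metadata-propagation contract that
    LabelTimes implements on top of pandas.DataFrame."""

    # Attributes that should survive operations returning new frames.
    _metadata = ["target_columns", "transforms"]

    def __init__(self, data, target_columns=None, transforms=None):
        self.data = data
        self.target_columns = target_columns or []
        self.transforms = transforms or []

    def __finalize__(self, other):
        # Copy each registered attribute from the source frame.
        for key in self._metadata:
            setattr(self, key, getattr(other, key, None))
        return self

    def copy(self):
        # Derived frames call __finalize__ so metadata is not lost.
        return type(self)(dict(self.data)).__finalize__(self)

lt = Frame({"a": 1}, target_columns=["a"], transforms=[{"transform": "bin"}])
clone = lt.copy()
assert clone.target_columns == ["a"]
assert clone.transforms == lt.transforms
```

Without `__finalize__`, operations like `copy`, slicing, or `concat` would return frames with the default (empty) metadata, which is why `LabelTimes.copy` also restores each attribute explicitly.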
================================================
FILE: composeml/label_times/plots.py
================================================
import matplotlib as mpl # isort:skip
import pandas as pd
import seaborn as sns
# Raises an import error on OSX if not included.
# https://matplotlib.org/3.1.0/faq/osx_framework.html#working-with-matplotlib-on-osx
mpl.use("agg") # noqa
pd.plotting.register_matplotlib_converters()
sns.set_context("notebook")
sns.set_style("darkgrid")
COLOR = sns.color_palette("Set1", n_colors=100, desat=0.75)
class LabelPlots:
"""Creates plots for Label Times."""
def __init__(self, label_times):
"""Initializes Label Plots.
Args:
label_times (LabelTimes) : instance of Label Times
"""
self._label_times = label_times
def count_by_time(self, ax=None, **kwargs):
"""Plots the label distribution across cutoff times."""
count_by_time = self._label_times.count_by_time
count_by_time.sort_index(inplace=True)
target_column = self._label_times.target_columns[0]
ax = ax or mpl.pyplot.axes(label=id(self))
vmin = count_by_time.index.min()
vmax = count_by_time.index.max()
ax.set_xlim(vmin, vmax)
locator = mpl.dates.AutoDateLocator()
formatter = mpl.dates.AutoDateFormatter(locator)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)
for label in ax.get_xticklabels():
label.set_rotation(30)
if len(count_by_time.shape) > 1:
ax.stackplot(
count_by_time.index,
count_by_time.values.T,
labels=count_by_time.columns,
colors=COLOR,
alpha=0.9,
**kwargs,
)
ax.legend(
loc="upper left",
title=target_column,
facecolor="w",
framealpha=0.9,
)
ax.set_title("Label Count vs. Cutoff Times")
ax.set_ylabel("Count")
ax.set_xlabel("Time")
else:
ax.fill_between(
count_by_time.index,
count_by_time.values.T,
color=COLOR[1],
)
ax.set_title("Label vs. Cutoff Times")
ax.set_ylabel(target_column)
ax.set_xlabel("Time")
return ax
@property
def dist(self):
"""Alias for distribution."""
return self.distribution
def distribution(self, **kwargs):
"""Plots the label distribution."""
self._label_times._assert_single_target()
target_column = self._label_times.target_columns[0]
dist = self._label_times[target_column]
is_discrete = self._label_times.is_discrete[target_column]
if is_discrete:
ax = sns.countplot(x=dist, palette=COLOR, **kwargs)
else:
ax = sns.histplot(x=dist, kde=True, color=COLOR[1], **kwargs)
ax.set_title("Label Distribution")
ax.set_ylabel("Count")
return ax
================================================
FILE: composeml/tests/__init__.py
================================================
================================================
FILE: composeml/tests/requirement_files/latest_core_dependencies.txt
================================================
featuretools==1.27.0
matplotlib==3.7.2
pandas==2.0.3
seaborn==0.12.2
tqdm==4.66.1
woodwork==0.25.1
================================================
FILE: composeml/tests/requirement_files/minimum_core_requirements.txt
================================================
matplotlib==3.3.3
pandas==2.0.0
seaborn==0.12.2
tqdm==4.32.0
================================================
FILE: composeml/tests/requirement_files/minimum_test_requirements.txt
================================================
featuretools==1.27.0
matplotlib==3.3.3
pandas==2.0.0
pip==21.3.1
pyarrow==7.0.0
pytest-cov==3.0.0
pytest-xdist==2.5.0
pytest==7.1.2
seaborn==0.12.2
tqdm==4.32.0
wheel==0.33.1
woodwork==0.25.1
================================================
FILE: composeml/tests/test_data_slice/__init__.py
================================================
================================================
FILE: composeml/tests/test_data_slice/test_extension.py
================================================
import pandas as pd
from pytest import fixture, mark, raises
from composeml import LabelMaker
@fixture
def data_slice(transactions):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
window_size="1h",
)
ds = next(lm.slice(transactions, num_examples_per_instance=1))
return ds
def test_context(data_slice):
print(data_slice.context)
context = str(data_slice.context)
actual = context.splitlines()
expected = [
"customer_id 0",
"slice_number 1",
"slice_start 2019-01-01 08:00:00",
"slice_stop 2019-01-01 09:00:00",
"next_start 2019-01-01 09:00:00",
]
assert actual == expected
def test_context_aliases(data_slice):
assert data_slice.context == data_slice.ctx
assert data_slice.context.slice_number == data_slice.ctx.count
assert data_slice.context.slice_start == data_slice.ctx.start
assert data_slice.context.slice_stop == data_slice.ctx.stop
@mark.parametrize(
"time_based,offsets",
argvalues=[
[False, (2, 4, 2)],
[False, (2, -6, 2)],
[True, (pd.Timedelta("1h"), pd.Timedelta("2h"), pd.Timedelta("1h"))],
[True, (pd.Timedelta("1h"), pd.Timedelta("-2h30min"), pd.Timedelta("1h"))],
[True, ("2019-01-01 09:00:00", "2019-01-01 10:00:00", pd.Timedelta("1h"))],
],
)
def test_subscriptable_slices(transactions, time_based, offsets):
if time_based:
dtypes = {"time": "datetime64[ns]"}
transactions = transactions.astype(dtypes)
transactions.set_index("time", inplace=True)
start, stop, size = offsets
slices = transactions.slice[start:stop:size]
actual = tuple(map(len, slices))
assert actual == (2, 2)
def test_subscriptable_error(transactions):
with raises(TypeError, match="must be a slice object"):
transactions.slice[0]
def test_time_index_error(transactions):
match = "offset by frequency requires a time index"
with raises(AssertionError, match=match):
transactions.slice[::"1h"]
def test_minimum_data_per_group(transactions):
lm = LabelMaker(
"customer_id",
labeling_function=len,
time_index="time",
window_size="1h",
)
minimum_data = {1: "2019-01-01 09:00:00", 3: "2019-01-01 12:00:00"}
lengths = [len(ds) for ds in lm.slice(transactions, 1, minimum_data=minimum_data)]
assert lengths == [2, 1]
def test_drop_empty(transactions):
df = transactions.astype({"time": "datetime64[ns]"})
df.set_index("time", inplace=True)
df.sort_index(inplace=True)
ds = df.slice(
size="1h",
drop_empty=True,
stop="2019-01-01 15:00:00",
start="2019-01-01 08:00:00",
)
assert len(list(ds)) == 5
================================================
FILE: composeml/tests/test_data_slice/test_offset.py
================================================
from pytest import raises
from composeml.data_slice.offset import DataSliceOffset
def test_numeric_typecast():
assert int(DataSliceOffset("1 nanosecond")) == 1
assert float(DataSliceOffset("1970-01-01")) == 0.0
def test_numeric_typecast_errors():
match = "offset must be position or frequency based"
with raises(TypeError, match=match):
int(DataSliceOffset("1970-01-01"))
match = "offset must be a timestamp"
with raises(TypeError, match=match):
float(DataSliceOffset("1 nanosecond"))
def test_invalid_value():
match = "offset must be position or time based"
with raises(AssertionError, match=match):
DataSliceOffset(None)
def test_alias_phrase():
phrase = "until start of next month"
actual = DataSliceOffset(phrase).value
expected = DataSliceOffset("MS").value
assert actual == expected
phrase = "until start of next year"
actual = DataSliceOffset(phrase).value
expected = DataSliceOffset("YS").value
assert actual == expected
================================================
FILE: composeml/tests/test_datasets.py
================================================
import pytest
from composeml import demos
@pytest.fixture
def transactions():
return demos.load_transactions()
def test_transactions(transactions):
assert len(transactions) == 100
================================================
FILE: composeml/tests/test_featuretools.py
================================================
import featuretools as ft
import pytest
from composeml import LabelMaker
def total_spent(df):
total = df.amount.sum()
return total
@pytest.fixture
def labels():
df = ft.demo.load_mock_customer(return_single_table=True, random_seed=0)
df = df[["transaction_time", "customer_id", "amount"]]
df.sort_values("transaction_time", inplace=True)
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="transaction_time",
labeling_function=total_spent,
window_size="1h",
)
lt = lm.search(
df,
minimum_data="10min",
num_examples_per_instance=2,
gap="30min",
drop_empty=True,
verbose=False,
)
lt = lt.threshold(1250)
return lt
def test_dfs(labels):
target_column = labels.target_columns[0]
es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)
feature_matrix, _ = ft.dfs(
entityset=es,
target_dataframe_name="customers",
cutoff_time=labels,
cutoff_time_in_index=True,
)
assert target_column in feature_matrix
columns = ["customer_id", "time", target_column]
given_labels = feature_matrix.reset_index()[columns]
given_labels = given_labels.sort_values(["customer_id", "time"])
given_labels = given_labels.reset_index(drop=True)
given_labels = given_labels.rename_axis("label_id")
assert given_labels.equals(labels)
================================================
FILE: composeml/tests/test_label_maker.py
================================================
import pandas as pd
import pytest
from composeml import LabelMaker
from composeml.tests.utils import to_csv
def test_search_default(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
)
given_labels = lm.search(transactions, num_examples_per_instance=1)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,2",
"1,2019-01-01 09:00:00,3",
"2,2019-01-01 10:30:00,4",
"3,2019-01-01 12:30:00,1",
]
assert given_labels == labels
def test_search_examples_per_label(transactions, total_spent_fn):
def total_spent(ds):
return total_spent_fn(ds) > 2
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent,
)
n_examples = {True: -1, False: 1}
given_labels = lm.search(transactions, num_examples_per_instance=n_examples, gap=1)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,False",
"1,2019-01-01 09:00:00,True",
"1,2019-01-01 09:30:00,False",
"2,2019-01-01 10:30:00,True",
"2,2019-01-01 11:00:00,True",
"2,2019-01-01 11:30:00,False",
"3,2019-01-01 12:30:00,False",
]
assert given_labels == labels
def test_search_with_undefined_labels(transactions, total_spent_fn):
def total_spent(ds):
return total_spent_fn(ds) % 3
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent,
)
n_examples = {1: 1, 2: 1}
given_labels = lm.search(transactions, num_examples_per_instance=n_examples, gap=1)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,2",
"0,2019-01-01 08:30:00,1",
"1,2019-01-01 09:30:00,2",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 10:30:00,1",
"2,2019-01-01 11:30:00,2",
"3,2019-01-01 12:30:00,1",
]
assert given_labels == labels
def test_search_with_multiple_targets(transactions, total_spent_fn, unique_amounts_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
window_size=2,
labeling_function={
"total_spent": total_spent_fn,
"unique_amounts": unique_amounts_fn,
},
)
expected = [
"customer_id,time,total_spent,unique_amounts",
"0,2019-01-01 08:00:00,2,1",
"1,2019-01-01 09:00:00,2,1",
"1,2019-01-01 10:00:00,1,1",
"2,2019-01-01 10:30:00,2,1",
"2,2019-01-01 11:30:00,2,1",
"3,2019-01-01 12:30:00,1,1",
]
lt = lm.search(transactions, num_examples_per_instance=-1)
actual = lt.pipe(to_csv, index=False)
info = "unexpected calculated values"
assert actual == expected, info
expected = [
"customer_id,time,unique_amounts",
"0,2019-01-01 08:00:00,1",
"1,2019-01-01 09:00:00,1",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 10:30:00,1",
"2,2019-01-01 11:30:00,1",
"3,2019-01-01 12:30:00,1",
]
actual = lt.select("unique_amounts")
actual = actual.pipe(to_csv, index=False)
info = "selected values differ from calculated values"
assert actual == expected, info
def test_search_offset_mix_0(transactions, total_spent_fn):
"""
Test offset mix with window_size (absolute), minimum_data (absolute), and gap (absolute).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size="2h",
)
given_labels = lm.search(
transactions,
num_examples_per_instance=2,
minimum_data="30min",
gap="2h",
drop_empty=True,
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:30:00,1",
"1,2019-01-01 09:30:00,2",
"2,2019-01-01 11:00:00,3",
]
assert given_labels == labels
def test_search_offset_mix_1(transactions, total_spent_fn):
"""
Test offset mix with window_size (relative), minimum_data (absolute), and gap (absolute).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size=4,
)
given_labels = lm.search(
transactions,
num_examples_per_instance=2,
minimum_data="2019-01-01 10:00:00",
gap="4h",
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 10:00:00,4",
"3,2019-01-01 10:00:00,1",
]
assert given_labels == labels
def test_search_offset_mix_2(transactions, total_spent_fn):
"""
Test offset mix with window_size (absolute), minimum_data (relative), and gap (absolute).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size="30min",
)
given_labels = lm.search(
transactions,
num_examples_per_instance=2,
minimum_data=2,
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 11:30:00,1",
"2,2019-01-01 12:00:00,1",
]
assert given_labels == labels
def test_search_offset_mix_3(transactions, total_spent_fn):
"""
Test offset mix with window_size (absolute), minimum_data (absolute), and gap (relative).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size="8h",
)
given_labels = lm.search(
transactions,
num_examples_per_instance=-1,
minimum_data="2019-01-01 08:00:00",
gap=1,
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,2",
"0,2019-01-01 08:30:00,1",
"1,2019-01-01 09:00:00,3",
"1,2019-01-01 09:30:00,2",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 10:30:00,4",
"2,2019-01-01 11:00:00,3",
"2,2019-01-01 11:30:00,2",
"2,2019-01-01 12:00:00,1",
"3,2019-01-01 12:30:00,1",
]
assert given_labels == labels
def test_search_offset_mix_4(transactions, total_spent_fn):
"""
Test offset mix with window_size (relative), minimum_data (relative), and gap (absolute).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size=1,
)
given_labels = lm.search(
transactions,
num_examples_per_instance=2,
gap="30min",
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,1",
"0,2019-01-01 08:30:00,1",
"1,2019-01-01 09:00:00,1",
"1,2019-01-01 09:30:00,1",
"2,2019-01-01 10:30:00,1",
"2,2019-01-01 11:00:00,1",
"3,2019-01-01 12:30:00,1",
]
assert given_labels == labels
def test_search_offset_mix_5(transactions, total_spent_fn):
"""
Test offset mix with window_size (relative), minimum_data (absolute), and gap (relative).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size=2,
)
labels = lm.search(
transactions,
num_examples_per_instance=2,
minimum_data="1h",
gap=2,
)
labels = to_csv(labels, index=False)
expected_labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,2",
"1,2019-01-01 09:00:00,2",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 10:30:00,2",
"2,2019-01-01 11:30:00,2",
"3,2019-01-01 12:30:00,1",
]
assert labels == expected_labels
def test_search_offset_mix_6(transactions, total_spent_fn):
"""
Test offset mix with window_size (absolute), minimum_data (relative), and gap (relative).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size="1h",
)
given_labels = lm.search(
transactions,
num_examples_per_instance=1,
minimum_data=3,
gap=1,
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"2,2019-01-01 12:00:00,1",
]
assert given_labels == labels
def test_search_offset_mix_7(transactions, total_spent_fn):
"""
Test offset mix with window_size (relative), minimum_data (relative), and gap (relative).
"""
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size=10,
)
given_labels = lm.search(
transactions,
num_examples_per_instance=float("inf"),
)
given_labels = to_csv(given_labels, index=False)
labels = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,2",
"1,2019-01-01 09:00:00,3",
"2,2019-01-01 10:30:00,4",
"3,2019-01-01 12:30:00,1",
]
assert given_labels == labels
def test_search_offset_negative_0(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=lambda: None,
window_size=2,
)
match = "offset must be positive"
with pytest.raises(AssertionError, match=match):
lm.search(
transactions,
num_examples_per_instance=2,
minimum_data=-1,
gap=-1,
)
def test_search_offset_negative_1(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=lambda: None,
window_size=2,
)
match = "offset must be positive"
with pytest.raises(AssertionError, match=match):
lm.search(
transactions,
num_examples_per_instance=2,
minimum_data="-1h",
gap="-1h",
)
def test_search_invalid_n_examples(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
)
with pytest.raises(AssertionError, match="must specify gap"):
next(lm.slice(transactions, num_examples_per_instance=2))
with pytest.raises(AssertionError, match="must specify gap"):
lm.search(transactions, num_examples_per_instance=2)
def test_column_based_windows(transactions, total_spent_fn):
session_id = [1, 2, 3, 3, 4, 5, 5, 5, 6, 7]
df = transactions.assign(session_id=session_id)
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
window_size="session_id",
labeling_function=total_spent_fn,
)
actual = lm.search(df, -1).pipe(to_csv, index=False)
expected = [
"customer_id,time,total_spent",
"0,2019-01-01 08:00:00,1",
"0,2019-01-01 08:30:00,1",
"1,2019-01-01 09:00:00,2",
"1,2019-01-01 10:00:00,1",
"2,2019-01-01 10:30:00,3",
"2,2019-01-01 12:00:00,1",
"3,2019-01-01 12:30:00,1",
]
assert actual == expected
def test_search_with_invalid_index(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=lambda df: None,
window_size=2,
)
df = transactions.sample(n=10, random_state=0)
match = "data frame must be sorted chronologically"
with pytest.raises(AssertionError, match=match):
lm.search(df, num_examples_per_instance=2)
df = transactions.assign(time=pd.NaT)
match = "index contains null values"
with pytest.raises(AssertionError, match=match):
lm.search(df, num_examples_per_instance=2)
def test_search_on_empty_labels(transactions):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=lambda ds: None,
window_size=2,
)
given_labels = lm.search(
transactions,
minimum_data=1,
num_examples_per_instance=2,
gap=1,
)
assert given_labels.empty
def test_data_slice_overlap(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
window_size="1h",
)
for ds in lm.slice(transactions, num_examples_per_instance=2):
overlap = ds.index == ds.context.slice_stop
assert not overlap.any()
def test_label_type(transactions, total_spent_fn):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
)
lt = lm.search(transactions, num_examples_per_instance=1)
assert lt.target_types["total_spent"] == "continuous"
assert lt.bin(2).target_types["total_spent"] == "discrete"
def test_search_with_maximum_data(transactions):
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=len,
window_size="1h",
)
lt = lm.search(
df=transactions.sort_values("time"),
num_examples_per_instance=-1,
minimum_data="2019-01-01 08:00:00",
maximum_data="2019-01-01 09:00:00",
drop_empty=False,
)
expected = [
"customer_id,time,len",
"0,2019-01-01 08:00:00,2",
"0,2019-01-01 09:00:00,0",
"1,2019-01-01 08:00:00,0",
"1,2019-01-01 09:00:00,2",
"2,2019-01-01 08:00:00,0",
"2,2019-01-01 09:00:00,0",
"3,2019-01-01 08:00:00,0",
"3,2019-01-01 09:00:00,0",
]
actual = lt.pipe(to_csv, index=False)
assert actual == expected
lt = lm.search(
df=transactions.sort_values("time"),
num_examples_per_instance=-1,
maximum_data="30min",
drop_empty=False,
gap="30min",
)
expected = [
"customer_id,time,len",
"0,2019-01-01 08:00:00,2",
"0,2019-01-01 08:30:00,1",
"1,2019-01-01 09:00:00,2",
"1,2019-01-01 09:30:00,2",
"2,2019-01-01 10:30:00,2",
"2,2019-01-01 11:00:00,2",
"3,2019-01-01 12:30:00,1",
"3,2019-01-01 13:00:00,0",
]
actual = lt.pipe(to_csv, index=False)
assert actual == expected
@pytest.mark.parametrize(
"minimum_data",
[
{1: "2019-01-01 09:30:00", 2: "2019-01-01 11:30:00"},
{1: pd.Timedelta("30min"), 2: pd.Timedelta("1h")},
{1: 1, 2: 2},
],
)
def test_minimum_data_per_group(transactions, minimum_data):
lm = LabelMaker(
"customer_id",
labeling_function=len,
time_index="time",
window_size="1h",
)
for supported_type in [minimum_data, pd.Series(minimum_data)]:
lt = lm.search(transactions, 1, minimum_data=supported_type)
actual = to_csv(lt, index=False)
expected = [
"customer_id,time,len",
"1,2019-01-01 09:30:00,2",
"2,2019-01-01 11:30:00,2",
]
assert actual == expected
def test_minimum_data_per_group_error(transactions):
lm = LabelMaker(
"customer_id",
labeling_function=len,
time_index="time",
window_size="1h",
)
data = ["2019-01-01 09:00:00", "2019-01-01 12:00:00"]
minimum_data = pd.Series(data=data, index=[1, 1])
match = "more than one cutoff time exists for a target group"
with pytest.raises(ValueError, match=match):
lm.search(transactions, 1, minimum_data=minimum_data)
def test_label_maker_categorical_target_with_missing_data(transactions, total_spent_fn):
transactions = transactions.copy()
transactions["customer_id"] = transactions["customer_id"].astype("category")
lm = LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
window_size=3,
labeling_function=total_spent_fn,
)
# use only the first 8 rows so the df will not contain data for customer 3
lm.search(transactions.head(8), -1)
================================================
FILE: composeml/tests/test_label_plots.py
================================================
from pytest import raises
def test_count_by_time_categorical(total_spent):
total_spent = total_spent.bin(2, labels=range(2))
title = total_spent.plot.count_by_time().get_title()
assert title == "Label Count vs. Cutoff Times"
def test_count_by_time_continuous(total_spent):
title = total_spent.plot.count_by_time().get_title()
assert title == "Label vs. Cutoff Times"
def test_distribution_categorical(total_spent):
ax = total_spent.bin(2, labels=range(2))
title = ax.plot.dist().get_title()
assert title == "Label Distribution"
def test_distribution_continuous(total_spent):
title = total_spent.plot.dist().get_title()
assert title == "Label Distribution"
def test_single_target(total_spent):
lt = total_spent.copy()
lt.target_columns.append("target_2")
match = "must first select an individual target"
with raises(AssertionError, match=match):
lt.plot.dist()
with raises(AssertionError, match=match):
lt.plot.count_by_time()
================================================
FILE: composeml/tests/test_label_serialization.py
================================================
import os
import shutil
import pandas as pd
import pytest
import composeml as cp
@pytest.fixture
def path():
pwd = os.path.dirname(__file__)
path = os.path.join(pwd, ".cache")
yield path
shutil.rmtree(path)
@pytest.fixture
def total_spent(transactions, total_spent_fn):
lm = cp.LabelMaker(
target_dataframe_index="customer_id",
time_index="time",
labeling_function=total_spent_fn,
)
lt = lm.search(transactions, num_examples_per_instance=1, verbose=False)
return lt
def test_csv(path, total_spent):
total_spent.to_csv(path)
total_spent_copy = cp.read_label_times(path)
pd.testing.assert_frame_equal(total_spent, total_spent_copy)
assert total_spent.equals(total_spent_copy)
def test_parquet(path, total_spent):
total_spent.to_parquet(path)
total_spent_copy = cp.read_label_times(path)
pd.testing.assert_frame_equal(total_spent, total_spent_copy)
assert total_spent.equals(total_spent_copy)
def test_pickle(path, total_spent):
total_spent.to_pickle(path)
total_spent_copy = cp.read_label_times(path)
pd.testing.assert_frame_equal(total_spent, total_spent_copy)
assert total_spent.equals(total_spent_copy)
================================================
FILE: composeml/tests/test_label_times.py
================================================
from pytest import raises
from composeml.label_times import LabelTimes
from composeml.tests.utils import to_csv
def test_count_by_time_categorical(total_spent):
given_answer = total_spent.bin(2, labels=range(2))
given_answer = to_csv(given_answer.count_by_time)
answer = [
"time,0,1",
"2019-01-01 08:00:00,0,1",
"2019-01-01 08:30:00,0,2",
"2019-01-01 09:00:00,0,3",
"2019-01-01 09:30:00,0,4",
"2019-01-01 10:00:00,0,5",
"2019-01-01 10:30:00,1,5",
"2019-01-01 11:00:00,2,5",
"2019-01-01 11:30:00,3,5",
"2019-01-01 12:00:00,4,5",
"2019-01-01 12:30:00,5,5",
]
assert given_answer == answer
def test_count_by_time_continuous(total_spent):
given_answer = total_spent.count_by_time
given_answer = to_csv(given_answer, header=True, index=True)
answer = [
"time,total_spent",
"2019-01-01 08:00:00,1",
"2019-01-01 08:30:00,2",
"2019-01-01 09:00:00,3",
"2019-01-01 09:30:00,4",
"2019-01-01 10:00:00,5",
"2019-01-01 10:30:00,6",
"2019-01-01 11:00:00,7",
"2019-01-01 11:30:00,8",
"2019-01-01 12:00:00,9",
"2019-01-01 12:30:00,10",
]
assert given_answer == answer
def test_sorted_distribution(capsys, total_spent):
bins = [0, 5, 10, 20]
total_spent.bin(bins).describe()
captured = capsys.readouterr()
out = "\n".join(
[
"Label Distribution",
"------------------",
"total_spent",
"(0, 5] 5",
"(5, 10] 4",
"(10, 20] 0",
"Total: 9",
"",
"",
"Settings",
"--------",
"num_examples_per_instance -1",
"target_column total_spent",
"target_dataframe_index customer_id",
"target_type discrete",
"",
"",
"Transforms",
"----------",
"1. bin",
" - bins: [0, 5, 10, 20]",
" - labels: None",
" - precision: 3",
" - quantiles: False",
" - right: True",
"",
"",
],
)
assert captured.out == out
def test_describe_no_transforms(capsys):
data = {"target": range(3)}
LabelTimes(data).describe()
captured = capsys.readouterr()
out = "\n".join(
[
"Label Distribution",
"------------------",
"count 3.0",
"mean 1.0",
"std 1.0",
"min 0.0",
"25% 0.5",
"50% 1.0",
"75% 1.5",
"max 2.0",
"",
"",
"Settings",
"--------",
"target_column target",
"target_dataframe_index None",
"target_type continuous",
"",
"",
"Transforms",
"----------",
"No transforms applied",
"",
"",
],
)
assert captured.out == out
def test_distribution_categorical(total_spent):
labels = range(2)
given_answer = total_spent.bin(2, labels=labels).distribution
given_answer = to_csv(given_answer)
answer = [
"total_spent,count",
"0,5",
"1,5",
]
assert given_answer == answer
def test_distribution_continuous(total_spent):
distribution = total_spent.distribution
actual = to_csv(distribution.round(4))
expected = [
",total_spent",
"count,10.0",
"mean,4.5",
"std,3.0277",
"min,0.0",
"25%,2.25",
"50%,4.5",
"75%,6.75",
"max,9.0",
]
assert actual == expected
def test_target_type(total_spent):
types = total_spent.target_types
assert types["total_spent"] == "continuous"
total_spent = total_spent.threshold(5)
types = total_spent.target_types
assert types["total_spent"] == "discrete"
def test_count(total_spent):
given_answer = total_spent.count
given_answer = to_csv(given_answer, index=True)
answer = [
"customer_id,count",
"0,2",
"1,3",
"2,4",
"3,1",
]
assert given_answer == answer
def test_label_select_errors(total_spent):
match = "only one target exists"
with raises(AssertionError, match=match):
total_spent.select("a")
lt = total_spent.copy()
lt.target_columns.append("b")
match = "target name must be string"
with raises(TypeError, match=match):
total_spent.select(123)
match = 'target "a" not found'
with raises(AssertionError, match=match):
lt.select("a")
================================================
FILE: composeml/tests/test_label_transforms/__init__.py
================================================
================================================
FILE: composeml/tests/test_label_transforms/test_bin.py
================================================
import pandas as pd
from pytest import raises
def test_bins(labels):
given_labels = labels.bin(2)
transform = given_labels.transforms[0]
assert transform["transform"] == "bin"
assert transform["bins"] == 2
assert transform["quantiles"] is False
assert transform["labels"] is None
assert transform["right"] is True
answer = [
pd.Interval(157.5, 283.46, closed="right"),
pd.Interval(31.288, 157.5, closed="right"),
pd.Interval(157.5, 283.46, closed="right"),
pd.Interval(31.288, 157.5, closed="right"),
]
answer = pd.Categorical(answer, ordered=True)
labels = labels.assign(my_labeling_function=answer)
pd.testing.assert_frame_equal(given_labels, labels)
def test_quantile_bins(labels):
given_labels = labels.bin(2, quantiles=True)
transform = given_labels.transforms[0]
assert transform["transform"] == "bin"
assert transform["bins"] == 2
assert transform["quantiles"] is True
assert transform["labels"] is None
assert transform["right"] is True
answer = [
pd.Interval(137.44, 283.46, closed="right"),
pd.Interval(31.538999999999998, 137.44, closed="right"),
pd.Interval(137.44, 283.46, closed="right"),
pd.Interval(31.538999999999998, 137.44, closed="right"),
]
answer = pd.Categorical(answer, ordered=True)
labels = labels.assign(my_labeling_function=answer)
pd.testing.assert_frame_equal(given_labels, labels)
def test_single_target(total_spent):
lt = total_spent.copy()
lt.target_columns.append("target_2")
match = "must first select an individual target"
with raises(AssertionError, match=match):
lt.bin(2)
================================================
FILE: composeml/tests/test_label_transforms/test_lead.py
================================================
import pandas as pd
def test_lead(labels):
labels = labels.apply_lead("10min")
transform = labels.transforms[0]
assert transform["transform"] == "apply_lead"
assert transform["value"] == "10min"
answer = [
"2014-01-01 00:35:00",
"2014-01-01 00:38:00",
"2013-12-31 23:51:00",
"2013-12-31 23:54:00",
]
time = pd.Series(answer, name="time", dtype="datetime64[ns]")
time = time.rename_axis("label_id")
pd.testing.assert_series_equal(labels["time"], time)
================================================
FILE: composeml/tests/test_label_transforms/test_sample.py
================================================
import pytest
from composeml import LabelTimes
from composeml.tests.utils import read_csv, to_csv
@pytest.fixture
def labels(labels):
return labels.threshold(100)
def test_sample_n_int(labels):
given_answer = labels.sample(n=2, random_state=0)
given_answer = given_answer.sort_index()
given_answer = to_csv(given_answer, index=True)
answer = [
"label_id,customer_id,time,my_labeling_function",
"2,2,2014-01-01 00:01:00,True",
"3,2,2014-01-01 00:04:00,False",
]
assert given_answer == answer
def test_sample_n_per_label(labels):
n = {True: 1, False: 2}
given_answer = labels.sample(n=n, random_state=0)
given_answer = given_answer.sort_index()
given_answer = to_csv(given_answer, index=True)
answer = [
"label_id,customer_id,time,my_labeling_function",
"1,1,2014-01-01 00:48:00,False",
"2,2,2014-01-01 00:01:00,True",
"3,2,2014-01-01 00:04:00,False",
]
assert given_answer == answer
def test_sample_frac_int(labels):
given_answer = labels.sample(frac=0.25, random_state=0)
given_answer = given_answer.sort_index()
given_answer = to_csv(given_answer, index=True)
answer = [
"label_id,customer_id,time,my_labeling_function",
"2,2,2014-01-01 00:01:00,True",
]
assert given_answer == answer
def test_sample_frac_per_label(labels):
frac = {True: 1.0, False: 0.5}
given_answer = labels.sample(frac=frac, random_state=0)
given_answer = given_answer.sort_index()
given_answer = to_csv(given_answer, index=True)
answer = [
"label_id,customer_id,time,my_labeling_function",
"0,1,2014-01-01 00:45:00,True",
"2,2,2014-01-01 00:01:00,True",
"3,2,2014-01-01 00:04:00,False",
]
assert given_answer == answer
def test_sample_in_transforms(labels):
n = {True: 2, False: 2}
transform = {
"transform": "sample",
"n": n,
"frac": None,
"random_state": None,
"replace": False,
"per_instance": False,
}
sample = labels.sample(n=n)
assert transform != labels.transforms[-1]
assert transform == sample.transforms[-1]
def test_sample_with_replacement(labels):
assert labels.shape[0] < 20
n = {True: 10, False: 10}
sample = labels.sample(n=n, replace=True)
assert sample.shape[0] == 20
def test_single_target(total_spent):
lt = total_spent.copy()
lt.target_columns.append("target_2")
match = "must first select an individual target"
with pytest.raises(AssertionError, match=match):
lt.sample(2)
def test_sample_n_per_instance():
data = read_csv(
[
"target_dataframe_index,labels",
"0,a",
"0,b",
"1,a",
"1,b",
],
)
lt = LabelTimes(data=data, target_dataframe_index="target_dataframe_index")
sample = lt.sample(n={"a": 1}, per_instance=True, random_state=0)
actual = to_csv(sample, index=False)
expected = [
"target_dataframe_index,labels",
"0,a",
"1,a",
]
assert expected == actual
def test_sample_frac_per_instance():
data = read_csv(
[
"target_dataframe_index,labels",
"0,a",
"0,a",
"0,a",
"0,a",
"1,a",
"1,a",
],
)
lt = LabelTimes(data=data, target_dataframe_index="target_dataframe_index")
sample = lt.sample(frac={"a": 0.5}, per_instance=True, random_state=0)
actual = to_csv(sample, index=False)
expected = [
"target_dataframe_index,labels",
"0,a",
"0,a",
"1,a",
]
assert expected == actual
================================================
FILE: composeml/tests/test_label_transforms/test_threshold.py
================================================
from pytest import raises
def test_threshold(labels):
labels = labels.threshold(200)
transform = labels.transforms[0]
assert transform["transform"] == "threshold"
assert transform["value"] == 200
answer = [True, False, True, False]
target_column = labels.target_columns[0]
given_answer = labels[target_column].values.tolist()
assert given_answer == answer
def test_single_target(total_spent):
lt = total_spent.copy()
lt.target_columns.append("target_2")
match = "must first select an individual target"
with raises(AssertionError, match=match):
lt.threshold(200)
================================================
FILE: composeml/tests/test_version.py
================================================
from composeml import __version__
def test_version():
assert __version__ == "0.10.1"
================================================
FILE: composeml/tests/utils.py
================================================
from io import StringIO
import pandas as pd
def read_csv(data, **kwargs):
"""Helper function for creating a dataframe from in-memory CSV string (or list of strings).
Args:
data (str or list) : CSV string(s)
Returns:
DataFrame : Instance of a dataframe.
"""
if isinstance(data, list):
data = "\n".join(data)
# StringIO wraps the CSV string in a file-like object for pandas to read.
with StringIO(data) as data:
df = pd.read_csv(data, **kwargs)
return df
def to_csv(label_times, **kwargs):
"""Converts label times to a list of CSV lines for comparison in tests."""
df = pd.DataFrame(label_times)
csv = df.to_csv(**kwargs)
return csv.splitlines()
================================================
FILE: composeml/update_checker.py
================================================
from pkg_resources import iter_entry_points
for entry_point in iter_entry_points("alteryx_open_src_initialize"):
try:
method = entry_point.load()
if callable(method):
method("composeml")
except Exception:
pass
================================================
FILE: composeml/version.py
================================================
__version__ = "0.10.1"
================================================
FILE: contributing.md
================================================
# Contributing to Compose
:+1::tada: First off, thank you for taking the time to contribute! :tada::+1:
Whether you are a novice or experienced software developer, all contributions and suggestions are welcome!
There are many ways to contribute to Compose, with the most common ones being contribution of code or documentation to the project.
**To contribute, you can:**
1. Help users on our [Slack channel](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA), or answer questions under the composeml tag on [Stack Overflow](https://stackoverflow.com/questions/tagged/composeml)
2. Submit a pull request for one of [Good First Issues](https://github.com/alteryx/compose/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+First+Issue%22)
3. Make changes to the codebase, see [Contributing to the codebase](#Contributing-to-the-Codebase).
4. Improve our documentation, which can be found under the [docs](docs/) directory or at https://compose.alteryx.com/en/stable/
5. [Report issues](#Report-issues) you're facing, and give a "thumbs up" on issues that others have reported and that are relevant to you. Issues should be used for bugs and feature requests only.
6. Spread the word: reference Compose from your blog and articles, link to it from your website, or simply star the [Compose GitHub page](https://github.com/alteryx/compose) to say "I use it".
## Contributing to the Codebase
Before starting major work, you should touch base with the maintainers of Compose by filing an issue on GitHub or posting a message in the [#development channel on Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA). This will increase the likelihood your pull request will eventually get merged in.
#### 1. Fork and clone repo
* The code is hosted on GitHub, so you will need to use Git to fork the project and make changes to the codebase. To start, go to the [Compose GitHub page](https://github.com/alteryx/compose) and click the `Fork` button.
* After you have created the fork, you will want to clone the fork to your machine and connect your version of the project to the upstream Compose repo.
```bash
git clone https://github.com/your-user-name/compose.git
cd compose
git remote add upstream https://github.com/alteryx/compose
```
* Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. You can run the following steps to create a separate virtual environment, and install Compose in editable mode.
```bash
python -m venv venv
source venv/bin/activate
make installdeps
git checkout -b issue####-branch_name
```
#### 2. Implement your Pull Request
* Implement your pull request. If needed, add new tests or update the documentation.
* Before submitting to GitHub, verify that the tests pass and the code lints properly:
```bash
# runs linting
make lint
# will fix some common linting issues automatically
make lint-fix
# runs tests
make test
```
* If you made changes to the documentation, build the documentation locally.
```bash
# go to docs and build
cd docs
make html
# view docs locally
open build/html/index.html
```
#### 3. Submit your Pull Request
* Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request.
* If you need to update your code with the latest changes from the main Compose repo, run the commands below to merge the latest changes from the Compose `main` branch into your current local branch. You may need to resolve merge conflicts if your changes conflict with the upstream changes. After the merge, push the updates to your forked repo.
```bash
git fetch upstream
git merge upstream/main
```
* Create a pull request to merge the changes from your forked repo branch into the Compose `main` branch. Creating the pull request will automatically run our continuous integration.
* If this is your first contribution, you will need to sign the Contributor License Agreement as directed.
* Update the "Future Release" section of the release notes (`docs/source/release_notes.rst`) to include your pull request, and add your GitHub username to the list of contributors. Add a description of your PR to the subsection that most closely matches your contribution:
* Enhancements: new features or additions to Compose.
* Fixes: things like bugfixes or adding more descriptive error messages.
* Changes: modifications to an existing part of Compose.
* Documentation Changes
* Testing Changes
Documentation or testing changes rarely warrant an individual release notes entry; the PR number can be added to their respective "Miscellaneous changes" entries.
* We will review your changes, and you will most likely be asked to make additional changes before your pull request is ready to merge. However, once it has been reviewed by a maintainer of Compose and passes continuous integration, we will merge it, and you will have successfully contributed to Compose!
## Report issues
When reporting issues, please include as much detail as possible about your operating system, Compose version, and Python version. Whenever possible, please also include a brief, self-contained code example that demonstrates the problem.
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd
================================================
FILE: docs/source/_static/style.css
================================================
.footer {
background-color: #0D2345;
padding-bottom: 40px;
padding-top: 40px;
width: 100%;
}
.footer-cell-1 {
grid-row: 1;
grid-column: 1 / 3;
}
.footer-cell-2 {
grid-row: 1;
grid-column: 4;
margin-bottom: 15px;
text-align: right;
}
.footer-cell-3 {
grid-row: 2;
grid-column: 1 / 5;
}
.footer-cell-4 {
grid-row: 3;
grid-column: 1 / 3;
}
.footer-container {
display: grid;
margin-left: 10%;
margin-right: 10%;
}
.footer-image-alteryx {
padding-top: 22px;
width: 270px;
}
.footer-image-copyright {
width: 180px;
}
.footer-image-github {
width: 50px;
}
.footer-image-twitter {
width: 60px;
}
.footer-line {
border-top: 2px solid white;
margin-left: 7px;
margin-right: 15px;
}
================================================
FILE: docs/source/_templates/class.rst
================================================
{{ fullname | escape | underline}}
.. currentmodule:: {{ module }}
.. autoclass:: {{ objname }}
{% block methods %}
{% if methods %}
.. rubric:: Methods
.. autosummary::
:nosignatures:
:toctree: methods
{% for item in methods %}
{%- if item not in inherited_members %}
~{{ name }}.{{ item }}
{%- endif %}
{%- endfor %}
{% endif %}
{% endblock %}
================================================
FILE: docs/source/_templates/layout.html
================================================
{% extends "!layout.html" %}
{%- block extrahead %}
<script>
!function () {
var analytics = window.analytics = window.analytics || []; if (!analytics.initialize) if (analytics.invoked) window.console && console.error && console.error("Segment snippet included twice."); else {
analytics.invoked = !0; analytics.methods = ["trackSubmit", "trackClick", "trackLink", "trackForm", "pageview", "identify", "reset", "group", "track", "ready", "alias", "debug", "page", "once", "off", "on"]; analytics.factory = function (t) { return function () { var e = Array.prototype.slice.call(arguments); e.unshift(t); analytics.push(e); return analytics } }; for (var t = 0; t < analytics.methods.length; t++) { var e = analytics.methods[t]; analytics[e] = analytics.factory(e) } analytics.load = function (t, e) { var n = document.createElement("script"); n.type = "text/javascript"; n.async = !0; n.src = "https://cdn.segment.com/analytics.js/v1/" + t + "/analytics.min.js"; var a = document.getElementsByTagName("script")[0]; a.parentNode.insertBefore(n, a); analytics._loadOptions = e }; analytics.SNIPPET_VERSION = "4.1.0";
analytics.load("ze8imyBlahLiQl1WxZCnHzhNWgviYKOn");
analytics.page();
}
}();
</script>
{% set image = 'https://alteryx-oss-web-images.s3.amazonaws.com/compose_open_graph.png' %}
{% set description = 'A machine learning tool for automated prediction engineering' %}
{% if meta is defined %}
{% if meta.description is defined %}
{% set description = meta.description %}
{% endif %}
{% endif %}
<meta property="og:title" content="{{ title|striptags|e }}{{ titlesuffix }}">
<meta name="description" content="{{description}}" />
<meta property="og:description" content="{{description}}">
<meta property="og:image" content="{{image}}">
<meta property="twitter:image" content="{{image}}">
<meta name="twitter:card" content="summary_large_image">
{% endblock %}
{%- block footer %}
<footer class="footer">
<div class="footer-container">
<div class="footer-cell-1">
<img class="footer-image-alteryx" src="{{ pathto('_static/images/alteryx_open_source.svg', 1) }}" alt="Alteryx Open Source">
</div>
<div class="footer-cell-2">
<a href="https://github.com/alteryx/compose" target="_blank">
<img class="footer-image-github" src="{{ pathto('_static/images/github.svg', 1) }}" alt="GitHub">
</a>
<a href="https://twitter.com/AlteryxOSS" target="_blank">
<img class="footer-image-twitter" src="{{ pathto('_static/images/twitter.svg', 1) }}" alt="Twitter">
</a>
</div>
<div class="footer-cell-3">
<hr class="footer-line">
</div>
<div class="footer-cell-4">
<img class="footer-image-copyright" src="{{ pathto('_static/images/copyright.svg', 1) }}" alt="Copyright">
</div>
</div>
</footer>
{% endblock %}
================================================
FILE: docs/source/api_reference.rst
================================================
.. currentmodule:: composeml
=============
API Reference
=============
Label Maker
===========
.. autosummary::
:toctree: generated
:template: class.rst
:nosignatures:
LabelMaker
Label Times
============
.. autosummary::
:toctree: generated
:template: class.rst
:nosignatures:
LabelTimes
Transform Methods
-----------------
.. autosummary::
:nosignatures:
LabelTimes.apply_lead
LabelTimes.bin
LabelTimes.sample
LabelTimes.threshold
.. currentmodule:: composeml.label_times.plots
Label Plots
===========
.. autosummary::
:toctree: generated
:template: class.rst
:nosignatures:
LabelPlots
Plotting Methods
----------------
.. autosummary::
:nosignatures:
LabelPlots.count_by_time
LabelPlots.distribution
================================================
FILE: docs/source/conf.py
================================================
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
from composeml import __version__ as version
# -- Project information -----------------------------------------------------
project = "Compose"
copyright = "2020, Alteryx, Inc."
author = "Alteryx, Inc."
# The full version, including alpha/beta/rc tags
release = version
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"nbsphinx",
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinx.ext.extlinks",
"sphinx_inline_tabs",
"sphinx_copybutton",
"myst_parser",
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = ".rst"
# The master toctree document.
master_doc = "index"
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["**.ipynb_checkpoints"]
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "pydata_sphinx_theme"
html_logo = "images/compose_nav2.png"
html_favicon = "images/favicon.ico"
html_theme_options = {
"icon_links": [
{
"name": "GitHub",
"url": "https://github.com/alteryx/compose",
"icon": "fab fa-github-square",
"type": "fontawesome",
},
{
"name": "Twitter",
"url": "https://twitter.com/AlteryxOSS",
"icon": "fab fa-twitter-square",
"type": "fontawesome",
},
{
"name": "Slack",
"url": "https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA",
"icon": "fab fa-slack",
"type": "fontawesome",
},
],
"collapse_navigation": False,
"navigation_depth": 2,
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = "Composedoc"
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, "Compose.tex", "Compose Documentation", "Alteryx, Inc.", "manual"),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, "composeml", "Compose Documentation", [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(
master_doc,
"Compose",
"Compose Documentation",
author,
"Compose",
        "A machine learning tool for automated prediction engineering",
"Miscellaneous",
),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ["search.html"]
# -- Options for Markdown files ----------------------------------------------
myst_admonition_enable = True
myst_deflist_enable = True
myst_heading_anchors = 3
# -- Options for Sphinx Copy Button ------------------------------------------
copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
copybutton_prompt_is_regexp = True
# -- Extension configuration -------------------------------------------------
extlinks = {
"issue": ("https://github.com/alteryx/compose/issues/%s", "#"),
"pr": ("https://github.com/alteryx/compose/pull/%s", "#"),
"user": ("https://github.com/%s", "@"),
}
autosummary_generate = ["api_reference.rst"]
templates_path = ["_templates"]
def setup(app):
app.add_css_file("style.css")
html_show_sphinx = False
================================================
FILE: docs/source/examples/demo/__init__.py
================================================
import os
import warnings
warnings.filterwarnings("ignore")
PWD = os.path.dirname(__file__)
================================================
FILE: docs/source/examples/demo/chicago_bike/__init__.py
================================================
from demo import PWD
from pandas import read_csv
from os.path import join
PWD = join(PWD, "chicago_bike")
def _read(file):
return read_csv(
join(PWD, file),
parse_dates=["starttime", "stoptime"],
index_col="trip_id",
)
def load_sample():
return _read("sample.csv")
================================================
FILE: docs/source/examples/demo/chicago_bike/sample.csv
================================================
trip_id,gender,starttime,stoptime,tripduration,temperature,events,from_station_id,dpcapacity_start,to_station_id,dpcapacity_end
2331610,Female,2014-06-29 13:35:00,2014-06-29 13:56:00,20.75,82.9,cloudy,178,15.0,76,39.0
2347603,Female,2014-06-30 12:07:00,2014-06-30 12:37:00,30.15,82.0,cloudy,211,19.0,177,15.0
2345120,Male,2014-06-30 08:36:00,2014-06-30 08:43:00,6.516666666666668,75.0,cloudy,340,15.0,67,15.0
2347527,Male,2014-06-30 12:00:00,2014-06-30 12:08:00,7.25,82.0,cloudy,56,19.0,56,19.0
2344421,Male,2014-06-30 08:04:00,2014-06-30 08:11:00,7.316666666666666,75.0,cloudy,77,23.0,37,19.0
2336431,Male,2014-06-29 16:31:00,2014-06-29 16:53:00,21.816666666666666,84.9,cloudy,349,15.0,13,19.0
2351574,Male,2014-06-30 16:55:00,2014-06-30 17:07:00,11.783333333333333,84.0,cloudy,190,15.0,93,15.0
2351672,Male,2014-06-30 16:58:00,2014-06-30 17:13:00,15.533333333333333,84.0,cloudy,37,19.0,289,19.0
2351751,Male,2014-06-30 17:00:00,2014-06-30 17:07:00,7.566666666666666,84.0,cloudy,169,15.0,134,19.0
2331505,Female,2014-06-29 13:29:00,2014-06-29 13:36:00,7.716666666666668,82.9,cloudy,181,31.0,106,27.0
2336748,Male,2014-06-29 16:44:00,2014-06-29 17:05:00,21.08333333333333,84.9,cloudy,268,15.0,232,23.0
2350914,Male,2014-06-30 16:27:00,2014-06-30 16:36:00,8.3,84.0,cloudy,49,27.0,191,23.0
2341339,Female,2014-06-29 20:48:00,2014-06-29 21:15:00,26.88333333333333,81.0,cloudy,99,19.0,62,27.0
2352857,Female,2014-06-30 17:36:00,2014-06-30 17:40:00,4.4,84.0,cloudy,304,15.0,303,15.0
2340830,Female,2014-06-29 20:11:00,2014-06-29 20:19:00,7.5166666666666675,81.0,cloudy,123,15.0,116,15.0
2343445,Male,2014-06-30 07:00:00,2014-06-30 07:03:00,3.083333333333333,73.0,cloudy,74,23.0,48,27.0
2338352,Male,2014-06-29 17:59:00,2014-06-29 18:11:00,11.7,84.9,cloudy,84,19.0,134,19.0
2338548,Male,2014-06-29 18:07:00,2014-06-29 18:41:00,33.9,84.2,cloudy,349,15.0,73,19.0
2338565,Male,2014-06-29 18:08:00,2014-06-29 18:25:00,17.183333333333334,84.2,cloudy,176,19.0,152,15.0
2340701,Male,2014-06-29 19:58:00,2014-06-29 20:28:00,30.36666666666667,82.0,cloudy,234,19.0,293,19.0
2338809,Male,2014-06-29 18:22:00,2014-06-29 18:43:00,21.233333333333334,84.2,cloudy,254,15.0,249,15.0
2349824,Male,2014-06-30 15:15:00,2014-06-30 15:29:00,13.45,84.9,cloudy,283,23.0,186,15.0
2352795,Male,2014-06-30 17:34:00,2014-06-30 17:51:00,17.233333333333334,84.0,cloudy,168,19.0,168,19.0
2350998,Male,2014-06-30 16:26:00,2014-06-30 16:35:00,9.2,84.0,cloudy,52,31.0,91,31.0
2349370,Male,2014-06-30 14:37:00,2014-06-30 14:48:00,11.433333333333335,87.1,cloudy,290,15.0,213,15.0
2354718,Male,2014-06-30 19:49:00,2014-06-30 20:07:00,18.733333333333334,73.0,tstorms,55,15.0,55,15.0
2351674,Male,2014-06-30 16:58:00,2014-06-30 17:06:00,8.1,84.0,cloudy,264,19.0,66,19.0
2338317,Male,2014-06-29 17:57:00,2014-06-29 18:34:00,37.3,84.9,cloudy,263,11.0,75,23.0
2342680,Male,2014-06-29 23:22:00,2014-06-29 23:28:00,5.416666666666668,78.1,cloudy,350,15.0,214,15.0
2341145,Male,2014-06-29 20:33:00,2014-06-29 21:02:00,28.766666666666666,81.0,cloudy,341,19.0,150,11.0
2352020,Male,2014-06-30 17:06:00,2014-06-30 17:15:00,9.116666666666667,84.0,cloudy,264,19.0,212,31.0
2334784,Male,2014-06-29 15:27:00,2014-06-29 15:54:00,26.91666666666667,82.9,cloudy,324,15.0,295,15.0
2344311,Male,2014-06-30 07:57:00,2014-06-30 08:11:00,14.133333333333333,73.0,cloudy,75,23.0,35,39.0
2343074,Male,2014-06-30 05:35:00,2014-06-30 05:41:00,5.9833333333333325,73.0,cloudy,332,15.0,327,19.0
2341720,Female,2014-06-29 21:15:00,2014-06-29 21:19:00,3.5166666666666666,79.0,cloudy,248,15.0,322,15.0
2353889,Male,2014-06-30 18:18:00,2014-06-30 18:21:00,2.9166666666666665,82.0,tstorms,309,11.0,158,15.0
2342764,Male,2014-06-29 23:50:00,2014-06-30 00:13:00,23.48333333333333,78.1,cloudy,244,19.0,303,15.0
2352400,Female,2014-06-30 17:20:00,2014-06-30 17:42:00,21.95,84.0,cloudy,81,39.0,340,15.0
2331873,Male,2014-06-29 13:46:00,2014-06-29 14:10:00,24.91666666666667,82.9,cloudy,156,15.0,94,19.0
2351727,Male,2014-06-30 16:59:00,2014-06-30 17:16:00,16.633333333333333,84.0,cloudy,134,19.0,69,19.0
2345935,Male,2014-06-30 09:22:00,2014-06-30 09:30:00,7.6,78.1,cloudy,255,31.0,90,35.0
2353412,Male,2014-06-30 17:56:00,2014-06-30 18:23:00,26.88333333333333,84.0,cloudy,51,31.0,223,15.0
2349890,Male,2014-06-30 15:22:00,2014-06-30 15:30:00,8.0,84.9,cloudy,181,31.0,106,27.0
2336202,Male,2014-06-29 16:22:00,2014-06-29 16:29:00,6.833333333333332,84.9,cloudy,268,15.0,143,15.0
2332401,Male,2014-06-29 14:04:00,2014-06-29 14:32:00,28.3,84.0,cloudy,312,15.0,94,19.0
2348888,Male,2014-06-30 13:56:00,2014-06-30 13:59:00,3.15,84.9,cloudy,60,19.0,93,15.0
2347827,Male,2014-06-30 12:20:00,2014-06-30 12:31:00,11.85,82.0,cloudy,287,27.0,291,19.0
2335752,Male,2014-06-29 16:05:00,2014-06-29 16:12:00,7.05,84.9,cloudy,144,15.0,87,19.0
2346051,Male,2014-06-30 09:33:00,2014-06-30 09:49:00,16.85,78.1,cloudy,120,15.0,51,31.0
2348804,Male,2014-06-30 13:49:00,2014-06-30 13:57:00,8.366666666666667,84.9,cloudy,51,31.0,26,31.0
2337626,Male,2014-06-29 17:22:00,2014-06-29 17:30:00,8.1,84.9,cloudy,16,11.0,309,11.0
2336628,Male,2014-06-29 16:39:00,2014-06-29 16:55:00,16.25,84.9,cloudy,154,15.0,69,19.0
2349449,Male,2014-06-30 14:44:00,2014-06-30 14:49:00,5.466666666666668,87.1,cloudy,185,11.0,290,15.0
2352026,Male,2014-06-30 17:08:00,2014-06-30 17:18:00,9.366666666666667,84.0,cloudy,195,31.0,91,31.0
2339115,Male,2014-06-29 18:36:00,2014-06-29 18:41:00,4.35,84.2,cloudy,114,27.0,232,23.0
2345532,Male,2014-06-30 08:56:00,2014-06-30 09:00:00,3.9166666666666665,75.0,cloudy,174,23.0,98,15.0
2332243,Male,2014-06-29 13:58:00,2014-06-29 14:24:00,26.33333333333333,82.9,cloudy,333,15.0,93,15.0
2346818,Male,2014-06-30 10:52:00,2014-06-30 10:59:00,6.833333333333332,78.1,cloudy,48,27.0,291,19.0
2353493,Male,2014-06-30 17:59:00,2014-06-30 18:17:00,17.933333333333334,84.0,cloudy,66,19.0,69,19.0
2337808,Female,2014-06-29 17:31:00,2014-06-29 17:47:00,16.416666666666668,84.9,cloudy,35,39.0,255,31.0
2354440,Male,2014-06-30 18:52:00,2014-06-30 18:56:00,4.166666666666667,82.0,tstorms,75,23.0,198,19.0
2344857,Male,2014-06-30 08:25:00,2014-06-30 08:35:00,10.116666666666667,75.0,cloudy,77,23.0,37,19.0
2338197,Male,2014-06-29 17:51:00,2014-06-29 18:09:00,18.766666666666666,84.9,cloudy,177,15.0,99,19.0
2349939,Male,2014-06-30 15:24:00,2014-06-30 15:30:00,5.4,84.9,cloudy,72,15.0,338,15.0
2347092,Male,2014-06-30 11:22:00,2014-06-30 11:32:00,9.733333333333333,79.0,cloudy,110,23.0,194,11.0
2347136,Female,2014-06-30 11:26:00,2014-06-30 11:31:00,4.966666666666667,79.0,cloudy,74,23.0,181,31.0
2342887,Male,2014-06-30 00:38:00,2014-06-30 00:43:00,5.616666666666666,78.1,cloudy,226,15.0,300,15.0
2344767,Female,2014-06-30 08:20:00,2014-06-30 08:41:00,21.116666666666667,75.0,cloudy,220,19.0,173,15.0
2351335,Male,2014-06-30 16:47:00,2014-06-30 16:53:00,6.1,84.0,cloudy,51,31.0,192,39.0
2348169,Male,2014-06-30 12:54:00,2014-06-30 13:02:00,8.4,82.0,cloudy,66,19.0,110,23.0
2354487,Male,2014-06-30 18:56:00,2014-06-30 19:05:00,8.833333333333334,82.0,tstorms,118,19.0,34,15.0
2345814,Male,2014-06-30 09:13:00,2014-06-30 09:30:00,16.916666666666668,78.1,cloudy,24,15.0,90,35.0
2344619,Male,2014-06-30 08:14:00,2014-06-30 08:38:00,23.66666666666667,75.0,cloudy,131,15.0,110,23.0
2336070,Female,2014-06-29 16:16:00,2014-06-29 16:36:00,19.983333333333334,84.9,cloudy,97,35.0,137,15.0
2353002,Male,2014-06-30 17:40:00,2014-06-30 17:46:00,5.133333333333334,84.0,cloudy,210,19.0,183,15.0
2334944,Male,2014-06-29 15:33:00,2014-06-29 16:09:00,35.65,82.9,cloudy,249,15.0,234,19.0
2352711,Male,2014-06-30 17:31:00,2014-06-30 17:41:00,9.916666666666666,84.0,cloudy,212,31.0,192,39.0
2346848,Male,2014-06-30 10:55:00,2014-06-30 11:02:00,7.116666666666666,78.1,cloudy,240,23.0,117,23.0
2334688,Male,2014-06-29 15:23:00,2014-06-29 15:46:00,23.466666666666665,82.9,cloudy,324,15.0,85,23.0
2346252,Female,2014-06-30 09:50:00,2014-06-30 09:58:00,8.016666666666667,78.1,cloudy,144,15.0,60,19.0
2343617,Male,2014-06-30 07:16:00,2014-06-30 07:22:00,5.85,73.0,cloudy,292,11.0,229,19.0
2344957,Male,2014-06-30 08:29:00,2014-06-30 08:40:00,11.433333333333335,75.0,cloudy,232,23.0,251,15.0
2339764,Male,2014-06-29 19:12:00,2014-06-29 19:28:00,16.116666666666667,82.0,cloudy,164,23.0,30,15.0
2339479,Female,2014-06-29 18:54:00,2014-06-29 19:13:00,18.666666666666668,84.2,cloudy,225,15.0,157,15.0
2344240,Female,2014-06-30 07:52:00,2014-06-30 08:01:00,9.35,73.0,cloudy,310,11.0,87,19.0
2351425,Female,2014-06-30 16:50:00,2014-06-30 16:52:00,2.083333333333333,84.0,cloudy,314,15.0,244,19.0
2351605,Male,2014-06-30 16:53:00,2014-06-30 17:05:00,11.95,84.0,cloudy,120,15.0,279,15.0
2334617,Male,2014-06-29 15:21:00,2014-06-29 15:59:00,38.05,82.9,cloudy,346,15.0,215,15.0
2338547,Male,2014-06-29 18:07:00,2014-06-29 18:21:00,14.466666666666667,84.2,cloudy,176,19.0,71,15.0
2340148,Male,2014-06-29 19:33:00,2014-06-29 19:41:00,7.466666666666668,82.0,cloudy,94,19.0,127,15.0
2354544,Male,2014-06-30 19:02:00,2014-06-30 19:09:00,6.716666666666668,73.0,tstorms,344,15.0,234,19.0
2354549,Male,2014-06-30 19:03:00,2014-06-30 19:09:00,6.65,73.0,tstorms,114,27.0,347,15.0
2343237,Male,2014-06-30 06:35:00,2014-06-30 06:52:00,17.6,73.0,cloudy,168,19.0,76,39.0
2353805,Male,2014-06-30 18:14:00,2014-06-30 18:19:00,5.633333333333334,82.0,tstorms,91,31.0,80,19.0
2345494,Female,2014-06-30 08:55:00,2014-06-30 09:05:00,9.933333333333334,75.0,cloudy,192,39.0,43,43.0
2345516,Male,2014-06-30 08:55:00,2014-06-30 09:00:00,5.166666666666667,75.0,cloudy,77,23.0,80,19.0
2349455,Female,2014-06-30 14:44:00,2014-06-30 15:04:00,19.85,87.1,cloudy,71,15.0,291,19.0
2352905,Male,2014-06-30 17:38:00,2014-06-30 18:01:00,23.4,84.0,cloudy,90,35.0,274,15.0
2349049,Male,2014-06-30 14:10:00,2014-06-30 14:24:00,13.366666666666667,87.1,cloudy,44,27.0,198,19.0
2333292,Male,2014-06-29 14:34:00,2014-06-29 14:49:00,15.1,84.0,cloudy,115,23.0,165,19.0
2342336,Male,2014-06-29 22:28:00,2014-06-29 22:38:00,9.45,78.1,cloudy,154,15.0,246,11.0
2345149,Male,2014-06-30 08:38:00,2014-06-30 08:47:00,9.3,75.0,cloudy,59,19.0,170,15.0
2338660,Female,2014-06-29 18:12:00,2014-06-29 18:43:00,31.0,84.2,cloudy,177,15.0,332,15.0
2341225,Male,2014-06-29 20:39:00,2014-06-29 21:14:00,35.016666666666666,81.0,cloudy,85,23.0,14,15.0
2343302,Female,2014-06-30 06:41:00,2014-06-30 06:48:00,6.8,73.0,cloudy,276,11.0,69,19.0
2344562,Male,2014-06-30 08:12:00,2014-06-30 08:18:00,6.65,75.0,cloudy,43,43.0,174,23.0
2343592,Female,2014-06-30 07:09:00,2014-06-30 07:36:00,26.65,73.0,cloudy,297,15.0,93,15.0
2344473,Male,2014-06-30 08:06:00,2014-06-30 08:13:00,6.45,75.0,cloudy,287,27.0,91,31.0
2350673,Male,2014-06-30 16:12:00,2014-06-30 16:25:00,12.133333333333333,84.0,cloudy,51,31.0,301,19.0
2353814,Male,2014-06-30 18:14:00,2014-06-30 18:29:00,15.216666666666667,82.0,tstorms,168,19.0,43,43.0
2332398,Female,2014-06-29 14:04:00,2014-06-29 14:10:00,5.766666666666668,84.0,cloudy,209,11.0,120,15.0
2334744,Male,2014-06-29 15:26:00,2014-06-29 15:34:00,8.25,82.9,cloudy,198,19.0,66,19.0
2348254,Male,2014-06-30 13:03:00,2014-06-30 13:14:00,11.716666666666667,84.9,cloudy,318,15.0,311,15.0
2343661,Male,2014-06-30 07:20:00,2014-06-30 07:30:00,10.516666666666667,73.0,cloudy,15,15.0,280,11.0
2353389,Male,2014-06-30 17:55:00,2014-06-30 18:02:00,6.8,84.0,cloudy,210,19.0,305,15.0
2348103,Female,2014-06-30 12:50:00,2014-06-30 12:56:00,5.9,82.0,cloudy,113,15.0,331,19.0
2353703,Male,2014-06-30 18:09:00,2014-06-30 18:32:00,23.48333333333333,82.0,tstorms,52,31.0,16,11.0
2343517,Male,2014-06-30 07:08:00,2014-06-30 07:18:00,10.016666666666667,73.0,cloudy,50,27.0,195,31.0
2351496,Female,2014-06-30 16:52:00,2014-06-30 16:59:00,6.55,84.0,cloudy,244,19.0,308,11.0
2332643,Female,2014-06-29 14:11:00,2014-06-29 14:20:00,9.166666666666666,84.0,cloudy,291,19.0,212,31.0
2352918,Female,2014-06-30 17:38:00,2014-06-30 17:49:00,11.366666666666667,84.0,cloudy,174,23.0,90,35.0
2345454,Male,2014-06-30 08:52:00,2014-06-30 08:58:00,6.0,75.0,cloudy,36,31.0,100,23.0
2352036,Female,2014-06-30 17:08:00,2014-06-30 17:13:00,4.733333333333333,84.0,cloudy,130,15.0,16,11.0
2340128,Male,2014-06-29 19:32:00,2014-06-29 19:49:00,17.583333333333332,82.0,cloudy,76,39.0,341,19.0
2352250,Male,2014-06-30 17:15:00,2014-06-30 17:20:00,5.15,84.0,cloudy,118,19.0,288,11.0
2344345,Male,2014-06-30 08:00:00,2014-06-30 08:07:00,7.216666666666668,75.0,cloudy,37,19.0,194,11.0
2338757,Male,2014-06-29 18:18:00,2014-06-29 18:34:00,15.5,84.2,cloudy,174,23.0,22,15.0
2343113,Male,2014-06-30 06:02:00,2014-06-30 06:06:00,4.033333333333333,73.0,cloudy,153,19.0,115,23.0
2337013,Female,2014-06-29 16:54:00,2014-06-29 17:20:00,25.7,84.9,cloudy,118,19.0,326,11.0
2339012,Female,2014-06-29 18:31:00,2014-06-29 19:10:00,38.38333333333333,84.2,cloudy,34,15.0,114,27.0
2343454,Male,2014-06-30 07:02:00,2014-06-30 07:06:00,4.633333333333334,73.0,cloudy,69,19.0,160,15.0
2332706,Female,2014-06-29 14:14:00,2014-06-29 14:17:00,3.4166666666666665,84.0,cloudy,302,19.0,152,15.0
2353305,Male,2014-06-30 17:51:00,2014-06-30 17:56:00,5.0,84.0,cloudy,195,31.0,81,39.0
2344660,Male,2014-06-30 08:16:00,2014-06-30 08:24:00,8.55,75.0,cloudy,191,23.0,181,31.0
2352761,Male,2014-06-30 17:33:00,2014-06-30 17:40:00,7.2,84.0,cloudy,71,15.0,75,23.0
2353277,Male,2014-06-30 17:50:00,2014-06-30 18:08:00,17.616666666666667,84.0,cloudy,198,19.0,183,15.0
2340101,Male,2014-06-29 19:30:00,2014-06-29 19:48:00,18.366666666666667,82.0,cloudy,286,23.0,130,15.0
2351677,Female,2014-06-30 16:57:00,2014-06-30 17:07:00,9.466666666666667,84.0,cloudy,261,15.0,21,15.0
2346809,Male,2014-06-30 10:52:00,2014-06-30 11:08:00,16.4,78.1,cloudy,303,15.0,238,15.0
2351366,Male,2014-06-30 16:48:00,2014-06-30 17:12:00,23.83333333333333,84.0,cloudy,126,15.0,158,15.0
2337456,Female,2014-06-29 17:14:00,2014-06-29 17:17:00,3.2666666666666666,84.9,cloudy,219,11.0,310,11.0
2345074,Male,2014-06-30 08:34:00,2014-06-30 08:38:00,3.966666666666667,75.0,cloudy,272,11.0,147,15.0
2347405,Male,2014-06-30 11:47:00,2014-06-30 11:54:00,6.4,79.0,cloudy,19,15.0,342,15.0
2339183,Male,2014-06-29 18:41:00,2014-06-29 19:00:00,19.233333333333334,84.2,cloudy,177,15.0,26,31.0
2352683,Female,2014-06-30 17:28:00,2014-06-30 17:47:00,19.683333333333334,84.0,cloudy,91,31.0,255,31.0
2347624,Male,2014-06-30 12:09:00,2014-06-30 12:20:00,11.8,82.0,cloudy,112,15.0,53,19.0
2338311,Female,2014-06-29 17:57:00,2014-06-29 18:11:00,14.2,84.9,cloudy,56,19.0,57,15.0
2351400,Male,2014-06-30 16:50:00,2014-06-30 17:04:00,14.583333333333336,84.0,cloudy,106,27.0,91,31.0
2351143,Male,2014-06-30 16:38:00,2014-06-30 16:45:00,7.216666666666668,84.0,cloudy,264,19.0,164,23.0
2339778,Female,2014-06-29 19:13:00,2014-06-29 19:21:00,8.016666666666667,82.0,cloudy,344,15.0,297,15.0
2334632,Male,2014-06-29 15:22:00,2014-06-29 15:37:00,15.033333333333333,82.9,cloudy,334,19.0,289,19.0
2350917,Male,2014-06-30 16:27:00,2014-06-30 16:29:00,2.066666666666667,84.0,cloudy,152,15.0,302,19.0
2341760,Female,2014-06-29 21:20:00,2014-06-29 21:26:00,6.216666666666668,79.0,cloudy,300,15.0,329,15.0
2332198,Male,2014-06-29 13:56:00,2014-06-29 14:12:00,15.716666666666667,82.9,cloudy,315,11.0,290,15.0
2344977,Male,2014-06-30 08:30:00,2014-06-30 08:34:00,3.95,75.0,cloudy,28,15.0,118,19.0
2352320,Female,2014-06-30 17:17:00,2014-06-30 17:45:00,27.73333333333333,84.0,cloudy,249,15.0,324,15.0
2338104,Male,2014-06-29 17:44:00,2014-06-29 18:04:00,19.9,84.9,cloudy,165,19.0,141,23.0
2341930,Female,2014-06-29 21:36:00,2014-06-29 21:47:00,11.2,79.0,cloudy,268,15.0,113,15.0
2348512,Female,2014-06-30 13:25:00,2014-06-30 13:36:00,10.166666666666666,84.9,cloudy,20,15.0,92,19.0
2352406,Male,2014-06-30 17:20:00,2014-06-30 17:25:00,4.733333333333333,84.0,cloudy,284,23.0,321,19.0
2338071,Male,2014-06-29 17:43:00,2014-06-29 18:08:00,25.08333333333333,84.9,cloudy,16,11.0,340,15.0
2352448,Male,2014-06-30 17:21:00,2014-06-30 17:48:00,27.0,84.0,cloudy,164,23.0,15,15.0
2343943,Male,2014-06-30 07:39:00,2014-06-30 07:50:00,11.316666666666665,73.0,cloudy,301,19.0,49,27.0
2352087,Male,2014-06-30 17:07:00,2014-06-30 17:23:00,16.05,84.0,cloudy,91,31.0,138,15.0
2339857,Male,2014-06-29 19:17:00,2014-06-29 19:36:00,19.533333333333328,82.0,cloudy,117,23.0,324,15.0
2332104,Female,2014-06-29 13:54:00,2014-06-29 14:17:00,23.5,82.9,cloudy,277,15.0,199,15.0
2350106,Male,2014-06-30 15:35:00,2014-06-30 15:51:00,16.366666666666667,84.9,cloudy,33,27.0,84,19.0
2344762,Male,2014-06-30 08:20:00,2014-06-30 08:29:00,8.566666666666666,75.0,cloudy,43,43.0,174,23.0
2351014,Female,2014-06-30 16:31:00,2014-06-30 16:43:00,12.25,84.0,cloudy,110,23.0,192,39.0
2338425,Female,2014-06-29 18:02:00,2014-06-29 18:14:00,11.483333333333333,84.2,cloudy,255,31.0,90,35.0
2343236,Female,2014-06-30 06:35:00,2014-06-30 06:40:00,5.016666666666667,73.0,cloudy,190,15.0,67,15.0
2339713,Female,2014-06-29 19:09:00,2014-06-29 19:21:00,11.316666666666665,82.0,cloudy,302,19.0,230,19.0
2331669,Female,2014-06-29 13:38:00,2014-06-29 13:52:00,13.916666666666664,82.9,cloudy,13,19.0,20,15.0
2353506,Male,2014-06-30 18:00:00,2014-06-30 18:09:00,9.416666666666666,82.0,tstorms,177,15.0,156,15.0
2352319,Female,2014-06-30 17:17:00,2014-06-30 17:26:00,9.35,84.0,cloudy,43,43.0,5,19.0
2352719,Male,2014-06-30 17:31:00,2014-06-30 17:37:00,5.866666666666666,84.0,cloudy,138,15.0,289,19.0
2336553,Female,2014-06-29 16:36:00,2014-06-29 16:48:00,12.416666666666664,84.9,cloudy,332,15.0,250,19.0
2353056,Female,2014-06-30 17:42:00,2014-06-30 17:51:00,8.766666666666667,84.0,cloudy,176,19.0,94,19.0
2335260,Male,2014-06-29 15:47:00,2014-06-29 16:13:00,26.1,82.9,cloudy,264,19.0,268,15.0
2351678,Female,2014-06-30 16:58:00,2014-06-30 17:01:00,3.333333333333333,84.0,cloudy,213,15.0,159,9.0
2340477,Male,2014-06-29 19:51:00,2014-06-29 20:05:00,13.916666666666664,82.0,cloudy,327,19.0,299,15.0
2341171,Male,2014-06-29 20:35:00,2014-06-29 20:46:00,10.983333333333333,81.0,cloudy,93,15.0,153,19.0
2350877,Female,2014-06-30 16:25:00,2014-06-30 16:40:00,14.716666666666667,84.0,cloudy,137,15.0,280,11.0
2351710,Male,2014-06-30 16:59:00,2014-06-30 17:03:00,4.633333333333334,84.0,cloudy,37,19.0,192,39.0
2334946,Male,2014-06-29 15:33:00,2014-06-29 15:47:00,13.3,82.9,cloudy,343,15.0,177,15.0
2346480,Male,2014-06-30 10:17:00,2014-06-30 10:42:00,24.88333333333333,78.1,cloudy,272,11.0,284,23.0
2335266,Female,2014-06-29 15:47:00,2014-06-29 15:57:00,9.633333333333333,82.9,cloudy,343,15.0,93,15.0
2351856,Male,2014-06-30 17:04:00,2014-06-30 17:13:00,9.333333333333334,84.0,cloudy,195,31.0,91,31.0
2339206,Female,2014-06-29 18:42:00,2014-06-29 19:05:00,23.65,84.2,cloudy,176,19.0,329,15.0
2343736,Male,2014-06-30 07:25:00,2014-06-30 07:37:00,11.983333333333333,73.0,cloudy,291,19.0,52,31.0
2342166,Female,2014-06-29 22:03:00,2014-06-29 22:10:00,6.6,78.1,cloudy,326,11.0,242,15.0
2347637,Male,2014-06-30 12:09:00,2014-06-30 12:22:00,12.55,82.0,cloudy,24,15.0,91,31.0
2351226,Male,2014-06-30 16:42:00,2014-06-30 17:00:00,17.266666666666666,84.0,cloudy,51,31.0,61,15.0
2346964,Male,2014-06-30 11:09:00,2014-06-30 11:13:00,4.066666666666666,79.0,cloudy,196,19.0,47,19.0
2346583,Male,2014-06-30 10:31:00,2014-06-30 10:46:00,14.5,78.1,cloudy,207,15.0,108,19.0
2346569,Male,2014-06-30 10:30:00,2014-06-30 10:56:00,25.73333333333333,78.1,cloudy,313,19.0,313,19.0
2340978,Male,2014-06-29 20:22:00,2014-06-29 20:41:00,19.016666666666666,81.0,cloudy,268,15.0,268,15.0
2336751,Male,2014-06-29 16:44:00,2014-06-29 16:53:00,8.683333333333334,84.9,cloudy,301,19.0,94,19.0
2338598,Male,2014-06-29 18:08:00,2014-06-29 18:17:00,8.316666666666666,84.2,cloudy,330,19.0,114,27.0
2333013,Male,2014-06-29 14:24:00,2014-06-29 14:33:00,8.733333333333333,84.0,cloudy,250,19.0,156,15.0
2345612,Male,2014-06-30 08:59:00,2014-06-30 09:28:00,29.11666666666667,75.0,cloudy,119,19.0,20,15.0
2341697,Female,2014-06-29 21:13:00,2014-06-29 21:23:00,10.133333333333333,79.0,cloudy,144,15.0,115,23.0
2346246,Male,2014-06-30 09:50:00,2014-06-30 10:04:00,14.4,78.1,cloudy,168,19.0,287,27.0
2352797,Male,2014-06-30 17:34:00,2014-06-30 17:43:00,8.516666666666667,84.0,cloudy,286,23.0,90,35.0
2351388,Male,2014-06-30 16:49:00,2014-06-30 17:09:00,19.866666666666667,84.0,cloudy,66,19.0,273,15.0
2342788,Male,2014-06-29 23:53:00,2014-06-30 00:06:00,12.8,78.1,cloudy,67,15.0,117,23.0
2331967,Male,2014-06-29 13:49:00,2014-06-29 13:53:00,4.633333333333334,82.9,cloudy,195,31.0,51,31.0
2343373,Male,2014-06-30 06:49:00,2014-06-30 07:00:00,10.533333333333333,73.0,cloudy,174,23.0,26,31.0
2345782,Male,2014-06-30 09:11:00,2014-06-30 09:23:00,11.333333333333336,78.1,cloudy,199,15.0,283,23.0
2351848,Male,2014-06-30 17:03:00,2014-06-30 17:26:00,23.266666666666666,84.0,cloudy,286,23.0,25,23.0
2351827,Male,2014-06-30 17:03:00,2014-06-30 17:06:00,3.6666666666666665,84.0,cloudy,134,19.0,192,39.0
2353711,Male,2014-06-30 18:09:00,2014-06-30 18:19:00,9.3,82.0,tstorms,69,19.0,123,15.0
2351929,Male,2014-06-30 17:06:00,2014-06-30 17:12:00,5.783333333333332,84.0,cloudy,158,15.0,16,11.0
2351752,Male,2014-06-30 16:57:00,2014-06-30 17:18:00,21.133333333333333,84.0,cloudy,75,23.0,305,15.0
2343398,Male,2014-06-30 06:53:00,2014-06-30 06:57:00,3.566666666666667,73.0,cloudy,192,39.0,283,23.0
2343962,Male,2014-06-30 07:38:00,2014-06-30 07:40:00,2.3,73.0,cloudy,93,15.0,60,19.0
2344075,Male,2014-06-30 07:47:00,2014-06-30 07:54:00,6.85,73.0,cloudy,66,19.0,47,19.0
2332299,Male,2014-06-29 14:00:00,2014-06-29 14:17:00,16.333333333333332,84.0,cloudy,177,15.0,249,15.0
2343187,Male,2014-06-30 06:28:00,2014-06-30 06:38:00,10.033333333333333,73.0,cloudy,190,15.0,20,15.0
2344731,Male,2014-06-30 08:20:00,2014-06-30 08:32:00,12.45,75.0,cloudy,75,23.0,51,31.0
2350115,Male,2014-06-30 15:35:00,2014-06-30 15:56:00,21.48333333333333,84.9,cloudy,36,31.0,350,15.0
2350788,Male,2014-06-30 16:20:00,2014-06-30 16:32:00,11.95,84.0,cloudy,149,11.0,149,11.0
2344797,Male,2014-06-30 08:22:00,2014-06-30 08:25:00,3.3,75.0,cloudy,316,19.0,344,15.0
2342324,Male,2014-06-29 22:26:00,2014-06-29 22:38:00,11.233333333333333,78.1,cloudy,120,15.0,280,11.0
2333178,Female,2014-06-29 14:30:00,2014-06-29 14:53:00,23.08333333333333,84.0,cloudy,150,11.0,247,15.0
2346655,Male,2014-06-30 10:37:00,2014-06-30 11:23:00,46.48333333333333,78.1,cloudy,157,15.0,164,23.0
2343777,Male,2014-06-30 07:29:00,2014-06-30 07:45:00,15.7,73.0,cloudy,192,39.0,120,15.0
2354481,Female,2014-06-30 18:55:00,2014-06-30 19:19:00,23.9,82.0,tstorms,232,23.0,156,15.0
2351837,Male,2014-06-30 17:03:00,2014-06-30 17:19:00,15.733333333333333,84.0,cloudy,28,15.0,350,15.0
2350694,Male,2014-06-30 16:14:00,2014-06-30 16:18:00,4.5,84.0,cloudy,181,31.0,111,19.0
2333345,Male,2014-06-29 14:36:00,2014-06-29 14:57:00,20.933333333333334,84.0,cloudy,177,15.0,312,15.0
2352571,Male,2014-06-30 17:26:00,2014-06-30 17:38:00,12.366666666666667,84.0,cloudy,93,15.0,258,19.0
2351868,Male,2014-06-30 17:04:00,2014-06-30 17:08:00,4.116666666666666,84.0,cloudy,195,31.0,43,43.0
2347621,Male,2014-06-30 12:09:00,2014-06-30 12:33:00,24.86666666666667,82.0,cloudy,77,23.0,301,19.0
2353457,Male,2014-06-30 17:58:00,2014-06-30 18:08:00,9.9,84.0,cloudy,152,15.0,227,15.0
2348064,Male,2014-06-30 12:47:00,2014-06-30 12:54:00,6.966666666666668,82.0,cloudy,198,19.0,84,19.0
2350322,Female,2014-06-30 15:50:00,2014-06-30 16:03:00,12.55,84.9,cloudy,100,23.0,186,15.0
2343897,Male,2014-06-30 07:37:00,2014-06-30 07:54:00,17.2,73.0,cloudy,220,19.0,53,19.0
2339298,Male,2014-06-29 18:46:00,2014-06-29 19:01:00,14.45,84.2,cloudy,260,19.0,130,15.0
2344057,Female,2014-06-30 07:46:00,2014-06-30 07:52:00,5.7,73.0,cloudy,239,15.0,344,15.0
2344853,Male,2014-06-30 08:25:00,2014-06-30 08:39:00,14.433333333333335,75.0,cloudy,48,27.0,134,19.0
2352075,Male,2014-06-30 17:10:00,2014-06-30 17:18:00,8.416666666666666,84.0,cloudy,43,43.0,174,23.0
2350492,Male,2014-06-30 16:01:00,2014-06-30 16:27:00,26.45,84.0,cloudy,100,23.0,127,15.0
2354454,Female,2014-06-30 18:53:00,2014-06-30 19:09:00,16.133333333333333,82.0,tstorms,17,15.0,305,15.0
2337105,Male,2014-06-29 16:59:00,2014-06-29 17:23:00,24.216666666666665,84.9,cloudy,177,15.0,177,15.0
2340733,Male,2014-06-29 20:04:00,2014-06-29 20:20:00,15.333333333333336,81.0,cloudy,35,39.0,45,15.0
2334807,Male,2014-06-29 15:28:00,2014-06-29 15:36:00,7.65,82.9,cloudy,205,15.0,14,15.0
2351822,Male,2014-06-30 17:03:00,2014-06-30 17:15:00,12.516666666666667,84.0,cloudy,261,15.0,77,23.0
2343839,Male,2014-06-30 07:34:00,2014-06-30 07:40:00,5.916666666666668,73.0,cloudy,165,19.0,117,23.0
2344144,Male,2014-06-30 07:50:00,2014-06-30 07:55:00,5.1,73.0,cloudy,130,15.0,213,15.0
2340841,Male,2014-06-29 20:12:00,2014-06-29 20:21:00,9.45,81.0,cloudy,174,23.0,22,15.0
2345813,Male,2014-06-30 09:13:00,2014-06-30 09:22:00,9.2,78.1,cloudy,66,19.0,48,27.0
2340464,Male,2014-06-29 19:50:00,2014-06-29 19:59:00,8.116666666666667,82.0,cloudy,110,23.0,26,31.0
2347532,Male,2014-06-30 12:01:00,2014-06-30 12:14:00,13.3,82.0,cloudy,51,31.0,255,31.0
2345015,Male,2014-06-30 08:31:00,2014-06-30 08:58:00,26.66666666666667,75.0,cloudy,94,19.0,174,23.0
2338571,Female,2014-06-29 18:08:00,2014-06-29 18:28:00,20.45,84.2,cloudy,258,19.0,289,19.0
2352362,Male,2014-06-30 17:19:00,2014-06-30 17:30:00,11.433333333333335,84.0,cloudy,100,23.0,175,19.0
2353464,Male,2014-06-30 17:58:00,2014-06-30 18:02:00,3.616666666666666,84.0,cloudy,69,19.0,315,11.0
2353853,Female,2014-06-30 18:16:00,2014-06-30 18:23:00,6.666666666666668,82.0,tstorms,113,15.0,144,15.0
2352361,Male,2014-06-30 17:18:00,2014-06-30 17:34:00,15.35,84.0,cloudy,287,27.0,28,15.0
2337845,Female,2014-06-29 17:32:00,2014-06-29 17:37:00,4.65,84.9,cloudy,17,15.0,183,15.0
2345287,Male,2014-06-30 08:43:00,2014-06-30 08:47:00,3.583333333333333,75.0,cloudy,343,15.0,67,15.0
2346267,Male,2014-06-30 09:53:00,2014-06-30 10:13:00,20.33333333333333,78.1,cloudy,141,23.0,37,19.0
2335819,Male,2014-06-29 16:08:00,2014-06-29 16:21:00,12.866666666666667,84.9,cloudy,289,19.0,152,15.0
2354722,Male,2014-06-30 19:53:00,2014-06-30 20:04:00,11.2,73.0,tstorms,135,11.0,278,15.0
2339292,Female,2014-06-29 18:46:00,2014-06-29 18:57:00,11.2,84.2,cloudy,274,15.0,57,15.0
2346354,Male,2014-06-30 10:05:00,2014-06-30 10:15:00,10.15,78.1,cloudy,146,11.0,50,27.0
2347951,Female,2014-06-30 12:37:00,2014-06-30 13:05:00,27.53333333333333,82.0,cloudy,196,19.0,349,15.0
2339736,Male,2014-06-29 19:11:00,2014-06-29 19:24:00,13.733333333333333,82.0,cloudy,198,19.0,261,15.0
2335941,Male,2014-06-29 16:12:00,2014-06-29 16:29:00,16.583333333333332,84.9,cloudy,97,35.0,35,39.0
2349387,Female,2014-06-30 14:38:00,2014-06-30 14:52:00,13.566666666666665,87.1,cloudy,37,19.0,59,19.0
2352072,Female,2014-06-30 17:10:00,2014-06-30 17:17:00,7.333333333333332,84.0,cloudy,98,15.0,321,19.0
2353726,Female,2014-06-30 18:06:00,2014-06-30 18:13:00,7.25,82.0,tstorms,58,19.0,210,19.0
2332642,Female,2014-06-29 14:11:00,2014-06-29 14:20:00,8.85,84.0,cloudy,315,11.0,128,15.0
2354460,Male,2014-06-30 18:54:00,2014-06-30 18:59:00,5.416666666666668,82.0,tstorms,51,31.0,52,31.0
2332204,Male,2014-06-29 13:57:00,2014-06-29 14:12:00,15.833333333333336,82.9,cloudy,291,19.0,13,19.0
2340677,Female,2014-06-29 20:02:00,2014-06-29 20:06:00,4.466666666666667,81.0,cloudy,181,31.0,110,23.0
2340045,Male,2014-06-29 19:27:00,2014-06-29 19:46:00,18.566666666666666,82.0,cloudy,233,15.0,61,15.0
2336433,Female,2014-06-29 16:31:00,2014-06-29 16:40:00,9.416666666666666,84.9,cloudy,260,19.0,259,15.0
2332489,Female,2014-06-29 14:07:00,2014-06-29 14:26:00,19.016666666666666,84.0,cloudy,76,39.0,273,15.0
2351029,Male,2014-06-30 16:32:00,2014-06-30 16:42:00,10.15,84.0,cloudy,110,23.0,91,31.0
2352926,Male,2014-06-30 17:38:00,2014-06-30 17:54:00,15.616666666666667,84.0,cloudy,199,15.0,60,19.0
2351736,Female,2014-06-30 17:00:00,2014-06-30 17:19:00,19.2,84.0,cloudy,100,23.0,340,15.0
2343715,Male,2014-06-30 07:24:00,2014-06-30 07:34:00,10.266666666666667,73.0,cloudy,91,31.0,195,31.0
2352847,Male,2014-06-30 17:36:00,2014-06-30 17:45:00,9.033333333333333,84.0,cloudy,286,23.0,91,31.0
2351793,Female,2014-06-30 17:01:00,2014-06-30 17:17:00,15.916666666666664,84.0,cloudy,69,19.0,228,11.0
2347166,Male,2014-06-30 11:29:00,2014-06-30 11:36:00,7.3,79.0,cloudy,236,15.0,48,27.0
2350257,Male,2014-06-30 15:46:00,2014-06-30 15:52:00,6.283333333333332,84.9,cloudy,250,19.0,115,23.0
2351478,Male,2014-06-30 16:52:00,2014-06-30 17:04:00,11.55,84.0,cloudy,48,27.0,192,39.0
2345866,Male,2014-06-30 09:16:00,2014-06-30 09:22:00,6.516666666666668,78.1,cloudy,77,23.0,37,19.0
2335695,Male,2014-06-29 15:51:00,2014-06-29 16:03:00,11.95,82.9,cloudy,254,15.0,256,15.0
2338304,Male,2014-06-29 17:56:00,2014-06-29 18:11:00,14.683333333333335,84.9,cloudy,51,31.0,268,15.0
2346323,Female,2014-06-30 09:59:00,2014-06-30 10:52:00,52.51666666666666,78.1,cloudy,294,15.0,35,39.0
2350321,Female,2014-06-30 15:50:00,2014-06-30 16:08:00,17.85,84.9,cloudy,173,15.0,91,31.0
2354762,Male,2014-06-30 20:08:00,2014-06-30 20:28:00,20.316666666666666,70.0,rain or snow,148,11.0,171,11.0
2342510,Male,2014-06-29 22:54:00,2014-06-29 23:07:00,12.75,78.1,cloudy,93,15.0,228,11.0
2347885,Male,2014-06-30 12:32:00,2014-06-30 12:46:00,13.916666666666664,82.0,cloudy,100,23.0,35,39.0
2337576,Male,2014-06-29 17:20:00,2014-06-29 17:46:00,26.18333333333333,84.9,cloudy,334,19.0,118,19.0
2349426,Male,2014-06-30 14:42:00,2014-06-30 14:56:00,14.7,87.1,cloudy,176,19.0,127,15.0
2353938,Male,2014-06-30 18:21:00,2014-06-30 18:33:00,11.95,82.0,tstorms,316,19.0,242,15.0
2342152,Male,2014-06-29 22:02:00,2014-06-29 22:18:00,16.533333333333335,78.1,cloudy,59,19.0,15,15.0
2344840,Male,2014-06-30 08:24:00,2014-06-30 08:32:00,8.516666666666667,75.0,cloudy,192,39.0,32,19.0
2345628,Female,2014-06-30 09:02:00,2014-06-30 09:16:00,14.4,78.1,cloudy,329,15.0,301,19.0
2351063,Female,2014-06-30 16:33:00,2014-06-30 16:43:00,10.05,84.0,cloudy,317,15.0,21,15.0
2338694,Male,2014-06-29 18:14:00,2014-06-29 18:26:00,12.116666666666667,84.2,cloudy,46,19.0,286,23.0
2354535,Male,2014-06-30 19:01:00,2014-06-30 19:08:00,6.2333333333333325,73.0,tstorms,81,39.0,181,31.0
2345366,Female,2014-06-30 08:47:00,2014-06-30 09:15:00,27.63333333333333,75.0,cloudy,157,15.0,100,23.0
2354086,Male,2014-06-30 18:30:00,2014-06-30 18:46:00,16.7,82.0,tstorms,177,15.0,251,15.0
2349564,Female,2014-06-30 14:55:00,2014-06-30 15:02:00,7.566666666666666,87.1,cloudy,175,19.0,283,23.0
2348380,Male,2014-06-30 13:13:00,2014-06-30 13:33:00,19.7,84.9,cloudy,160,15.0,160,15.0
2345755,Male,2014-06-30 09:10:00,2014-06-30 09:13:00,2.8,78.1,cloudy,289,19.0,118,19.0
2344779,Female,2014-06-30 08:21:00,2014-06-30 08:35:00,14.133333333333333,75.0,cloudy,46,19.0,37,19.0
2333051,Female,2014-06-29 14:25:00,2014-06-29 14:40:00,14.883333333333333,84.0,cloudy,177,15.0,232,23.0
2345070,Male,2014-06-30 08:33:00,2014-06-30 08:54:00,20.733333333333334,75.0,cloudy,334,19.0,106
SYMBOL INDEX (221 symbols across 30 files)
FILE: composeml/conftest.py
function transactions (line 9) | def transactions():
function total_spent_fn (line 29) | def total_spent_fn():
function unique_amounts_fn (line 38) | def unique_amounts_fn():
function total_spent (line 46) | def total_spent():
function labels (line 77) | def labels():
function add_labels (line 117) | def add_labels(doctest_namespace, labels):
FILE: composeml/data_slice/extension.py
class DataSliceContext (line 6) | class DataSliceContext:
method __init__ (line 9) | def __init__(
method __repr__ (line 29) | def __repr__(self):
method _series (line 34) | def _series(self):
method count (line 42) | def count(self):
method start (line 47) | def start(self):
method stop (line 52) | def stop(self):
class DataSliceFrame (line 57) | class DataSliceFrame(pd.DataFrame):
method _constructor (line 63) | def _constructor(self):
method ctx (line 67) | def ctx(self):
class DataSliceExtension (line 73) | class DataSliceExtension:
method __init__ (line 74) | def __init__(self, df):
method __call__ (line 77) | def __call__(self, size=None, start=None, stop=None, step=None, drop_e...
method __getitem__ (line 96) | def __getitem__(self, offset):
method _apply (line 102) | def _apply(self, size, start, stop, step, drop_empty=True):
method _apply_size (line 121) | def _apply_size(self, df, start, size):
method _apply_start (line 144) | def _apply_start(self, df, start, step):
method _apply_step (line 160) | def _apply_step(self, df, start, step):
method _check_index (line 172) | def _check_index(self):
method _check_offsets (line 180) | def _check_offsets(self, size, start, stop, step):
method _check_size (line 194) | def _check_size(self, size):
method _check_start (line 202) | def _check_start(self, start):
method _check_step (line 212) | def _check_step(self, step):
method _check_stop (line 220) | def _check_stop(self, stop):
method _get_index (line 237) | def _get_index(self, df, i):
method _is_sorted (line 243) | def _is_sorted(self):
method _is_time_index (line 248) | def _is_time_index(self):
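The extension methods above generate sliding data slices over a sorted frame, controlled by `size`, `start`, `stop`, and `step` offsets, optionally dropping empty slices. A minimal stdlib sketch of the core windowing idea, assuming position-based offsets over a plain list — the name `slice_windows` and its parameters are illustrative, not the library's API:

```python
def slice_windows(rows, size, step=None, drop_empty=True):
    """Yield consecutive windows over `rows`: each window holds up to
    `size` items, and consecutive windows start `step` items apart.
    When `step < size`, windows overlap; when `step` is omitted, it
    defaults to `size` (non-overlapping windows)."""
    step = size if step is None else step
    i = 0
    while i < len(rows):
        window = rows[i:i + size]
        if window or not drop_empty:
            yield window
        i += step

# Overlapping windows: size=3, step=2 over six items.
windows = list(slice_windows([1, 2, 3, 4, 5, 6], size=3, step=2))
# -> [[1, 2, 3], [3, 4, 5], [5, 6]]
```

The real extension additionally supports time-based offsets (timestamps and frequency aliases) and validates the frame's index, per the `_check_index` and `_is_time_index` helpers listed above.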
FILE: composeml/data_slice/generator.py
class DataSliceGenerator (line 4) | class DataSliceGenerator:
method __init__ (line 7) | def __init__(
method __call__ (line 21) | def __call__(self, df):
method _slice_by_column (line 28) | def _slice_by_column(self, df):
method _slice_by_time (line 45) | def _slice_by_time(self, df):
FILE: composeml/data_slice/offset.py
class DataSliceOffset (line 6) | class DataSliceOffset:
method __init__ (line 9) | def __init__(self, value):
method _check (line 13) | def _check(self):
method _is_offset_base (line 20) | def _is_offset_base(self):
method _is_offset_position (line 25) | def _is_offset_position(self):
method _is_offset_timedelta (line 30) | def _is_offset_timedelta(self):
method _is_offset_timestamp (line 35) | def _is_offset_timestamp(self):
method _is_offset_frequency (line 40) | def _is_offset_frequency(self):
method __int__ (line 46) | def __int__(self):
method __float__ (line 57) | def __float__(self):
method _is_positive (line 65) | def _is_positive(self):
method _is_valid_offset (line 72) | def _is_valid_offset(self):
method _invalid_offset_error (line 80) | def _invalid_offset_error(self):
method _parse_offset_alias (line 89) | def _parse_offset_alias(self, alias):
method _parse_offset_alias_phrase (line 95) | def _parse_offset_alias_phrase(self, value):
method _parse_value (line 110) | def _parse_value(self):
method _parsers (line 122) | def _parsers(self):
class DataSliceStep (line 127) | class DataSliceStep(DataSliceOffset):
method _is_valid_offset (line 129) | def _is_valid_offset(self):
method _parsers (line 136) | def _parsers(self):
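The offset classes above accept several value types — integers, timedeltas, timestamps, and frequency-alias strings — and normalize them, per the `_parse_offset_alias` and `_parse_offset_alias_phrase` helpers. A stdlib sketch of the alias-parsing idea, assuming only a handful of units; the real implementation defers to pandas offset aliases, and `parse_alias` here is a hypothetical name:

```python
import re
from datetime import timedelta

# Units this sketch understands; pandas supports many more aliases.
UNITS = {"s": "seconds", "min": "minutes", "h": "hours", "d": "days", "w": "weeks"}
LONG_FORMS = {
    "second": "s", "seconds": "s", "minute": "min", "minutes": "min",
    "hour": "h", "hours": "h", "day": "d", "days": "d", "week": "w", "weeks": "w",
}

def parse_alias(value):
    """Parse an offset phrase like '2h' or '25 hours' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", value.lower())
    if not match:
        raise ValueError(f"invalid offset: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    unit = LONG_FORMS.get(unit, unit)
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return timedelta(**{UNITS[unit]: amount})
```

For example, `parse_alias("25 hours")` and `parse_alias("25h")` both produce `timedelta(hours=25)`.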
FILE: composeml/demos/__init__.py
function load_transactions (line 8) | def load_transactions():
FILE: composeml/label_maker.py
class LabelMaker (line 12) | class LabelMaker:
method __init__ (line 15) | def __init__(
method _name_labeling_function (line 37) | def _name_labeling_function(self, function):
method _check_labeling_function (line 42) | def _check_labeling_function(self, function, name=None):
method labeling_function (line 48) | def labeling_function(self):
method labeling_function (line 53) | def labeling_function(self, value):
method _check_cutoff_time (line 78) | def _check_cutoff_time(self, value):
method slice (line 87) | def slice(
method _bar_format (line 147) | def _bar_format(self):
method _check_example_count (line 155) | def _check_example_count(self, num_examples_per_instance, gap):
method search (line 163) | def search(
method set_index (line 295) | def set_index(self, df):
FILE: composeml/label_search.py
class ExampleSearch (line 6) | class ExampleSearch:
method __init__ (line 13) | def __init__(self, expected_count):
method _check_number (line 18) | def _check_number(n):
method _is_finite_number (line 28) | def _is_finite_number(n):
method is_complete (line 33) | def is_complete(self):
method is_finite (line 38) | def is_finite(self):
method is_valid_labels (line 42) | def is_valid_labels(self, labels):
method reset_count (line 46) | def reset_count(self):
method update_count (line 50) | def update_count(self, labels):
class LabelSearch (line 55) | class LabelSearch(ExampleSearch):
method __init__ (line 63) | def __init__(self, expected_label_counts):
method is_complete (line 72) | def is_complete(self):
method is_complete_label (line 76) | def is_complete_label(self, label):
method is_valid_labels (line 82) | def is_valid_labels(self, labels):
method reset_count (line 105) | def reset_count(self):
method update_count (line 109) | def update_count(self, labels):
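`ExampleSearch` and `LabelSearch` track how many examples (overall, or per label) have been found so the search can stop early once the quota is met — see `is_complete`, `update_count`, and `reset_count` above. A stdlib sketch of that bookkeeping, assuming a per-label quota; the class name `CountTracker` and its methods are illustrative, not the library's API:

```python
import math

class CountTracker:
    """Track labels seen during a search and report when every
    per-label quota has been met."""

    def __init__(self, expected_label_counts):
        self.expected = dict(expected_label_counts)
        self.actual = {label: 0 for label in self.expected}

    def update(self, label):
        """Count one example; return whether it was accepted
        (known label whose quota is not yet met)."""
        if label in self.actual and self.actual[label] < self.expected[label]:
            self.actual[label] += 1
            return True
        return False

    @property
    def is_complete(self):
        # Infinite quotas never complete, matching an unbounded search.
        return all(self.actual[label] >= count
                   for label, count in self.expected.items()
                   if math.isfinite(count))
```

A driver loop would call `update` for each candidate label and break out as soon as `is_complete` turns true.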
FILE: composeml/label_times/description.py
function describe_label_times (line 4) | def describe_label_times(label_times):
FILE: composeml/label_times/deserialize.py
function read_config (line 9) | def read_config(path):
function read_data (line 19) | def read_data(path):
function read_label_times (line 36) | def read_label_times(path, load_settings=True):
FILE: composeml/label_times/object.py
class LabelTimes (line 13) | class LabelTimes(pd.DataFrame):
method __init__ (line 16) | def __init__(
method _assert_single_target (line 38) | def _assert_single_target(self):
method _check_target_columns (line 43) | def _check_target_columns(self):
method _check_target_types (line 52) | def _check_target_types(self):
method _check_label_times (line 64) | def _check_label_times(self):
method _infer_target_columns (line 69) | def _infer_target_columns(self):
method _is_single_target (line 82) | def _is_single_target(self):
method _get_target_type (line 85) | def _get_target_type(self, dtype):
method _infer_target_types (line 92) | def _infer_target_types(self):
method select (line 102) | def select(self, target):
method settings (line 147) | def settings(self):
method is_discrete (line 162) | def is_discrete(self):
method distribution (line 167) | def distribution(self):
method count (line 181) | def count(self):
method count_by_time (line 190) | def count_by_time(self):
method describe (line 209) | def describe(self):
method copy (line 215) | def copy(self, deep=True):
method threshold (line 233) | def threshold(self, value, inplace=False):
method apply_lead (line 255) | def apply_lead(self, value, inplace=False):
method bin (line 274) | def bin(self, bins, quantiles=False, labels=None, right=True, precisio...
method _sample (line 397) | def _sample(self, key, value, settings, random_state=None, replace=Fal...
method _sample_per_label (line 415) | def _sample_per_label(self, key, value, settings, random_state=None, r...
method sample (line 448) | def sample(
method equals (line 566) | def equals(self, other, **kwargs):
method _save_settings (line 580) | def _save_settings(self, path):
method to_csv (line 594) | def to_csv(self, path, save_settings=True, **kwargs):
method to_parquet (line 609) | def to_parquet(self, path, save_settings=True, **kwargs):
method to_pickle (line 624) | def to_pickle(self, path, save_settings=True, **kwargs):
method __finalize__ (line 651) | def __finalize__(self, other, method=None, **kwargs):
method _constructor (line 670) | def _constructor(self):
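Among the transforms listed above, `threshold` and `bin` convert a continuous target into a discrete one. A stdlib sketch of the threshold idea — binarizing values against a cutoff — assuming a plain list of label values; the real method operates on the target column of the `LabelTimes` frame, and whether the comparison is strict or inclusive is this sketch's assumption:

```python
def apply_threshold(values, cutoff):
    """Binarize continuous labels: True where the value meets the cutoff
    (this sketch uses >=)."""
    return [value >= cutoff for value in values]

flags = apply_threshold([12.5, 80.0, 43.0], cutoff=50)
# -> [False, True, False]
```

The `bin` transform generalizes this from one cutoff to a list of bin edges (or quantiles), assigning each value a categorical bin label instead of a boolean.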
FILE: composeml/label_times/plots.py
class LabelPlots (line 14) | class LabelPlots:
method __init__ (line 17) | def __init__(self, label_times):
method count_by_time (line 25) | def count_by_time(self, ax=None, **kwargs):
method dist (line 78) | def dist(self):
method distribution (line 82) | def distribution(self, **kwargs):
FILE: composeml/tests/test_data_slice/test_extension.py
function data_slice (line 8) | def data_slice(transactions):
function test_context (line 18) | def test_context(data_slice):
function test_context_aliases (line 34) | def test_context_aliases(data_slice):
function test_subscriptable_slices (line 51) | def test_subscriptable_slices(transactions, time_based, offsets):
function test_subscriptable_error (line 63) | def test_subscriptable_error(transactions):
function test_time_index_error (line 68) | def test_time_index_error(transactions):
function test_minimum_data_per_group (line 74) | def test_minimum_data_per_group(transactions):
function test_drop_empty (line 86) | def test_drop_empty(transactions):
FILE: composeml/tests/test_data_slice/test_offset.py
function test_numeric_typecast (line 6) | def test_numeric_typecast():
function test_numeric_typecast_errors (line 11) | def test_numeric_typecast_errors():
function test_invalid_value (line 21) | def test_invalid_value():
function test_alias_phrase (line 27) | def test_alias_phrase():
FILE: composeml/tests/test_datasets.py
function transactions (line 7) | def transactions():
function test_transactions (line 11) | def test_transactions(transactions):
FILE: composeml/tests/test_featuretools.py
function total_spent (line 7) | def total_spent(df):
function labels (line 13) | def labels():
function test_dfs (line 38) | def test_dfs(labels):
FILE: composeml/tests/test_label_maker.py
function test_search_default (line 8) | def test_search_default(transactions, total_spent_fn):
function test_search_examples_per_label (line 29) | def test_search_examples_per_label(transactions, total_spent_fn):
function test_search_with_undefined_labels (line 57) | def test_search_with_undefined_labels(transactions, total_spent_fn):
function test_search_with_multiple_targets (line 85) | def test_search_with_multiple_targets(transactions, total_spent_fn, uniq...
function test_search_offset_mix_0 (line 127) | def test_search_offset_mix_0(transactions, total_spent_fn):
function test_search_offset_mix_1 (line 158) | def test_search_offset_mix_1(transactions, total_spent_fn):
function test_search_offset_mix_2 (line 188) | def test_search_offset_mix_2(transactions, total_spent_fn):
function test_search_offset_mix_3 (line 217) | def test_search_offset_mix_3(transactions, total_spent_fn):
function test_search_offset_mix_4 (line 254) | def test_search_offset_mix_4(transactions, total_spent_fn):
function test_search_offset_mix_5 (line 287) | def test_search_offset_mix_5(transactions, total_spent_fn):
function test_search_offset_mix_6 (line 319) | def test_search_offset_mix_6(transactions, total_spent_fn):
function test_search_offset_mix_7 (line 347) | def test_search_offset_mix_7(transactions, total_spent_fn):
function test_search_offset_negative_0 (line 377) | def test_search_offset_negative_0(transactions, total_spent_fn):
function test_search_offset_negative_1 (line 395) | def test_search_offset_negative_1(transactions, total_spent_fn):
function test_search_invalid_n_examples (line 413) | def test_search_invalid_n_examples(transactions, total_spent_fn):
function test_column_based_windows (line 427) | def test_column_based_windows(transactions, total_spent_fn):
function test_search_with_invalid_index (line 454) | def test_search_with_invalid_index(transactions, total_spent_fn):
function test_search_on_empty_labels (line 473) | def test_search_on_empty_labels(transactions):
function test_data_slice_overlap (line 491) | def test_data_slice_overlap(transactions, total_spent_fn):
function test_label_type (line 504) | def test_label_type(transactions, total_spent_fn):
function test_search_with_maximum_data (line 515) | def test_search_with_maximum_data(transactions):
function test_minimum_data_per_group (line 578) | def test_minimum_data_per_group(transactions, minimum_data):
function test_minimum_data_per_group_error (line 598) | def test_minimum_data_per_group_error(transactions):
function test_label_maker_categorical_target_with_missing_data (line 613) | def test_label_maker_categorical_target_with_missing_data(transactions, ...
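The `test_search_*` functions above exercise `LabelMaker.search`, which slides a window over time-indexed data and applies a labeling function to each slice. A minimal stdlib sketch of that core idea (names like `make_labels` are hypothetical; the real API works on pandas DataFrames and supports offsets, gaps, and per-group search):

```python
from datetime import datetime, timedelta

# Hypothetical, simplified sketch of the sliding-window labeling idea behind
# LabelMaker.search: cut the timeline into fixed windows and label each slice.
def make_labels(rows, window, labeling_function):
    """rows: list of (timestamp, amount) tuples sorted by timestamp."""
    labels = []
    if not rows:
        return labels
    cutoff = rows[0][0]
    end = rows[-1][0]
    while cutoff <= end:
        # Collect the values falling inside this window [cutoff, cutoff + window).
        data_slice = [amount for t, amount in rows if cutoff <= t < cutoff + window]
        labels.append((cutoff, labeling_function(data_slice)))
        cutoff += window
    return labels

rows = [
    (datetime(2019, 1, 1, 8), 10.0),
    (datetime(2019, 1, 1, 9), 20.0),
    (datetime(2019, 1, 2, 8), 5.0),
]
# Two day-long slices: [10.0, 20.0] and [5.0], labeled by their sum.
labels = make_labels(rows, timedelta(days=1), sum)
```

The tests for offset mixes and negative offsets cover the many ways the real implementation lets the window start, step, and gap be configured beyond this fixed-stride version.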
FILE: composeml/tests/test_label_plots.py
function test_count_by_time_categorical (line 4) | def test_count_by_time_categorical(total_spent):
function test_count_by_time_continuous (line 10) | def test_count_by_time_continuous(total_spent):
function test_distribution_categorical (line 15) | def test_distribution_categorical(total_spent):
function test_distribution_continuous (line 21) | def test_distribution_continuous(total_spent):
function test_single_target (line 26) | def test_single_target(total_spent):
FILE: composeml/tests/test_label_serialization.py
function path (line 11) | def path():
function total_spent (line 19) | def total_spent(transactions, total_spent_fn):
function test_csv (line 29) | def test_csv(path, total_spent):
function test_parquet (line 36) | def test_parquet(path, total_spent):
function test_pickle (line 43) | def test_pickle(path, total_spent):
FILE: composeml/tests/test_label_times.py
function test_count_by_time_categorical (line 7) | def test_count_by_time_categorical(total_spent):
function test_count_by_time_continuous (line 28) | def test_count_by_time_continuous(total_spent):
function test_sorted_distribution (line 49) | def test_sorted_distribution(capsys, total_spent):
function test_describe_no_transforms (line 88) | def test_describe_no_transforms(capsys):
function test_distribution_categorical (line 124) | def test_distribution_categorical(total_spent):
function test_distribution_continous (line 138) | def test_distribution_continous(total_spent):
function test_target_type (line 157) | def test_target_type(total_spent):
function test_count (line 165) | def test_count(total_spent):
function test_label_select_errors (line 180) | def test_label_select_errors(total_spent):
FILE: composeml/tests/test_label_transforms/test_bin.py
function test_bins (line 5) | def test_bins(labels):
function test_quantile_bins (line 27) | def test_quantile_bins(labels):
function test_single_target (line 49) | def test_single_target(total_spent):
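`test_bins` and `test_quantile_bins` cover the binning transform on label values. A stdlib sketch of equal-width binning, the idea behind `labels.bin(2)` (the actual implementation is pandas-based; this version only illustrates the arithmetic):

```python
# Hypothetical sketch of equal-width binning: split [min, max] into n_bins
# intervals of equal width and assign each value its interval index.
def bin_values(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        # Clamp so the maximum value lands in the last bin, not past it.
        index = min(int((v - lo) / width), n_bins - 1)
        bins.append(index)
    return bins

# Values 0..100 split into 2 bins: [0, 50) -> bin 0, [50, 100] -> bin 1.
binned = bin_values([0, 25, 50, 75, 100], 2)
```

Quantile binning (`test_quantile_bins`) differs by choosing bin edges so each bin holds roughly the same number of values rather than spanning the same width.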
FILE: composeml/tests/test_label_transforms/test_lead.py
function test_lead (line 4) | def test_lead(labels):
FILE: composeml/tests/test_label_transforms/test_sample.py
function labels (line 8) | def labels(labels):
function test_sample_n_int (line 12) | def test_sample_n_int(labels):
function test_sample_n_per_label (line 26) | def test_sample_n_per_label(labels):
function test_sample_frac_int (line 42) | def test_sample_frac_int(labels):
function test_sample_frac_per_label (line 55) | def test_sample_frac_per_label(labels):
function test_sample_in_transforms (line 71) | def test_sample_in_transforms(labels):
function test_sample_with_replacement (line 88) | def test_sample_with_replacement(labels):
function test_single_target (line 95) | def test_single_target(total_spent):
function test_sample_n_per_instance (line 103) | def test_sample_n_per_instance():
function test_sample_frac_per_instance (line 127) | def test_sample_frac_per_instance():
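Tests like `test_sample_n_per_label` cover stratified sampling: drawing a fixed number of examples for each distinct label value so classes stay balanced. A hedged stdlib sketch of that pattern (the helper name and example shape are assumptions; the real transform operates on a `LabelTimes` frame):

```python
import random

# Hypothetical sketch of per-label sampling: group examples by label,
# then draw up to n from each group.
def sample_n_per_label(examples, n, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducible draws
    by_label = {}
    for example in examples:
        by_label.setdefault(example["label"], []).append(example)
    sampled = []
    for group in by_label.values():
        sampled.extend(rng.sample(group, min(n, len(group))))
    return sampled

# Ten examples split evenly across two labels; sample 2 per label.
examples = [{"id": i, "label": i % 2} for i in range(10)]
sampled = sample_n_per_label(examples, n=2)
```

The `frac`-based variants tested above work the same way but draw a fraction of each group instead of a fixed count.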
FILE: composeml/tests/test_label_transforms/test_threshold.py
function test_threshold (line 4) | def test_threshold(labels):
function test_single_target (line 17) | def test_single_target(total_spent):
FILE: composeml/tests/test_version.py
function test_version (line 4) | def test_version():
FILE: composeml/tests/utils.py
function read_csv (line 6) | def read_csv(data, **kwargs):
function to_csv (line 25) | def to_csv(label_times, **kwargs):
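The `read_csv`/`to_csv` helpers in `composeml/tests/utils.py` round-trip label data through in-memory CSV text (the file's preview shows it is built on `io.StringIO` and pandas). A self-contained stdlib analogue of that pattern, with the pandas-specific behavior left out:

```python
import csv
import io

# Stdlib sketch of the in-memory CSV helper pattern; the real helpers use
# pandas, so the exact signatures here are assumptions.
def to_csv(rows, fieldnames):
    """Serialize dict rows to a CSV string without touching disk."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def read_csv(data):
    """Parse a CSV string back into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(data)))

rows = [{"label_id": "0", "total_spent": "157"}]
restored = read_csv(to_csv(rows, ["label_id", "total_spent"]))
```

Keeping the round-trip in memory lets tests compare expected and actual frames without temp files.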
FILE: docs/source/conf.py
function setup (line 228) | def setup(app):
FILE: docs/source/examples/demo/chicago_bike/__init__.py
function _read (line 8) | def _read(file):
function load_sample (line 16) | def load_sample():
FILE: docs/source/examples/demo/next_purchase/__init__.py
function _add_time (line 12) | def _add_time(df, start="2015-01-01"):
function _data (line 42) | def _data(nrows=1000000):
function _read (line 62) | def _read(file):
function load_sample (line 68) | def load_sample():
FILE: docs/source/examples/demo/turbofan_degredation/__init__.py
function _download_data (line 9) | def _download_data():
function _data (line 14) | def _data():
function _read (line 27) | def _read(file):
function load_sample (line 33) | def load_sample():
FILE: docs/source/examples/demo/utils.py
function download (line 9) | def download(url, output="data"):
function extract (line 22) | def extract(content, content_type, output):
function extract_tarball (line 31) | def extract_tarball(content, output):
function extract_zip (line 42) | def extract_zip(content, output):
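`docs/source/examples/demo/utils.py` downloads demo archives and dispatches on the archive format (`extract_tarball` vs `extract_zip`). A hedged, self-contained sketch of that dispatch using the stdlib `tarfile` and `zipfile` modules (the content-type strings and return value are assumptions; the real code writes files to an output directory):

```python
import io
import tarfile
import zipfile

# Hypothetical sketch of format dispatch: pick an extractor from the
# archive's content type and return the member names it contains.
def extract(content, content_type):
    if "zip" in content_type:
        archive = zipfile.ZipFile(io.BytesIO(content))
        return archive.namelist()
    if "gzip" in content_type or "tar" in content_type:
        archive = tarfile.open(fileobj=io.BytesIO(content), mode="r:*")
        return archive.getnames()
    raise ValueError(f"unsupported content type: {content_type}")

# Build a small in-memory zip to exercise the zip branch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/sample.csv", "a,b\n1,2\n")
names = extract(buffer.getvalue(), "application/zip")
```

Dispatching on content type rather than file extension is useful here because the demo URLs do not always end in a recognizable suffix.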
Condensed preview of all 96 files follows as a JSON array: each entry gives the file path, character count, and a snippet of its content.
[
{
"path": ".codecov.yml",
"chars": 303,
"preview": "codecov:\n notify:\n require_ci_to_pass: yes\n\ncomment:\n layout: \"diff, files\"\n\ncoverage:\n precision: 2\n round: down"
},
{
"path": ".github/ISSUE_TEMPLATE/blank_issue.md",
"chars": 90,
"preview": "---\nname: Blank Issue\nabout: Create a blank issue\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.md",
"chars": 272,
"preview": "---\nname: Bug Report\nabout: Create a bug report to help us improve Compose\ntitle: ''\nlabels: 'bug'\nassignees: ''\n\n---\n\n["
},
{
"path": ".github/ISSUE_TEMPLATE/config.yml",
"chars": 519,
"preview": "blank_issues_enabled: true\ncontact_links:\n - name: General Technical Question\n about: \"If you have a question like *"
},
{
"path": ".github/ISSUE_TEMPLATE/documentation_improvement.md",
"chars": 222,
"preview": "---\nname: Documentation Improvement\nabout: Suggest an idea for improving the documentation\ntitle: ''\nlabels: 'documentat"
},
{
"path": ".github/ISSUE_TEMPLATE/feature_request.md",
"chars": 244,
"preview": "---\nname: Feature Request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: 'new feature'\nassignees: ''\n\n---\n\n- "
},
{
"path": ".github/auto_assign.yml",
"chars": 67,
"preview": "# Set to author to set pr creator as assignee\naddAssignees: author\n"
},
{
"path": ".github/workflows/auto_approve_dependency_PRs.yml",
"chars": 1268,
"preview": "name: Auto Approve Dependency PRs\non:\n schedule:\n - cron: '*/30 * * * *'\n workflow_dispatch:\njobs:\n build:\n r"
},
{
"path": ".github/workflows/build_docs.yml",
"chars": 1075,
"preview": "on:\n pull_request:\n types: [opened, synchronize]\n push:\n branches:\n - main\n\nname: Build Docs\njobs:\n doc_te"
},
{
"path": ".github/workflows/create_feedstock_pr.yaml",
"chars": 2263,
"preview": "name: Create Feedstock PR\non:\n workflow_dispatch:\n inputs:\n version:\n description: 'released PyPI versio"
},
{
"path": ".github/workflows/install_test.yml",
"chars": 1137,
"preview": "on:\n pull_request:\n types: [opened, synchronize]\n push:\n branches:\n - main\n\nname: Install Test\njobs:\n inst"
},
{
"path": ".github/workflows/latest_dependency_checker.yml",
"chars": 1603,
"preview": "# This workflow will install dependenies and if any critical dependencies have changed a pull request\n# will be created "
},
{
"path": ".github/workflows/lint_check.yml",
"chars": 1006,
"preview": "on:\n pull_request:\n types: [opened, synchronize]\n push:\n branches:\n - main\n\nname: Lint Check\njobs:\n lint_t"
},
{
"path": ".github/workflows/release.yml",
"chars": 561,
"preview": "on:\n release:\n types: [published]\n\nname: Release\njobs:\n pypi:\n name: Release to PyPI\n runs-on: ubuntu-latest\n"
},
{
"path": ".github/workflows/release_notes_updated.yml",
"chars": 1273,
"preview": "name: Release Notes Updated\n\non:\n pull_request:\n types: [opened, synchronize]\n\njobs:\n release_notes_updated:\n na"
},
{
"path": ".github/workflows/unit_tests_with_latest_deps.yml",
"chars": 1766,
"preview": "on:\n pull_request:\n types: [opened, synchronize]\n push:\n branches:\n - main\n\nname: Unit Tests - Latest Depen"
},
{
"path": ".gitignore",
"chars": 1302,
"preview": "cb_model.json\n.DS_Store\n\n# IDE\n.vscode\ndocs/source/examples/demo/*/download\n\n# Byte-compiled / optimized / DLL files\n__p"
},
{
"path": ".pre-commit-config.yaml",
"chars": 1173,
"preview": "exclude: |\n (?x)\n .html$|.csv$|.svg$|.md$|.txt$|.json$|.xml$|.pickle$|^.github/|\n (LICENSE.*|README.*)\ndefault_stages"
},
{
"path": ".readthedocs.yaml",
"chars": 573,
"preview": "# .readthedocs.yml\n# Read the Docs configuration file\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html fo"
},
{
"path": "LICENSE",
"chars": 1518,
"preview": "BSD 3-Clause License\n\nCopyright (c) 2017, Feature Labs, Inc.\nAll rights reserved.\n\nRedistribution and use in source and "
},
{
"path": "Makefile",
"chars": 1288,
"preview": ".PHONY: clean\nclean:\n\tfind . -name '*.pyo' -delete\n\tfind . -name '*.pyc' -delete\n\tfind . -name __pycache__ -delete\n\tfind"
},
{
"path": "README.md",
"chars": 7904,
"preview": "<p align=\"center\"><img width=50% src=\"https://raw.githubusercontent.com/alteryx/compose/main/docs/source/images/compose."
},
{
"path": "composeml/__init__.py",
"chars": 208,
"preview": "# flake8:noqa\nfrom composeml.version import __version__\nfrom composeml import demos, update_checker\nfrom composeml.label"
},
{
"path": "composeml/conftest.py",
"chars": 2940,
"preview": "import pandas as pd\nimport pytest\n\nfrom composeml import LabelTimes\nfrom composeml.tests.utils import read_csv\n\n\n@pytest"
},
{
"path": "composeml/data_slice/__init__.py",
"chars": 76,
"preview": "# flake8:noqa\nfrom composeml.data_slice.generator import DataSliceGenerator\n"
},
{
"path": "composeml/data_slice/extension.py",
"chars": 8768,
"preview": "import pandas as pd\n\nfrom composeml.data_slice.offset import DataSliceOffset, DataSliceStep\n\n\nclass DataSliceContext:\n "
},
{
"path": "composeml/data_slice/generator.py",
"chars": 1689,
"preview": "from composeml.data_slice.extension import DataSliceContext, DataSliceFrame\n\n\nclass DataSliceGenerator:\n \"\"\"Generates"
},
{
"path": "composeml/data_slice/offset.py",
"chars": 4215,
"preview": "import re\n\nimport pandas as pd\n\n\nclass DataSliceOffset:\n \"\"\"Offsets for calculating data slice indices.\"\"\"\n\n def _"
},
{
"path": "composeml/demos/__init__.py",
"chars": 231,
"preview": "import os\n\nimport pandas as pd\n\nDATA = os.path.join(os.path.dirname(__file__))\n\n\ndef load_transactions():\n path = os."
},
{
"path": "composeml/demos/transactions.csv",
"chars": 11348,
"preview": "transaction_id,session_id,transaction_time,product_id,amount,customer_id,device,session_start,zip_code,join_date,date_of"
},
{
"path": "composeml/label_maker.py",
"chars": 12734,
"preview": "from sys import stdout\n\nfrom pandas import Series\nfrom pandas.api.types import is_categorical_dtype\nfrom tqdm import tqd"
},
{
"path": "composeml/label_search.py",
"chars": 4006,
"preview": "from collections import Counter\n\nfrom pandas import isnull\n\n\nclass ExampleSearch:\n \"\"\"A label search based on the num"
},
{
"path": "composeml/label_times/__init__.py",
"chars": 129,
"preview": "# flake8:noqa\nfrom composeml.label_times.deserialize import read_label_times\nfrom composeml.label_times.object import La"
},
{
"path": "composeml/label_times/description.py",
"chars": 1905,
"preview": "import pandas as pd\n\n\ndef describe_label_times(label_times):\n \"\"\"Prints out label info with transform settings that r"
},
{
"path": "composeml/label_times/deserialize.py",
"chars": 1338,
"preview": "import json\nimport os\n\nimport pandas as pd\n\nfrom composeml.label_times.object import LabelTimes\n\n\ndef read_config(path):"
},
{
"path": "composeml/label_times/object.py",
"chars": 23278,
"preview": "import json\nimport os\n\nimport pandas as pd\n\nfrom composeml.label_times.description import describe_label_times\nfrom comp"
},
{
"path": "composeml/label_times/plots.py",
"chars": 2963,
"preview": "import matplotlib as mpl # isort:skip\nimport pandas as pd\nimport seaborn as sns\n\n# Raises an import error on OSX if not"
},
{
"path": "composeml/tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "composeml/tests/requirement_files/latest_core_dependencies.txt",
"chars": 99,
"preview": "featuretools==1.27.0\nmatplotlib==3.7.2\npandas==2.0.3\nseaborn==0.12.2\ntqdm==4.66.1\nwoodwork==0.25.1\n"
},
{
"path": "composeml/tests/requirement_files/minimum_core_requirements.txt",
"chars": 61,
"preview": "matplotlib==3.3.3\npandas==2.0.0\nseaborn==0.12.2\ntqdm==4.32.0\n"
},
{
"path": "composeml/tests/requirement_files/minimum_test_requirements.txt",
"chars": 192,
"preview": "featuretools==1.27.0\nmatplotlib==3.3.3\npandas==2.0.0\npip==21.3.1\npyarrow==7.0.0\npytest-cov==3.0.0\npytest-xdist==2.5.0\npy"
},
{
"path": "composeml/tests/test_data_slice/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "composeml/tests/test_data_slice/test_extension.py",
"chars": 2833,
"preview": "import pandas as pd\nfrom pytest import fixture, mark, raises\n\nfrom composeml import LabelMaker\n\n\n@fixture\ndef data_slice"
},
{
"path": "composeml/tests/test_data_slice/test_offset.py",
"chars": 1028,
"preview": "from pytest import raises\n\nfrom composeml.data_slice.offset import DataSliceOffset\n\n\ndef test_numeric_typecast():\n as"
},
{
"path": "composeml/tests/test_datasets.py",
"chars": 193,
"preview": "import pytest\n\nfrom composeml import demos\n\n\n@pytest.fixture\ndef transactions():\n return demos.load_transactions()\n\n\n"
},
{
"path": "composeml/tests/test_featuretools.py",
"chars": 1442,
"preview": "import featuretools as ft\nimport pytest\n\nfrom composeml import LabelMaker\n\n\ndef total_spent(df):\n total = df.amount.s"
},
{
"path": "composeml/tests/test_label_maker.py",
"chars": 16972,
"preview": "import pandas as pd\nimport pytest\n\nfrom composeml import LabelMaker\nfrom composeml.tests.utils import to_csv\n\n\ndef test_"
},
{
"path": "composeml/tests/test_label_plots.py",
"chars": 1014,
"preview": "from pytest import raises\n\n\ndef test_count_by_time_categorical(total_spent):\n total_spent = total_spent.bin(2, labels"
},
{
"path": "composeml/tests/test_label_serialization.py",
"chars": 1220,
"preview": "import os\nimport shutil\n\nimport pandas as pd\nimport pytest\n\nimport composeml as cp\n\n\n@pytest.fixture\ndef path():\n pwd"
},
{
"path": "composeml/tests/test_label_times.py",
"chars": 4929,
"preview": "from pytest import raises\n\nfrom composeml.label_times import LabelTimes\nfrom composeml.tests.utils import to_csv\n\n\ndef t"
},
{
"path": "composeml/tests/test_label_transforms/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "composeml/tests/test_label_transforms/test_bin.py",
"chars": 1707,
"preview": "import pandas as pd\nfrom pytest import raises\n\n\ndef test_bins(labels):\n given_labels = labels.bin(2)\n transform = "
},
{
"path": "composeml/tests/test_label_transforms/test_lead.py",
"chars": 525,
"preview": "import pandas as pd\n\n\ndef test_lead(labels):\n labels = labels.apply_lead(\"10min\")\n transform = labels.transforms[0"
},
{
"path": "composeml/tests/test_label_transforms/test_sample.py",
"chars": 3725,
"preview": "import pytest\n\nfrom composeml import LabelTimes\nfrom composeml.tests.utils import read_csv, to_csv\n\n\n@pytest.fixture\ndef"
},
{
"path": "composeml/tests/test_label_transforms/test_threshold.py",
"chars": 625,
"preview": "from pytest import raises\n\n\ndef test_threshold(labels):\n labels = labels.threshold(200)\n transform = labels.transf"
},
{
"path": "composeml/tests/test_version.py",
"chars": 91,
"preview": "from composeml import __version__\n\n\ndef test_version():\n assert __version__ == \"0.10.1\"\n"
},
{
"path": "composeml/tests/utils.py",
"chars": 641,
"preview": "from io import StringIO\n\nimport pandas as pd\n\n\ndef read_csv(data, **kwargs):\n \"\"\"Helper function for creating a dataf"
},
{
"path": "composeml/update_checker.py",
"chars": 255,
"preview": "from pkg_resources import iter_entry_points\n\nfor entry_point in iter_entry_points(\"alteryx_open_src_initialize\"):\n tr"
},
{
"path": "composeml/version.py",
"chars": 23,
"preview": "__version__ = \"0.10.1\"\n"
},
{
"path": "contributing.md",
"chars": 5501,
"preview": "# Contributing to Compose\n\n:+1::tada: First off, thank you for taking the time to contribute! :tada::+1:\n\nWhether you ar"
},
{
"path": "docs/Makefile",
"chars": 585,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHI"
},
{
"path": "docs/make.bat",
"chars": 756,
"preview": "@ECHO OFF\n\npushd %~dp0\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=sphinx-bu"
},
{
"path": "docs/source/_static/style.css",
"chars": 783,
"preview": ".footer {\n background-color: #0D2345;\n padding-bottom: 40px;\n padding-top: 40px;\n width: 100%;\n}\n\n.footer-ce"
},
{
"path": "docs/source/_templates/class.rst",
"chars": 399,
"preview": "{{ fullname | escape | underline}}\n\n.. currentmodule:: {{ module }}\n\n.. autoclass:: {{ objname }}\n\n {% block methods %"
},
{
"path": "docs/source/_templates/layout.html",
"chars": 2829,
"preview": "{% extends \"!layout.html\" %}\n\n{%- block extrahead %}\n\n<script>\n !function () {\n var analytics = window.analytics = w"
},
{
"path": "docs/source/api_reference.rst",
"chars": 799,
"preview": ".. currentmodule:: composeml\n\n=============\nAPI Reference\n=============\n\nLabel Maker\n===========\n\n.. autosummary::\n :"
},
{
"path": "docs/source/conf.py",
"chars": 6860,
"preview": "# -*- coding: utf-8 -*-\n#\n# Configuration file for the Sphinx documentation builder.\n#\n# This file does only contain a s"
},
{
"path": "docs/source/examples/demo/__init__.py",
"chars": 93,
"preview": "import os\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\nPWD = os.path.dirname(__file__)\n"
},
{
"path": "docs/source/examples/demo/chicago_bike/__init__.py",
"chars": 306,
"preview": "from demo import PWD\nfrom pandas import read_csv\nfrom os.path import join\n\nPWD = join(PWD, \"chicago_bike\")\n\n\ndef _read(f"
},
{
"path": "docs/source/examples/demo/chicago_bike/sample.csv",
"chars": 97265,
"preview": "trip_id,gender,starttime,stoptime,tripduration,temperature,events,from_station_id,dpcapacity_start,to_station_id,dpcapac"
},
{
"path": "docs/source/examples/demo/next_purchase/__init__.py",
"chars": 2088,
"preview": "import os\nimport pandas as pd\nimport requests\nimport tarfile\nfrom demo import PWD, utils\nfrom tqdm import tqdm\n\nURL = r\""
},
{
"path": "docs/source/examples/demo/next_purchase/sample.csv",
"chars": 22071,
"preview": "id,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,department,user_id,order_time\n24,"
},
{
"path": "docs/source/examples/demo/turbofan_degredation/__init__.py",
"chars": 932,
"preview": "import os\nimport pandas as pd\nfrom demo import utils\n\nURL = r\"https://ti.arc.nasa.gov/c/6/\"\nPWD = os.path.dirname(__file"
},
{
"path": "docs/source/examples/demo/turbofan_degredation/sample.csv",
"chars": 80788,
"preview": "id,engine_no,time_in_cycles,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,senso"
},
{
"path": "docs/source/examples/demo/utils.py",
"chars": 1493,
"preview": "import os\nimport tarfile\nfrom zipfile import ZipFile\n\nimport requests\nfrom tqdm import tqdm\n\n\ndef download(url, output=\""
},
{
"path": "docs/source/examples/predict_bike_trips.ipynb",
"chars": 15528,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Predict Bike Trips\\n\",\n \"\\n\",\n"
},
{
"path": "docs/source/examples/predict_next_purchase.ipynb",
"chars": 16450,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Predict Next Purchase\\n\",\n \"\\n"
},
{
"path": "docs/source/examples/predict_turbofan_degredation.ipynb",
"chars": 17262,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {\n \"raw_mimetype\": \"text/restructuredtext\"\n },\n \"sou"
},
{
"path": "docs/source/images/innovation_labs.xml",
"chars": 59155,
"preview": "<mxfile host=\"Electron\" modified=\"2020-08-28T18:11:21.195Z\" agent=\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) Apple"
},
{
"path": "docs/source/images/label-maker.xml",
"chars": 2402,
"preview": "<mxfile modified=\"2019-07-02T21:30:54.163Z\" host=\"www.draw.io\" agent=\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb"
},
{
"path": "docs/source/images/labeling-function.xml",
"chars": 89882,
"preview": "<mxfile modified=\"2019-07-02T21:15:56.017Z\" host=\"www.draw.io\" agent=\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb"
},
{
"path": "docs/source/images/workflow.xml",
"chars": 128987,
"preview": "<mxfile modified=\"2020-07-16T17:20:08.201Z\" host=\"Electron\" agent=\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) Apple"
},
{
"path": "docs/source/index.rst",
"chars": 2017,
"preview": "================\nWhat is Compose?\n================\n\n.. toctree::\n :hidden:\n :maxdepth: 1\n\n install\n start\n "
},
{
"path": "docs/source/install.md",
"chars": 1329,
"preview": "# Install\n\nCompose is available for Python 3.8, 3.9, 3.10, and 3.11. It can be installed from [PyPI](https://pypi.org/pr"
},
{
"path": "docs/source/release_notes.rst",
"chars": 10288,
"preview": "Release Notes\n-------------\n\nFuture Release\n==============\n * Enhancements\n * Fixes\n * Changes\n * Remove"
},
{
"path": "docs/source/resources/faq.ipynb",
"chars": 5408,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# FAQ\\n\",\n \"\\n\",\n \"## I have "
},
{
"path": "docs/source/resources/help.rst",
"chars": 1695,
"preview": "====\nHelp\n====\n\nCouldn't find what you were looking for? The Alteryx open source community is happy to provide support t"
},
{
"path": "docs/source/resources.rst",
"chars": 166,
"preview": "=========\nResources\n=========\n\nFrequently asked questions and additional resources\n\n.. toctree::\n :glob:\n :maxdept"
},
{
"path": "docs/source/start.ipynb",
"chars": 6431,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"raw\",\n \"metadata\": {\n \"raw_mimetype\": \"text/restructuredtext\"\n },\n \"source\":"
},
{
"path": "docs/source/tutorials.rst",
"chars": 173,
"preview": "=========\nTutorials\n=========\n\nUse these tutorial to learn how to use Compose for building AutoML applications.\n\n.. toct"
},
{
"path": "docs/source/user_guide/controlling_cutoff_times.ipynb",
"chars": 6546,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"fcfef470\",\n \"metadata\": {},\n \"source\": [\n \"# Controlling "
},
{
"path": "docs/source/user_guide/data_slice_generator.ipynb",
"chars": 14261,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"raw\",\n \"metadata\": {\n \"raw_mimetype\": \"text/restructuredtext\"\n },\n \"source\":"
},
{
"path": "docs/source/user_guide/using_label_transforms.ipynb",
"chars": 10464,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"raw\",\n \"metadata\": {\n \"raw_mimetype\": \"text/restructuredtext\"\n },\n \"source\":"
},
{
"path": "docs/source/user_guide.rst",
"chars": 292,
"preview": "==========\nUser Guide\n==========\n\nUse these guides to learn how to use label transformations and generate better trainin"
},
{
"path": "pyproject.toml",
"chars": 3643,
"preview": "[project]\nname = \"composeml\"\nreadme = \"README.md\"\ndescription = \"a framework for automated prediction engineering\"\ndynam"
},
{
"path": "release.md",
"chars": 1113,
"preview": "# Release Process\n## Prerequisites\nThe environment variables `PYPI_USERNAME` and `PYPI_PASSWORD` must be already set in "
}
]